Vocabulary as a central concept in Information Science

Preprint of paper published as "Vocabulary as a Central Concept in Library and Information Science" in: Digital Libraries: Interdisciplinary Concepts, Challenges, and Opportunities. Proceedings of the Third International Conference on Conceptions of Library and Information Science (CoLIS3, Dubrovnik, Croatia, 23-26 May 1999). Ed. by T. Arpanac et al. Zagreb: Lokve, pp 3-12. ISBN 953-6003-37-6.

VOCABULARY AS A CENTRAL CONCEPT
IN LIBRARY AND INFORMATION SCIENCE

Michael Buckland,
School of Information Management & Systems,
University of California,
Berkeley, CA 94720-4600, USA


Abstract: The nature and role of vocabulary in information systems is examined. "Vocabulary" commonly refers to the stylized adaptation of natural language to form indexes and thesauri. Much of bibliographic access, filtering, and information retrieval can be viewed as matching or translating across vocabularies. Multiple vocabularies are simultaneously present. A simple query in an online catalog normally involves at least five distinct vocabularies: those of the authors; the cataloger; the syndetic structure; the searcher; and the formulated query.

Vocabulary can be defined as the range (or repertoire) of values in any field of bibliographic description and, in a more extended sense, the range of types in a set at any level (word, field, collection, and library). Digital libraries can be represented by a simple recursive model composed of sets ("collections") and two kinds of operation on the sets.

Vocabulary problems are central to the economics of digital libraries because unfamiliar vocabulary reduces search effectiveness. Issues of identity are central to Library and Information Science because of the indexical role of vocabulary. Vocabulary is a central component in digital libraries. Problems inherent in vocabulary help explain the nature and history of conceptions of Library and Information Science.

Introduction

Vocabulary is not usually regarded as a central feature of libraries, digital or otherwise. Library discourse is more often concerned with collections, budgets, staffing, buildings, users, management, and other practical topics. But in this paper we take a closer look at what can be said about "vocabulary" in libraries, especially digital libraries.

"Vocabulary" is a regular, respectable word in everyday discourse, but it has had an unsatisfactory position in Library and Information Science. Here it seems to be somewhat alien, something exotic, that has wandered away from its ordinary habitat into another environment. It seems, somehow, an immigrant, useful, but somehow suspect. When used in Library and Information Science, "vocabulary" is commonly and awkwardly qualified. One speaks of "natural language vocabulary" and of a "controlled vocabulary." These phrases imply possibilities of improper "unnatural language vocabulary" and, perhaps, dangerous "uncontrolled vocabulary." It is as if the concept of vocabulary has been is only half accepted in our field. Suppose that we were adopt it and naturalize it. What would we find to say about it?

I will make three large claims concerning vocabulary. You may find my conclusions obvious, or trivial, or in error. Perhaps you will agree with my conclusions, but think that I am extending the meaning of the word "vocabulary" too far. If so, we should separate discussion of the phenomena being considered from the quite different discussion concerning the proper use of the word "vocabulary."

1. An Economic Claim. Vocabulary is central to the cost-effectiveness of digital libraries, and, therefore, to returns on investment. There is a massive investment world-wide in making repositories accessible over networks and also a major investment in providing indexing, categorizing, and other metadata. A situation of increasing difficulty arises for users of the repositories because the number and proportion of network-accessible repositories with unfamiliar metadata vocabularies is increasing. Decreasing effectiveness in selection is the predictable result. (We use "selection" as a general term to include searching, filtering and retrieval.) Therefore, any technique that can assist in the use of unfamiliar metadata, either by making that unfamiliar metadata more familiar or by mitigating the consequences of it being unfamiliar, could provide enormous leverage in improving the rate of return on the enormous investments that have been made in establishing repositories and their metadata. (That is the economic rationale behind our project "Search Support for Unfamiliar Metadata Vocabularies." www.sims.berkeley.edu/research/metadata/ )

2. Issues of Identity are Central to Information Science. In a world in which the politics of identity is central, issues of identity are also central to information science, and they are so for reasons relating to the role of vocabulary.

3. Vocabulary is a Central Component in Digital Libraries. The claim here is, firstly, that all filtering and retrieval systems can be modeled in terms of a series of transformations of sets (or "collections") from one state to another and, secondly, that "vocabulary" is an appropriate term for the variety or range of values in any given set (or collection).

Some Examples

Consider the following Library of Congress Subject Heading "God - Knowableness - History of doctrines - Early church, ca. 30-600" assigned on LCCN 8005064. This is hardly natural English. When read, it sounds like a telegram. It has a syntactical structure that is the reverse of ordinary English, where qualifying adjectives and phrases ordinarily precede what they qualify. It sounds more like normal English if you read the words in the reverse order. The heading is correctly formed, but, one might say, rather unnatural. Someone interested in the history of understanding the Almighty is unlikely, in my view, to think of "God - Knowableness."

Another example is from an information system which utilizes a specialized vocabulary for classification and search purposes: the Census Bureau's U.S. Imports and Exports numeric data, issued on CD-ROM and accessible at http://govinfo.kerr.orst.edu/impexp.html. These data have wide-ranging importance for strategic policy decisions in government and industry. Yet someone interested in the automobile industry who does a commodity search using the term "automobile" will find nothing. A search on "cars" will lead to "Railway or Tramway Stock". The data are there, but under "Passenger Motor Vehicles, Spark Ignition Engine."

Rockets are increasingly used in military hostilities. How many does the U.S. export each year? A search of the exports data using the word "Rockets" yields the commodity category "Bearings, Transmission, Gaskets, Misc." Restricting the search to the singular form "rocket" yields an additional three categories:

Photographic or Cinematographic Goods

Engines, Parts, Etc.

Arms and Ammunition, Parts and Accessories Thereof

The last of these specifically concerns military weapons exports from the United States. The word "rocket" is found only in the category MISSILE & ROCKET LAUNCHERS AND SIMILAR PROJECTORS (9301009050), not in the larger export category GUIDED MISSILES (9306900020). The general heading category for this section is: BOMBS, GRENADES, ETC (9306). Clearly researchers who wish to use this database need a tool which will bridge the gap between common terminology and the highly specialized classification scheme which has evolved for categorizing these data.
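One very simple form such a bridging tool could take is an "entry vocabulary" that maps everyday search words to the category names reported above. The following Python sketch is only an illustration: the table is hand-built from the examples in this section, the name `suggest_categories` and the plural-stripping rule are assumptions made for the example, and nothing here describes the Census Bureau's own software.

```python
# Hand-built entry vocabulary: everyday search word -> commodity categories.
# Entries come only from the examples discussed in the text; a real tool
# would have to derive such a table from the data and its usage.
ENTRY_VOCABULARY = {
    "automobile": ["Passenger Motor Vehicles, Spark Ignition Engine"],
    "cars": ["Passenger Motor Vehicles, Spark Ignition Engine"],
    "rocket": [
        "MISSILE & ROCKET LAUNCHERS AND SIMILAR PROJECTORS (9301009050)",
        "GUIDED MISSILES (9306900020)",
    ],
}


def suggest_categories(word):
    """Return the categories suggested for an everyday search word."""
    key = word.strip().lower()
    if key in ENTRY_VOCABULARY:
        return ENTRY_VOCABULARY[key]
    # Naive plural handling (an assumption): retry without a final "s".
    if key.endswith("s") and key[:-1] in ENTRY_VOCABULARY:
        return ENTRY_VOCABULARY[key[:-1]]
    return []


print(suggest_categories("Automobile"))
print(suggest_categories("Rockets"))
print(suggest_categories("bicycle"))  # not in the table: empty list
```

The table does the work here, not the code: the difficulty lies in knowing which everyday words should lead to which categories.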

Sometimes the effects are subtle as well as unexpected, as in these two exact subject searches in MELVYL, the University of California's online library catalog:

FIND XSU VIETNAM WAR
Search result: 0 records
FIND XSU VIETNAMESE CONFLICT
Search result: 4,190 records

Vocabulary problems are increased when dealing with foreign languages. An example is a search concerning aerial photography by the German army during the First World War, specifically "Drachenphotographie." "Drachen" is the German word for a kite and searching led to the technical literature of that period on military aerial photography using kites. What was found, however, was irrelevant, because in 1892 the Germans had developed a tethered observation balloon that was aerodynamic. Being aerodynamic, it was, in a sense, like a kite and was known as a kite-balloon, "Drachenfesselballon"-- or, "Drachen," for short. So "Drachenphotographie," in this specialized context, referred not to photography from a kite, but from an observation balloon. Even knowing that, the searcher would have been more effective if he had known that, in English, the term "Aerostat" was used as a technical term for observation balloons.

It is obvious that different languages, such as English, Chinese, and German, use different words. It is also the case that, within any given language, different domains use differing vocabularies. These differences are often more extensive than expected. Consider the subject headings assigned to documents concerned with "coastal pollution" in the Library of Congress Subject Headings and in Medical Subject Headings (MeSH). "Coastal" and "Pollution" as subject keywords yielded no results, but records were found by searching for these two words in titles. The subject headings that had been assigned were, in ranked order:

LCSH: Marine pollution; Coastal zone management; Water -- Pollution; Petroleum industry and trade; Beach erosion; Coasts; Barrier islands; Coastal changes; etc.

MeSH: Seawater; Water pollution; Bacteria; Water microbiology; Air pollution; Environmental monitoring; Bathing beaches; Environmental pollution; etc.

Note the variety and how little the two lists have in common. The subject headings used are plausible, but who could be expected to imagine more than a few of them? It is easier to recognize pertinent terms than to predict what they will be. In this case, at least three different vocabularies were simultaneously in play: LCSH, MeSH, and mine.
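How little the two lists have in common can be checked mechanically. The Python sketch below compares only the headings quoted above (both lists end in "etc.", so they are incomplete), first as exact character strings and then after a simple normalization. The `normalize` function is an assumption made for this illustration, not a rule of either scheme.

```python
import re

# Headings exactly as listed above (both lists in the text end with "etc.").
LCSH = [
    "Marine pollution", "Coastal zone management", "Water -- Pollution",
    "Petroleum industry and trade", "Beach erosion", "Coasts",
    "Barrier islands", "Coastal changes",
]
MESH = [
    "Seawater", "Water pollution", "Bacteria", "Water microbiology",
    "Air pollution", "Environmental monitoring", "Bathing beaches",
    "Environmental pollution",
]


def normalize(heading):
    """Lowercase a heading and drop subdivision dashes and extra spaces."""
    words = re.sub(r"-+", " ", heading.lower()).split()
    return " ".join(words)


exact_overlap = set(LCSH) & set(MESH)
normalized_overlap = {normalize(h) for h in LCSH} & {normalize(h) for h in MESH}

print(exact_overlap)       # empty: no heading matches character for character
print(normalized_overlap)  # only "water pollution" survives normalization
```

Even the one shared heading is found only after its form has been adjusted, which illustrates how matching depends on the form of terms as well as on their meaning.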

Not only is the use of vocabulary in subject indexing a language activity, but language use is unavoidably culturally-based. The use of vocabulary to denote something is socially situated.

Providing Some Remedies

The Dewey Decimal Classification notation "330" denoting Economics is a kind of word with an odd appearance. Recognition that indexing and classification is a language activity is not new. Metadata systems used to be referred to as "documentary languages" or "metalanguages." Dewey numbers are meaningful, if you are familiar with the numbers. But the meanings are more or less opaque until familiarity is developed through usage. What one needs is a translation from our own terms to the Dewey Decimal Classification's terms, an English to "Dewey" dictionary.

Melvil Dewey provided this in the form of his Relativ Index, which we tend to take for granted as an appendix to the classification. Dewey himself considered it the most important part of his system (Olding 1966, 82-83).

"This alfabetic Index, the most important feature of the sistem, consists of headings gatherd from a great variety of sources, as uzers of the sistem hav found them desirabl.... The Index givs similar or sinonimus words,... so any intelijent person wil surely get the ryt number... The Relativ Index, with its cachwords... insures that books on same faze of any subject cuming before the clasifyers shal be assynd to same place, and that any reader seeking these books shal be referd instantly to that place." (Olding 1966, 89-91).

"The Subject Index of this sistem is a skeleton dictionary catalog, covering everything not coverd by the `name catalog.' Insted of giving book titles under each hed, the number refers to all those titles simply and directly.... We therefore unite advantajes of dictionary and clast catalogs, not by mingling them and so losing much of the simplicity of one and as much of excelence of the other, but by realy uzing both, each with its own merits. Only one set of titles is needed, for our clas numbers make this availabl for both catalogs." (Olding 1966, 104)

Dewey's use of reformed spelling reminds us that "natural" language is based on conventions, and that his own text would resist retrieval by "full-text" searching, which depends upon conventions in word usage.

Multiple Vocabularies

All selection systems involve multiple vocabularies. Even in the most primitive case, where unedited texts are searched with unedited queries, there are at least two vocabularies:

1. The vocabulary (or vocabularies) of the author(s) of the documents searched; and

2. The vocabulary of the searcher.

In operational selection systems the number of vocabularies is likely to be much larger. An online library catalog, for example, would ordinarily involve five: the two already noted, plus three more:

3. The vocabulary of the cataloger, used in creating representations of the documents, modifies, replaces, and / or supplements the author's vocabulary;

4. "See," "See also," and other syndetic structure modify, replace, or supplement the cataloger's vocabulary; and

5. The vocabulary of the searcher as formulated as a search query.

Precisely because there is a multiplicity of vocabularies, there is always a possibility of mismatch in any transition between vocabularies, a dissonance in meaning. If the searcher asks for A and the author wrote B, they might be expressing the same meaning in different ways (synonyms), or they might both write A and be meaning different things (homographs).

Intermediate vocabularies (of the cataloger, the syndetic structure, the formulated query) could be regarded as intended to normalize term usage so that any discrepancies are rectified. The cataloger's subject heading rectifies the author's title by representing the topic of the document in a standardized vocabulary. An experienced searcher knows how to modify their own or others' statement of need in terms the system will respond to usefully.

There are as many re-representations in vocabulary as there are transitions between vocabularies. Each re-representation provides an opportunity to rectify dissonances between selector and document, but it can also create dissonances. In the example of "Vietnamese Conflict" the cataloger's vocabulary is at variance with both authors' and searchers', thereby creating a dissonance and a problem that requires an additional transitional vocabulary if it is to be rectified. In this case a cross-reference: VIETNAM WAR use VIETNAMESE CONFLICT would rectify the discrepancy. Alternatively, a good search intermediary (human or computerized) might know enough to prompt a change of terminology to adapt to the system's vocabulary.
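The effect of such a cross-reference can be sketched in a few lines of Python. The sketch is a toy under stated assumptions: the names `USE_REFERENCES`, `SUBJECT_INDEX`, and `find_subject` are hypothetical, the index holds only the one heading and record count reported for the MELVYL searches above, and no claim is made about how MELVYL itself worked.

```python
# "Use" references: non-preferred term -> the heading the catalog uses.
# The single entry is the cross-reference proposed in the text.
USE_REFERENCES = {
    "VIETNAM WAR": "VIETNAMESE CONFLICT",
}

# Toy subject index holding the record count reported for the search above.
SUBJECT_INDEX = {
    "VIETNAMESE CONFLICT": 4190,
}


def find_subject(term):
    """Return (heading searched, record count), following a use reference."""
    heading = term.strip().upper()
    heading = USE_REFERENCES.get(heading, heading)
    return heading, SUBJECT_INDEX.get(heading, 0)


print(find_subject("Vietnam War"))          # redirected to the catalog's heading
print(find_subject("Vietnamese Conflict"))  # already the catalog's heading
```

The cross-reference table is itself one more vocabulary: it rectifies this particular dissonance only because someone anticipated the searcher's term.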

Yet just as each transition between vocabularies provides an opportunity to rectify or create dissonances, ambiguity, which diffuses the meaning by representing one meaning as two or more, is also a likely outcome. The opportunities for introducing ambiguity can be multiplicative, since every additional transition to a new vocabulary compounds the opportunities for ambiguity and error.

We have spoken of multiple vocabularies as if each were straightforward and homogeneous, but any vocabulary is likely to be internally inconsistent, whether through the variation in an individual's usage or because a multiplicity of actors is involved, each with his or her personal usage (synchronous variation). Nor should we assume that any individual or group is consistent in his or her use of vocabulary over time in a changing world (diachronous variation) or that change over time is consistent among all parties. Vocabulary is used to indicate meaning, but language is unstable, dynamic. Words are expansive in the sense that they can take on new meanings. Differences, ambiguities, and uncertainties of meaning, once recognized, may be resolved through exploration of the other's meaning. Dialog brings out differences. Differences generate change. In this sense, words have lives of their own, taking on new meanings regardless of what lexicographers may have written. Further, meanings are complex: There is both denotation, what a word refers to, and connotation, the indirect nuances and associations that color the meaning perceived.

Form and Meaning

We have referred above to an undefined "dissonance" between vocabularies. The use of digital technology leads us to think of vocabulary in terms of character strings that can be manipulated, but the meanings of words are constructed subjectively and situationally and the use of vocabulary is social.

There is a duality between the form and the meaning of words: The same form of word can have different meanings in different vocabularies and the same meaning may be expressed by different forms of words. Form matters in information systems because technology operates on the physical characteristics, not the significance, of words. Software operates on character strings, on lexical entities, not on concepts. When the word "concept" is used in relation to computer-based retrieval, its use is metaphorical, ordinarily referring to some algorithmic relationship between lexical entities.

Minimally the relationship between any pair of terms in just two vocabularies involves four contingencies:

Same form, same meaning: Same word.
Same form, different meaning: Homograph.
Different form, same meaning: Synonym.
Different form, different meaning: Different word.

Here "same" means, in practice, equally acceptable for the purpose at hand, rather than identity in any strict sense. Perfect synonyms are rare in the English language. Usually there are some differences in meaning.
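The four contingencies can be written out as a small Python function. This is a sketch only: the `Term` class and the `relation` function are hypothetical names, and the `meaning` field is a label supplied by a person. The software compares two strings in each case and has no access to meaning as such.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    form: str     # the character string
    meaning: str  # a human-assigned label standing in for the intended sense


def relation(a, b):
    """Classify a pair of terms by the four contingencies listed above."""
    same_form = a.form == b.form
    same_meaning = a.meaning == b.meaning
    if same_form and same_meaning:
        return "same word"
    if same_form:
        return "homograph"
    if same_meaning:
        return "synonym"
    return "different word"


# Examples drawn from earlier in the paper; the meaning labels are ours.
kite = Term("Drachen", "kite")
balloon = Term("Drachen", "observation balloon")
searcher = Term("VIETNAM WAR", "the war in Vietnam")
cataloger = Term("VIETNAMESE CONFLICT", "the war in Vietnam")

print(relation(kite, balloon))        # same form, different meaning
print(relation(searcher, cataloger))  # different form, same meaning
```

Treating two meaning labels as equal corresponds to the judgment that two terms are "equally acceptable for the purpose at hand"; that judgment is made outside the program.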

The Exchange of Meaning and Issues of Identity

Transitions from one vocabulary to another without loss of meaning depend on tight constraints. A formal condition for transformation without loss of meaning is that there would be an unambiguous and reversible equivalence in each transformation. This is unlikely except where there is an entirely closed system in which all forms of expression are unambiguously defined a priori. This condition will not obtain whenever human beings are involved, so the presence of multiple vocabularies in selection systems reflects a multiplicity of possible meanings which exists in any exchange of meaning. Selection systems function as dialogical interactions between vocabularies that operate in at least two directions: The vocabulary of the searcher is challenged by limits, by constraints in the vocabulary of the selection system; and the vocabulary of documents is also modified, "controlled." Both are adapted towards convergence for the purpose of achieving correspondence in meaning. But also there is an expansion of vocabularies, as searcher, cataloger, and cross-reference maker continuously adapt to novel challenges, changing circumstances, and novel vocabulary. (Berman's Prejudices and Antipathies provides plentiful suggestions for changes to the Library of Congress Subject Headings (Berman 1971)).

Issues of identity are central to social life. In the classic definition of Edward Tylor: "Culture or civilization, taken in its wide ethnographic sense, is that complex whole which includes knowledge, belief, art, morals, law, custom and any other capabilities and habits acquired by man as a member of society." (Tylor 1871).

Identity is defined by difference. Nothing can be satisfactorily defined in terms of itself. "I am me," signifies that I am not you. I look more like him than like her. I agree with this person rather than with that person. I prefer the beliefs, art, morals, law, custom, and habits of this group, not those of that group.

Categorization and vocabulary to represent categories are essential to this kind of differentiation. This process of differentiation and categorization is the essence of the social world and it is the essence of selection systems. To play a role, as librarians, teachers and others involved in the transmission of culture do, in the shaping of indexical relationships, including distinguishing and characterizing between this and that, between us and them, is to make Library and Information Science a part of processes and issues that are central to society. This is an additional reason why transitions ("mapping") between vocabularies are, or should be, a central concern in Library and Information Science.

Defining Vocabulary in LIS

If we think that the concept of vocabulary is of any importance in Library and Information Science, as I have claimed, then maybe we should treat "vocabulary" as more than a loan word from language studies. If we are to incorporate the concept of vocabulary into Library and Information Science, then we should seek a definition, or at least an understanding of what we mean by "vocabulary" that fits the domain of Library and Information Science in an intellectually satisfying manner. This conference is concerned with digital libraries. What would be an effective definition of "vocabulary" in the context of digital libraries? To answer this question we consider first the ordinary meanings of the word and then the nature of digital libraries.

The Oxford English Dictionary (1989, vol 19, 721) provides four definitions of "Vocabulary."

1. A collection or list of words with brief explanations of their meanings.
2. The range of language of a particular person, class, profession, or the like.
3. The sum or aggregate of words composing a language; and
4. Figuratively, a set of artistic or stylistic forms, techniques, movements, etc., available to a particular person, etc.

The underlying notion is that "vocabulary" denotes an enumeration of the different expressions of meaning, the repertoire of representational forms. In linguistics the words "type" and "token" are used, where every instance of a word is a "token," and each different kind of word is a "type." The range or repertoire of different types could be called the vocabulary. So using "vocabulary" for the range or repertoire of index terms or subject headings would be appropriate.
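The distinction between tokens and types is easy to show in Python. In this sketch the headings are taken from the LCSH example earlier in the paper, but the number of times each occurs is invented purely for illustration.

```python
from collections import Counter

# Tokens: every occurrence of a subject heading across some records.
# The headings come from the LCSH example; the occurrences are invented.
tokens = [
    "Marine pollution", "Coasts", "Marine pollution",
    "Beach erosion", "Coasts", "Marine pollution",
]

counts = Counter(tokens)     # how many tokens there are of each type
vocabulary = sorted(counts)  # the types: the range or repertoire

print(len(tokens), "tokens")
print(len(vocabulary), "types:", vocabulary)
```

Six tokens yield a vocabulary of three types. The same computation applies to the values found in any field, whether they are words, subject headings, or class numbers.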

But in Library and Information Science the terms used to express meaning are often either rather unnatural adaptations of natural language (e.g. God -- Knowableness...) or use an artificial notation, such as 330 to denote Economics. Indeed such systems for the representation of meaning are a specialty of the field. The use of such descriptive systems is a kind of language activity and they have long been referred to as "documentary languages" or "metalanguages," meaning, if you will, languages of metadata. It would, therefore, be very appropriate in Library and Information Science to extend the use of "vocabulary" to refer to the range or repertoire of allowed terms in a thesaurus, of numbers used in a classification scheme, and of codes in any categorization. In the specialized context of digital libraries, there seems no reason not to use "vocabulary" as a technical term to denote the range or repertoire of any MARC field or other metadata field.

Treating "vocabulary" as a technical term in Library and Information Science to denote the range of any metadata field opens up additional possibilities, because the entire structure of digital library systems can be represented in terms of sets (or collections) and transitions from a set to a derived set. There appear to be only two kinds of transition:

1. Transformation, as when a set of catalog cards is derived from a set of books, or when a set of vectors is derived from digital texts; and
2. Partitioning (or (re-)ordering), as when cards are alphabetized or a subset of records is selected as a retrieved set.

So far as Christian Plaunt and I have been able to determine, all digital library structures, indeed all filtering and retrieval systems, can be modeled in this way by sequences of sets and transitions to (derived) sets. Digital library structures are hierarchies of sets: There are networks of repositories, each containing collections of documents, typically containing paragraphs composed of words made up from letters. Metadata are composed of fields, often containing sub-fields, and so on. The details of this model and the consequences of digital libraries having such a structure are discussed elsewhere (Buckland & Plaunt 1994; also Plaunt 1997, Buckland & Plaunt 1997). What is relevant here is that, if digital libraries can usefully be regarded in terms of sets, then, in defining our terms in Library and Information Science, "vocabulary" could well be defined as the range of any set. If it were, then "vocabulary" immediately becomes a central technical term in this field.
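The two kinds of transition, and vocabulary as the range of a set, can be illustrated with a toy example in Python. This is one informal reading of the description above, not the formal model set out in the works cited. All function names are hypothetical, and the "books" are invented placeholders that borrow two headings from the LCSH example.

```python
# Toy collection: each "book" is a (title, subject heading) pair.
# The titles are placeholders invented for this sketch.
books = [
    ("Title C", "Coasts"),
    ("Title A", "Marine pollution"),
    ("Title B", "Coasts"),
]


def transform(collection, derive):
    """Transformation: derive a new set, one derived item per original."""
    return [derive(item) for item in collection]


def partition(collection, keep):
    """Partitioning: select the subset of items that satisfy a condition."""
    return [item for item in collection if keep(item)]


def reorder(collection, key):
    """(Re-)ordering: the same items arranged in a different sequence."""
    return sorted(collection, key=key)


def vocabulary(collection, field):
    """Vocabulary: the range of distinct values of one field in a set."""
    return sorted({field(item) for item in collection})


# Books -> catalog records (transformation), then filing by title
# (re-ordering), then retrieval of a subset (partitioning).
records = transform(books, lambda b: {"title": b[0], "subject": b[1]})
filed = reorder(records, key=lambda r: r["title"])
retrieved = partition(filed, lambda r: r["subject"] == "Coasts")

print([r["title"] for r in retrieved])
print(vocabulary(records, lambda r: r["subject"]))
```

Each step takes a set and yields a derived set, so the steps can be chained in any sequence, and the vocabulary of any field can be computed at any stage.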

Summary and Conclusions

In information retrieval "vocabulary" usually refers to the stylized adaptation of natural language to form indexing terms. Closer examination reveals vocabulary as a powerful and pervasive notion, because digital libraries include a multiplicity of languages and, therefore, of vocabularies. Each transaction in the familiar library catalog involves at least five different vocabularies: authors', indexers', syndetic structure, searchers', and formulated queries. Indexing, whether with "natural" or artificial notation, is a describing activity and, therefore, a language activity. It is traditional and appropriate to refer to metadata systems as "documentary languages."

If we take the term "vocabulary" in its ordinary sense to denote that range or repertoire of different words used, then it would also seem reasonable to use it for the range of any kind of metadata, for example any MARC field. If a word is to be used as a technical term in any domain, for example in Library and Information Science, it had better have an agreed meaning in that domain. It would be reasonable and useful to use "vocabulary" to denote the range found in any set (or collection) of words, including all metadata, and this provides great generality when used in relation to the functional model we have proposed.

Mapping across vocabularies, from a term in one vocabulary to the corresponding term or terms in another, is increasingly needed as convenient access expands to more and more repositories and to additional, less familiar metadata.

Vocabulary is central to the economics of digital libraries because unfamiliar terminology impedes effective searching. Vocabulary is also important because it is central to issues of identity, which, in turn, are central to society. Vocabulary, if given a technical definition in Library and Information Science as the variety or range of values in a set, is a central feature in the structure and use of digital libraries.

There is another consideration. It is a simplification, but I suggest that the historical development of conceptions of Library and Information Science can be better understood if we think in terms of two different traditions, which I call a "document tradition" and a "formal tradition."

In the "formal tradition" I include all those techniques and technologies based on logic and algorithms: punch cards, digital computers, data-processing, computing, artificial intelligence, and historic traditions of information retrieval as reflected in meetings of ACM SIGIR. It is this formal tradition that has done so much to make our conference topic--digital libraries--possible. But this tradition depends on definitions and reliable procedures and is at odds with the variability of human language and of human behavior.

In the "document tradition" I would place the historic practices of document services, such as bibliography, librarianship, archives, and records management. In this tradition the concern has been with documents in the sense of signifying objects and their use in the service of multiple objectives: practical utility, education, recreation, literacy, and diverse social services. This tradition has a certain logic: It entails that professional practice extends to any kind of signifying object in any format, that it include (potentially) anything that helps knowledge, and an understanding that documents have to do with knowledge, meaning, learning, description, language, and ambiguity (Buckland 1997). It follows that no conception of Library and Information Science can be complete if it does not incorporate cultural studies, and that, ultimately, a mature, well-developed conception of Library and Information Science must necessarily have lively roots in the concerns of the humanities and qualitative social sciences.

The two traditions appear to be, ultimately, incompatible because they start from fundamentally different bases. Nevertheless, we cannot choose either one exclusively if we are to be both effective and practical. However, vocabulary is central to both traditions. Both must, in their differing ways, deal with issues of vocabulary, which provides a kind of meeting place. The topic of vocabulary, I conclude, is important for this conference because the nature and role of vocabulary is central to any credible conception of Library and Information Science.

Acknowledgments

The ideas in this paper draw heavily on the author's collaborations with Ron Day, with Christian Plaunt, and with co-workers in the Search Support for Unfamiliar Metadata Vocabularies project (DARPA Contract N66001-97-C-8541; AO# F477).

References

Berman, S. (1993). Prejudices and Antipathies: A Tract on the LC Subject Headings Concerning People. Jefferson, NC: McFarland. First published 1971 by Scarecrow Press.

Buckland, M. K. (1997). What is a "document"? Journal of the American Society for Information Science 48, 804-809. Reprinted in T. B. Hahn & M. K. Buckland, eds. (1998). Historical Studies in Information Science. Medford, NJ: Information Today, 215-220.

Buckland, M. K. and C. Plaunt. (1994). On the construction of selection systems. Library Hi Tech, 48, 15-28.

Buckland, M. K. and C. Plaunt. (1997). Selecting Libraries, Selecting Documents, Selecting Data. In: Proceedings of the International Symposium on Research, Development & Practice in Digital Libraries 1997, ISDL 97, Nov. 18-21, 1997, Tsukuba, Japan, pp. 85-91. Tsukuba, Japan: University of Library and Information Science.

Norgard, B. A., M.G. Berger, M. K. Buckland, & C. Plaunt. (1993). The online catalog: From technical services to access service. Advances in Librarianship 17, 111-148.

Olding, R. K., ed. (1966). Readings in Library Cataloguing. Hamden, CT: Archon Press.

The Oxford English Dictionary. (1989). 2nd ed. Oxford: Clarendon Press.

Plaunt, C. (1997). A Functional Model of Information Retrieval Systems. Doctoral dissertation, University of California, Berkeley.
