VOCABULARY AS A CENTRAL CONCEPT
IN LIBRARY AND INFORMATION
SCIENCE
Michael Buckland

Abstract: The nature and role of vocabulary in information systems is examined. "Vocabulary"
commonly refers to the stylized adaptation of natural language to form indexes and thesauri.
Much of bibliographic access, filtering, and information retrieval can be viewed as matching or
translating across vocabularies. Multiple vocabularies are simultaneously present. A simple
query in an online catalog normally involves at least five distinct vocabularies: those of the
authors; the cataloger; the syndetic structure; the searcher; and the formulated query.

Vocabulary can be defined as the range (or repertoire) of values in any field of bibliographic
description and, in a more extended sense, the range of types in a set at any level (word, field,
collection, and library). Digital libraries can be represented by a simple recursive model
composed of sets ("collections") and two kinds of operation on the sets.

"Vocabulary" is a regular, respectable word in everyday discourse, but it has had an
unsatisfactory position in Library and Information Science. Here it seems to be somewhat alien,
something exotic, that has wandered away from its ordinary habitat into another environment.
It seems, somehow, an immigrant, useful, but somehow suspect. When used in Library and
Information Science, "vocabulary" is commonly and awkwardly qualified. One speaks of
"natural language vocabulary" and of a "controlled vocabulary." These phrases imply
possibilities of improper "unnatural language vocabulary" and, perhaps, dangerous "uncontrolled
vocabulary." It is as if the concept of vocabulary is only half accepted in our field.
Suppose that we were to adopt it and naturalize it. What would we find to say about it?

I will make three large claims concerning vocabulary. You may find my conclusions
obvious, or trivial, or in error. Perhaps you will agree with my conclusions, but think that I am
extending the meaning of the word "vocabulary" too far. If so, we should separate discussion
of the phenomena being considered from the quite different discussion concerning the proper
use of the word "vocabulary."

1. An Economic Claim. Vocabulary is central to the cost-effectiveness of digital libraries, and,
therefore, to returns on investment. There is a massive investment world-wide in making
repositories accessible over networks and also a major investment in providing indexing,
categorizing, and other metadata. A situation of increasing difficulty arises for users of the
repositories because the number and proportion of network-accessible repositories with
unfamiliar metadata vocabularies is increasing. Decreasing effectiveness in selection is the
predictable result. (We use "selection" as a general term to include searching, filtering and
retrieval.) Therefore, any technique that can assist in the use of unfamiliar metadata, either by
making that unfamiliar metadata more familiar or by mitigating the consequences of it being
unfamiliar, could provide enormous leverage in improving the rate of return on the enormous
investments that have been made in establishing repositories and their metadata. (That is the
economic rationale behind our project "Search Support for Unfamiliar Metadata Vocabularies."
www.sims.berkeley.edu/research/metadata/)

2. Issues of Identity are Central to Information Science. In a world in which the politics of
identity is central, issues of identity are also central to information science, and they are so for
reasons relating to the role of vocabulary.

3. Vocabulary is a Central Component in Digital Libraries. The claim here is, firstly, that all
filtering and retrieval systems can be modeled in terms of a series of transformations of sets (or
"collections") from one state to another and, secondly, that "vocabulary" is an appropriate term
for the variety or range of values in any given set (or collection).

Some Examples

Consider the following Library of Congress Subject Heading "God - Knowableness -
History of doctrines - Early church, ca. 30-600" assigned on LCCN 8005064. This is hardly
natural English. When read, it sounds like a telegram. It has a syntactical structure that is the
reverse of ordinary English, where qualifying adjectives and phrases ordinarily precede what
they qualify. It sounds more like normal English if you read the words in the reverse order. The
heading is correctly formed, but, one might say, rather unnatural. Someone interested in the
history of understanding the Almighty is unlikely, in my view, to think of "God -
Knowableness."

Another example is from an information system which utilizes a specialized vocabulary
for classification and search purposes: the Census Bureau's U.S. Imports and Exports numeric
data, issued on CD-ROM and accessible at http://govinfo.kerr.orst.edu/impexp.html. These data
have wide-ranging importance for strategic policy decisions in government and industry. Yet
someone interested in the automobile industry who does a commodity search using the term
"automobile" will find nothing. A search on "cars" will lead to "Railway or Tramway Stock".
Data are there, but under "Passenger Motor Vehicles, Spark Ignition Engine."

Rockets are increasingly used in military hostilities. How many does the U.S. export each
year? A search of the exports data using the word "Rockets" yields the commodity category
"Bearings, Transmission, Gaskets, Misc." Restricting the search to the singular form "rocket"
yields an additional three categories:

Photographic or Cinematographic Goods
Engines, Parts, Etc.
Arms and Ammunition, Parts and Accessories Thereof

The last of these specifically concerns military weapons exports from the United States.
The word "rocket" is found only in the category MISSILE & ROCKET LAUNCHERS AND
SIMILAR PROJECTORS (9301009050), not in the larger export category GUIDED MISSILES
(9306900020). The general heading category for this section is: BOMBS, GRENADES, ETC
(9306). Clearly researchers who wish to use this database need a tool which will bridge the
gap between common terminology and the highly specialized classification scheme which has
evolved for categorizing these data.

Sometimes the effects are subtle as well as unexpected, as in these two exact subject
searches in MELVYL, the University of California's online library catalog:

FIND XSU VIETNAM WAR

Vocabulary problems are increased when dealing with foreign languages. An example is
a search concerning aerial photography by the German army during the First World War,
specifically "Drachenphotographie." "Drachen" is the German word for a kite and searching led
to the technical literature of that period on military aerial photography using kites. What was
found, however, was irrelevant, because in 1892 the Germans had developed a tethered
observation balloon that was aerodynamic. Being aerodynamic, it was, in a sense, like a kite and
was known as a kite-balloon, "Drachenfesselballon"-- or, "Drachen," for short. So
"Drachenphotographie," in this specialized context, referred not to photography from a kite, but
from an observation balloon. Even knowing that, the searcher would have been more effective
if he had known that, in English, the term "Aerostat" was used as a technical term for observation
balloons.

It is obvious that different languages, such as English, Chinese, and German, use different
words. It is also the case that, within any given language, different domains use differing
vocabularies. These differences are often more extensive than expected. Consider the subject
headings assigned to documents concerned with "coastal pollution" in the Library of Congress
Subject Headings and in Medical Subject Headings (MeSH). "Coastal" and "Pollution" as subject
keywords yielded no results, but records were found by searching for these two words in titles.
The subject headings that had been assigned were, in ranked order:

LCSH: Marine pollution; Coastal zone management; Water -- Pollution; Petroleum
industry and trade; Beach erosion; Coasts; Barrier islands; Coastal changes; etc.

MeSH: Seawater; Water pollution; Bacteria; Water microbiology; Air pollution;
Environmental monitoring; Bathing beaches; Environmental pollution; etc.

Note the variety and how little the two lists have in common. The subject headings used
are plausible, but who could be expected to imagine more than a few of them? It is easier to
recognize pertinent terms than to predict what they will be. In this case, at least three different
vocabularies were simultaneously in play: LCSH, MeSH, and mine.

Not only is the use of vocabulary in subject indexing a language activity, but language use
is unavoidably culturally-based. The use of vocabulary to denote something is socially situated.

Providing Some Remedies

The Dewey Decimal Classification notation "330" denoting Economics is a kind of word
with an odd appearance. Recognition that indexing and classification is a language activity is
not new. Metadata systems used to be referred to as "documentary languages" or
"metalanguages." Dewey numbers are meaningful, if you are familiar with the numbers. But
the meanings are more or less opaque until familiarity is developed through usage. What one
needs is a translation from our own terms to the Dewey Decimal Classification's terms, an
English to "Dewey" dictionary.

Melvil Dewey provided this in the form of his Relativ Index, which we tend to take for
granted as an appendix to the classification. Dewey himself considered it the most important part
of his system (Olding 1996, pp. 82-83).

"This alfabetic Index, the most important feature of the sistem, consists of headings
gatherd from a great variety of sources, as uzers of the sistem hav found them desirabl....
The Index givs similar or sinonimus words,... so any intelijent person wil surely get the
ryt number... The Relativ Index, with its cachwords... insures that books on same faze
of any subject cuming before the clasifyers shal be assynd to same place, and that any
reader seeking these books shal be referd instantly to that place." (Olding 1966, 89-91).

"The Subject Index of this sistem is a skeleton dictionary catalog, covering everything not
coverd by the `name catalog.' Insted of giving book titles under each hed, the number
refers to all those titles simply and directly.... We therefore unite advantajes of dictionary
and clast catalogs, not by mingling them and so losing much of the simplicity of one and
as much of excelence of the other, but by realy uzing both, each with its own merits. Only
one set of titles is needed, for our clas numbers make this availabl for both catalogs."
(Olding 1966, 104)

Dewey's use of reformed spelling reminds us that "natural" language is based on
conventions and would resist retrieval based on "full-text" searching which depends upon
conventions in word usage.

Multiple Vocabularies

All selection systems involve multiple vocabularies. Even in the most primitive case,
where unedited texts are searched with unedited queries, there are at least two vocabularies:

1. The vocabulary (or vocabularies) of the author(s) of the documents searched; and
2. The vocabulary of the searcher.

In operational selection systems the number of vocabularies is likely to be much larger.
An online library catalog, for example, would ordinarily include an additional three: The two
already noted, plus:

3. The vocabulary of the cataloger, used in creating representations of the documents, modifies,
replaces, and/or supplements the author's vocabulary;
4. "See," "See also," and other syndetic structure modify, replace, or supplement the cataloger's
vocabulary; and
5. The vocabulary of the searcher as formulated as a search query.

Precisely because there is a multiplicity of vocabularies, there is always a possibility of
mismatch in any transition between vocabularies, a dissonance in meaning. If the searcher asks
for A and the author wrote B, they might be expressing the same meaning in different ways
(synonyms), or they might both write A and be meaning different things (homographs).

Intermediate vocabularies (of the cataloger, the syndetic structure, the formulated query)
could be regarded as intended to normalize term usage so that any discrepancies are rectified.
The cataloger's subject heading rectifies the author's title by representing the topic of the
document in a standardized vocabulary. An experienced searcher knows how to modify their own
or others' statement of need in terms the system will respond to usefully.

There are as many re-representations in vocabulary as there are transitions between
vocabularies. Each re-representation provides an opportunity to rectify dissonances between
selector and document, but it can also create dissonances. In the example of "Vietnamese
Conflict" the cataloger's vocabulary is at variance with both authors' and searchers', thereby
creating a dissonance and a problem that requires an additional transitional vocabulary
if it is to be rectified. In this case a cross-reference: VIETNAM WAR use VIETNAMESE
CONFLICT would rectify the discrepancy. Alternatively, a good search intermediary (human
or computerized) might know enough to prompt a change of terminology to adapt to the
system's vocabulary.

Yet just as each transition between vocabularies provides an opportunity to rectify or
create dissonances, ambiguity, which diffuses the meaning by representing one meaning as two
or more, is also a likely outcome. The opportunities for introducing ambiguity can be
multiplicative, since every additional transition to a new vocabulary compounds the
opportunities for ambiguity and error.

We have spoken of multiple vocabularies as if each were straightforward and
homogeneous, but any vocabulary is likely to be internally inconsistent, whether through the
variation in an individual's usage or because a multiplicity of actors is involved, each with his
or her personal usage (synchronous variation). Nor should we assume that any individual or
group is consistent in his or her use of vocabulary over time in a changing world (diachronous
variation) or that change over time is consistent among all parties. Vocabulary is used to
indicate meaning, but language is unstable, dynamic. Words are expansive in the sense that they can
take on new meanings. Differences, ambiguities, and uncertainties of meaning, once recognized,
may be resolved through exploration of the other's meaning. Dialog brings out differences.
Differences generate change. In this sense, words have lives of their own, taking on new
meanings regardless of what lexicographers may have written. Further, meanings are complex:
There is both denotation, what a word refers to, and connotation, the indirect nuances and
associations that color the meaning perceived.

Form and Meaning

We have referred above to an undefined "dissonance" between vocabularies. The use of
digital technology leads us to think of vocabulary in terms of character strings that can be
manipulated, but the meanings of words are constructed subjectively and situationally and the
use of vocabulary is social.

There is a duality between the form and the meaning of words: The same form of word
can have different meanings in different vocabularies and the same meaning may be expressed
by different forms of words. Form matters in information systems because technology
operates on the physical characteristics, not the significance, of words. Software operates on
character strings, on lexical entities, not on concepts. When the word "concept" is used in
relation to computer-based retrieval, its use is metaphorical, ordinarily referring to some
algorithmic relationship between lexical entities.

Minimally the relationship between any pair of terms in just two vocabularies involves
four contingencies:

Same form, same meaning: Same word.

Here "same" means, in practice, equally acceptable for the purpose at hand, rather than
identity in any strict sense. Perfect synonyms are rare in the English language. Usually there
are some differences in meaning.

The Exchange of Meaning and Issues of Identity

Transitions from one vocabulary to another without loss of meaning depend on tight
constraints. A formal condition for transformation without loss of meaning is that there would
be an unambiguous and reversible equivalence in each transformation. This is unlikely except
where there is an entirely closed system in which all forms of expression are unambiguously
defined a priori. This condition will not obtain whenever human beings are involved, so the
presence of multiple vocabularies in selection systems reflects a multiplicity of possible
meanings which exists in any exchange of meaning. Selection systems function as dialogical
interactions between vocabularies that operate in at least two directions: The vocabulary of the
searcher is challenged by limits, by constraints in the vocabulary of the selection system; and
the vocabulary of documents is also modified, "controlled." Both are adapted towards
convergence for the purpose of achieving correspondence in meaning. But also there is an
expansion of vocabularies, as searcher, cataloger, and cross-reference maker continuously adapt
to novel challenges, changing circumstances, and novel vocabulary. (Berman's Prejudices and
Antipathies provides plentiful suggestions for changes to the Library of Congress Subject
Headings (Berman 1971)).

Issues of identity are central to social life. In the classic definition of Edward Tylor:
"Culture or civilization, taken in its wide ethnographic sense, is that complex whole which
includes knowledge, belief, art, morals, law, custom and any other capabilities and habits
acquired by man as a member of society" (Tylor 1871).

Identity is defined by difference. Nothing can be satisfactorily defined in terms of itself.
"I am me," signifies that I am not you. I look more like him than like her. I agree with this
person rather than with that person. I prefer the beliefs, art, morals, law, custom, and habits of
this group, not those of that group.

Categorization and vocabulary to represent categories are essential to this kind of
differentiation. This process of differentiation and categorization is the essence of the social
world and it is the essence of selection systems. To play a role, as librarians, teachers and others
involved in the transmission of culture do, in the shaping of indexical relationships, including
distinguishing and characterizing between this and that, between us and them, is to make Library
and Information Science a part of processes and issues that are central to society. This is an
additional reason why transitions ("mapping") between vocabularies are, or should be, a central
concern in Library and Information Science.

Defining Vocabulary in LIS

If we think that the concept of vocabulary is of any importance in Library and Information
Science, as I have claimed, then maybe we should treat "vocabulary" as more than a loan word
from language studies. If we are to incorporate the concept of vocabulary into Library and
Information Science, then we should seek a definition, or at least an understanding of what we
mean by "vocabulary" that fits the domain of Library and Information Science in an
intellectually satisfying manner. This conference is concerned with digital libraries. What
would be an effective definition of "vocabulary" in the context of digital libraries? To answer
this question we consider first the ordinary meanings of the word and then the nature of digital
libraries.

The Oxford English Dictionary (1989, vol 19, 721) provides four definitions of
"Vocabulary."

1. A collection or list of words with brief explanations of their meanings.

The underlying notion is that "vocabulary" denotes an enumeration of the different
expressions of meaning, the repertoire of representational forms. In linguistics the words "type"
and "token" are used, where every instance of a word is a "token," and each different kind of
word is a "type." The range or repertoire of different types could be called the vocabulary. So
using "vocabulary" for the range or repertoire of index terms or subject headings would be
appropriate.

But in Library and Information Science the terms used to express meaning are often either
rather unnatural adaptations of natural language (e.g. God -- Knowableness...) or use an artificial
notation, such as 330 to denote Economics. Indeed such systems for the representation of
meaning are a specialty of the field. The use of such descriptive systems is a kind of language
activity and they have long been referred to as "documentary languages" or "metalanguages,"
meaning, if you will, languages of metadata. It would, therefore, be very appropriate in Library
and Information Science to extend the use of "vocabulary" to refer to the range or repertoire of
allowed terms in a thesaurus, of numbers used in a classification scheme, and of codes in any
categorization. In the specialized context of digital libraries, there seems no reason not to use
"vocabulary" as a technical term to denote the range or repertoire of any MARC field or other
metadata field.

Treating "vocabulary" as a technical term in Library and Information Science to denote the
range of any metadata field opens up additional possibilities, because the entire structure of
digital library systems can be represented in terms of sets (or collections) and transitions from a
set to a derived set. There appear to be only two kinds of transition:

1. Transformation, as when a set of catalog cards is derived from a set of books, or
when a set of vectors is derived from digital texts; and

So far as Christian Plaunt and I have been able to determine, all digital library structures,
indeed all filtering and retrieval systems, can be modeled in this way by sequences of sets and
transitions to (derived) sets. Digital library structures are hierarchies of sets: There are
networks of repositories, each containing collections of documents, typically containing
paragraphs composed of words made up from letters. Metadata are composed of fields, often
containing sub-fields, and so on. The details of this model and the consequences of digital
libraries having such a structure are discussed elsewhere (Buckland & Plaunt 1994; also Plaunt
1997, Buckland & Plaunt 1997). What is relevant here is that, if digital libraries can be usefully
regarded in terms of sets, then, in defining our terms in Library and Information Science,
"vocabulary" could well be defined as the range of any set. If it were, then "vocabulary"
would immediately become a central technical term in this field.

Summary and Conclusions

In information retrieval "vocabulary" usually refers to the stylized adaptation of natural
language to form indexing terms. Closer examination reveals vocabulary as a powerful and
pervasive notion, because digital libraries include a multiplicity of languages and, therefore, of
vocabularies. Each transaction in the familiar library catalog involves at least five different
vocabularies: authors', indexers', syndetic structure, searchers', and formulated queries.
Indexing, whether with "natural" or artificial notation, is a describing activity and, therefore, a
language activity. It is traditional and appropriate to refer to metadata systems as "documentary
languages."

If we take the term "vocabulary" in its ordinary sense to denote the range or repertoire
of different words used, then it would also seem reasonable to use it for the range of any kind
of metadata, for example any MARC field. If a word is to be used as a technical term in any
domain, for example in Library and Information Science, it had better have an agreed meaning
in that domain. It would be reasonable and useful to use "vocabulary" to denote the range found
in any set (or collection) of words, including all metadata, and this provides great generality
when used in relation to the functional model we have proposed.

Mapping across vocabularies, from a term in one vocabulary to the corresponding term
or terms in another, is increasingly needed as convenient access expands to more and more
repositories and to additional, less familiar metadata.

Vocabulary is central to the economics of digital libraries because unfamiliar terminology
impedes effective searching. Vocabulary is also important because it is central to issues of
identity, which, in turn, are central to society. Vocabulary, if given a technical definition in
Library and Information Science as the variety or range of values in a set, is a central feature in
the structure and use of digital libraries.

There is another consideration. It is a simplification, but I suggest that the historical
development of conceptions of Library and Information Science can be better understood if we
think in terms of two different traditions, which I call a "document tradition" and a "formal
tradition."

In the "formal tradition" I include all those techniques and technologies based on logic and
algorithms: punch cards, digital computers, data-processing, computing, artificial intelligence,
and historic traditions of information retrieval as reflected in meetings of ACM SIGIR. It is this
formal tradition that has done so much to make our conference topic--digital libraries--possible.
But this tradition depends on definitions and reliable procedures and is at odds with the
variability of human language and of human behavior.

In the "document tradition" I would place the historic practices of document services, such
as bibliography, librarianship, archives, and records management. In this tradition the concern has
been with documents in the sense of signifying objects and their use in the service of multiple
objectives: practical utility, education, recreation, literacy, and diverse social services. This
tradition has a certain logic: It entails that professional practice extends to any kind of signifying
object in any format, that it include (potentially) anything that helps knowledge, and an
understanding that documents have to do with knowledge, meaning, learning, description,
language, and ambiguity (Buckland 1997). It follows that no conception of Library and
Information Science can be complete if it does not incorporate cultural studies, and that,
ultimately, a mature, well-developed conception of Library and Information Science must
necessarily have lively roots in the concerns of the humanities and qualitative social sciences.

The two traditions appear to be, ultimately, incompatible because they start from
fundamentally different bases. Nevertheless, we cannot choose either one exclusively if we are
to be both effective and practical. However, vocabulary is central to both traditions. Both must,
in their differing ways, deal with issues of vocabulary, which provides a kind of meeting place.
The topic of vocabulary, I conclude, is important for this conference because the nature and role
of vocabulary is central to any credible conception of Library and Information Science.

Acknowledgments

The ideas in this paper draw heavily on the author's collaborations with Ron Day, with
Christian Plaunt, and with co-workers in the Search Support for Unfamiliar Metadata
Vocabularies project (DARPA Contract N66001-97-C-8541; AO# F477).
References

Berman, S. (1993). Prejudices and Antipathies: A Tract on the LC Subject Headings
Concerning People. Jefferson, NC: McFarland. First published 1971 by Scarecrow Press.

Buckland, M. K. (1997). What is a "document"? Journal of the American Society for
Information Science 48, 804-809. Reprinted in T. B. Hahn & M. K. Buckland, eds.
(1998). Historical Studies in Information Science. Medford, NJ: Information Today, 215-220.

Buckland, M. K. and C. Plaunt. (1994). On the construction of selection systems. Library Hi
Tech, 48, 15-28.

Buckland, M. K. and C. Plaunt. (1997). Selecting Libraries, Selecting Documents, Selecting
Data. In: Proceedings of the International Symposium on Research, Development &
Practice in Digital Libraries 1997, ISDL 97, Nov. 18-21, 1997, Tsukuba,
Japan, pp. 85-91. Tsukuba, Japan: University of Library and
Information Science, 1997.

Norgard, B. A., M.G. Berger, M. K. Buckland, & C. Plaunt. (1993). The online catalog: From
technical services to access service. Advances in Librarianship 17, 111-148.

Olding, R. K., ed. (1996). Readings in Library Cataloguing. Hamden, CT: Archon Press.

The Oxford English Dictionary. (1989). 2nd ed. Oxford: Clarendon Press.

Plaunt, C. (1997). A Functional Model of Information Retrieval Systems. Doctoral dissertation,
University of California, Berkeley.
Introduction
Vocabulary is not usually regarded as a central feature of libraries, digital or otherwise.
Library discourse is more often concerned with collections, budgets, staffing, buildings, users,
management, and other practical topics. But in this paper we take a closer look at what can be
said about "vocabulary" in libraries, especially digital libraries.
Search result: 0 records
FIND XSU VIETNAMESE CONFLICT
Search result: 4,190 records
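
The cross-reference remedy discussed in this paper (VIETNAM WAR use VIETNAMESE CONFLICT) can be illustrated with a minimal Python sketch. It is not a description of how MELVYL worked: the dictionaries USE_REFERENCES and SUBJECT_INDEX and the function exact_subject_search are hypothetical names introduced here, and the only data are the two record counts reported for the searches above (0 and 4,190).

```python
# Hypothetical "use" cross-references: entry term -> heading the catalog actually uses.
USE_REFERENCES = {
    "VIETNAM WAR": "VIETNAMESE CONFLICT",
}

# Hypothetical subject index: heading -> number of records.
# The single count is the one reported for the VIETNAMESE CONFLICT search.
SUBJECT_INDEX = {
    "VIETNAMESE CONFLICT": 4190,
}


def normalize(term: str) -> str:
    """Collapse whitespace and case so lookups ignore trivial differences of form."""
    return " ".join(term.upper().split())


def exact_subject_search(term: str, follow_use: bool = True) -> tuple[str, int]:
    """Return (heading searched, record count), optionally following a 'use' reference."""
    heading = normalize(term)
    if follow_use:
        # Translate the searcher's term into the cataloger's vocabulary if a reference exists.
        heading = USE_REFERENCES.get(heading, heading)
    return heading, SUBJECT_INDEX.get(heading, 0)


without_reference = exact_subject_search("Vietnam War", follow_use=False)
with_reference = exact_subject_search("Vietnam War")
```

Without the cross-reference the searcher's term matches nothing in the cataloger's vocabulary. With it, the same query is silently translated into the heading the catalog uses, which is the kind of transitional vocabulary the paper describes.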
Same form, different meaning: Homograph.
Different form, same meaning: Synonym.
Different form, different meaning: Different word.
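
The four contingencies of form and meaning can be sketched as a small Python function. This is only an illustration under a strong assumption: software can compare forms (character strings), but the meaning labels below are hypothetical and would have to be supplied by a human judge, which is precisely the paper's point about form versus meaning. The names Term and relationship are introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    form: str     # the character string that software can compare
    meaning: str  # hypothetical concept label assigned by a human judge


def relationship(a: Term, b: Term) -> str:
    """Classify a pair of terms from two vocabularies by form and meaning."""
    same_form = a.form.casefold() == b.form.casefold()
    same_meaning = a.meaning == b.meaning
    if same_form and same_meaning:
        return "same word"
    if same_form:
        return "homograph"
    if same_meaning:
        return "synonym"
    return "different word"


# Examples drawn from cases mentioned in this paper; the meaning labels are assumptions.
kite = Term("Drachen", "kite")
kite_balloon = Term("Drachen", "tethered observation balloon")
searcher_term = Term("Vietnam War", "war in Vietnam")
catalog_term = Term("Vietnamese Conflict", "war in Vietnam")
```

Only the comparison of forms is mechanical here. The comparison of meanings depends entirely on labels assigned outside the program.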
2. The range of a language of a particular person, class, profession, or the like.
3. The sum or aggregate of words composing a language; and
4. Figuratively, A set of artistic or stylistic forms, techniques, movements, etc, available
to a particular person, etc.
2. Partitioning (or (re-)ordering), as when cards are alphabetized or a subset of records
are selected as a retrieved set.
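
The two kinds of transition named in this paper, transformation and partitioning, together with "vocabulary" as the range of values in a set, can be sketched in Python as follows. This is a minimal illustration, not the functional model of Buckland & Plaunt: the function names and the three sample records are hypothetical, although the two subject headings are taken from the LCSH list quoted earlier.

```python
from typing import Callable, Hashable, Iterable, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def transform(collection: Iterable[T], derive: Callable[[T], U]) -> list[U]:
    """Transformation: derive a new set of representations, one per member."""
    return [derive(item) for item in collection]


def partition(collection: Iterable[T], keep: Callable[[T], bool]) -> list[T]:
    """Partitioning: select a subset of an existing set as the retrieved set."""
    return [item for item in collection if keep(item)]


def vocabulary(collection: Iterable[Hashable]) -> set:
    """Vocabulary in the extended sense: the range of distinct values (types) in a set."""
    return set(collection)


# Hypothetical miniature collection; the titles are invented for illustration.
books = [
    {"title": "Coastal Pollution Surveys", "subject": "Marine pollution"},
    {"title": "Managing the Coastal Zone", "subject": "Coastal zone management"},
    {"title": "Pollution of Coastal Waters", "subject": "Marine pollution"},
]

# Transformation: a set of documents yields a derived set of subject values (tokens).
subjects = transform(books, lambda book: book["subject"])

# Partitioning: a subset of records is selected as the retrieved set.
retrieved = partition(books, lambda book: book["subject"] == "Marine pollution")

# Three tokens, but only two types: the vocabulary of the subject field.
subject_vocabulary = vocabulary(subjects)
```

Because each function takes a set and returns a set, the operations can be chained, which is what makes the model recursive in the sense described in the abstract.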