Will Google’s Keyword Searching Eliminate the Need for LC Cataloging and Classification?

Thomas Mann


Thomas Mann is a Reference Librarian in the Main Reading Room of the Library of Congress; he is the author of The Oxford Guide to Library Research (3rd ed. forthcoming in October, 2005). This paper was written for AFSCME 2910.


Abstract

Google Print does not "change everything" regarding the need for professional cataloging and classification of books; its limitations make cataloging and classification even more important to researchers. Google’s keyword search mechanism, backed by the display of results in "relevance ranked" order, is expressly designed and optimized for quick information seeking rather than scholarship. Internet keyword searching does not provide scholars with the structured menus of research options, such as those in OPAC browse displays, that they need for overview perspectives on the book literature of their topics. Keyword searching fails to map the taxonomies that alert researchers to unanticipated aspects of their subjects. It fails to retrieve literature that uses keywords other than those the researcher can specify; it misses not only synonyms and variant phrases but also all relevant works in foreign languages. Searching by keywords is not the same as searching by conceptual categories. Google software fails especially to retrieve desired keywords in contexts segregated from the appearance of the same words in irrelevant contexts. As a consequence of the design limitations of the Google search interface, researchers cannot use Google to systematically recognize relevant books whose exact terminology they cannot specify in advance. Cataloging and classification, in contrast, do provide the recognition mechanisms that scholarship requires for systematic literature retrieval in book collections.


To fulfill its mission to "Organize all the world’s information and make it universally accessible and useful," Google has recently announced a project to put 15,000,000 digitized books on the open Internet. From one perspective, Google Print is indeed a consummation devoutly to be wished, as it will "free" those books from their current physical locations within library walls, and make them searchable from anywhere, at any time, by anyone with an Internet connection. This, however, does not "change everything" in the library field. What concerns me is the concealed proposition entailed in the project: the prospect of greatly expanded content for the open Internet comes freighted with a severely diminished capacity for finding that content. Access to the books, especially via subject searching, will become immeasurably more difficult.

Google’s software allows entry only through the prior specification of keywords. Granted, there are times when searching for particular terms is exactly what a researcher wants–for example, I once had to help someone who needed definitions of obscure medical and entomological words (arnaldia, cincinelles) used at the time of the Third Crusade, and which do not show up in any current dictionaries. It turns out I could indeed find their meanings through browsing old volumes in relevant areas of LC’s bookstacks; but for zeroing in immediately on such very distinctive words Google Print would have been a godsend.

Most scholarship, however, is not at the level of searching for discrete facts, let alone unusually distinctive keywords. Scholarship requires researchers to get an overview of sources relevant to a topic–not just "something" delivered "quickly"–and can be positively vitiated by a failure to attain a comprehensive perspective. Indeed, a recent survey of "Historians and Their Information Sources" confirms that "Comprehensiveness is clearly the highest priority in searching a database" (College & Research Libraries, 65 [2004]).

Google Print, no matter how large its content, cannot provide efficient access to its books because it is hamstrung by poor search software–poor, at least, for scholarly purposes. It has three major problems.

First, keyword searching of digitized full texts, no matter how cleverly the keywords are "relevance ranked," cannot provide a coherent overview of the book-literature of a topic. For example, if a researcher is interested in the history of Afghanistan, a keyword search in the current Google Web file on "Afghanistan" and "history" produces over 11,000,000 hits. The first page of retrievals does indeed include a number of short articles and chronologies on the country’s history–perhaps enough for a high school paper, but not enough for in-depth understanding. In contrast, traditional subject headings in a library OPAC provide an initial browse display like this:

Most of these headings are themselves further subdivided. All of these aspects of the topic might well be of interest to an historian; but such a taxonomy of options for pursuing the inquiry cannot be revealed by a simple combination of the keywords "Afghanistan" and "history" in a blank Google search box.

There is a categorical distinction between "prior specification" and "recognition" subject searching techniques. Keyword inquiries–no matter how the words are weighted, ranked, massaged, or manipulated–essentially give you only those results having the terms you’ve been able to specify in advance. They do not bring to your attention, except by chance, conceptual options that are slightly different in their focus. They do not allow you to recognize related sources whose terms you cannot think of beforehand.

Traditional librarian-created subject cataloging, in contrast, does precisely this–and does it systematically, in a way that makes the appearance of such structured option-arrays predictable, no matter what the subject area.

The second, related problem with having only keyword access to content is that it cannot solve the problems of synonyms, variant phrases, and different languages being used for the same subjects. To take but one of the many subject headings above, Afghanistan–History, the 190 book titles that are grouped together under this single term in the catalog of the Library of Congress exhibit an amazing and unpredictable variety of keywords, among them the following:

Come Back to Afghanistan: A California Teenager’s Story
Conflict in Afghanistan: Studies in Asymmetric Warfare
Garden of the Eight Paradises: Babur and the Culture of Empire in Central Asia, Afghanistan and India (1483-1530)
Istoriia Afganistana: v Sostave Velikikh Imperii, Derzhava Akhmed-Shakha Durrani, Anglo-Afganskie Voiny, Pravlenie Abdurrakhman-Khana, Konflikty XX Veka
Misteri dell’Afghanistan: Dalle Origini alla Caduta dei Taliban
Pour Mieux Comprendre l’Afghanistan
Invasioni dell’Afghanistan: da Alessandro Magno a Bush
Afghanistan: een Geschiedenis
Afgan Turkistani: Mazlum Turklerin Ulkesi
Sipah-i Hindukush
Afghan Occupation of Safavid Persia, 1721-1729

The overall subject heading (Afghanistan–History), even apart from filling a slot in relation to many other recognizably relevant subdivisions, thus serves to round up within itself an entire group of relevant works whose own keywords are so divergent that they could never be specified in advance. While they are all "about" the history of Afghanistan, they do not use those two exact words. This is a crucial difference between access by subject headings vs. keywords. "Relevance ranking," no matter how sophisticated, is not the same thing as conceptual categorization. The latter cannot be done systematically by machine algorithms "on the fly" even within English language sources, let alone across multiple languages simultaneously. And yet categorization is crucial to scholarship (as opposed to mere information-seeking), because scholarship requires comprehensive overviews of relevant literature.
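The contrast between prior specification and recognition can be made concrete with a minimal sketch. It assumes a toy record layout (a title plus a list of assigned headings); the titles come from the list above, while the field names, the function names, and the "Afghanistan--History" notation are illustrative assumptions rather than any real catalog's interface. It also matches keywords against titles only, which is a simplification of full-text searching.

```python
# Toy catalog records. Titles are taken from the list above; the record
# layout and the heading notation are assumptions made for illustration.
RECORDS = [
    {"title": "Come Back to Afghanistan: A California Teenager's Story",
     "subjects": ["Afghanistan--History"]},
    {"title": "Conflict in Afghanistan: Studies in Asymmetric Warfare",
     "subjects": ["Afghanistan--History"]},
    {"title": "Afgan Turkistani: Mazlum Turklerin Ulkesi",
     "subjects": ["Afghanistan--History"]},
    {"title": "Sipah-i Hindukush",
     "subjects": ["Afghanistan--History"]},
    {"title": "Afghan Occupation of Safavid Persia, 1721-1729",
     "subjects": ["Afghanistan--History"]},
]


def keyword_search(records, *terms):
    """Prior specification: keep titles containing every typed term."""
    wanted = [term.lower() for term in terms]
    return [r["title"] for r in records
            if all(term in r["title"].lower() for term in wanted)]


def subject_search(records, heading):
    """Recognition: keep titles a cataloger grouped under one heading."""
    return [r["title"] for r in records if heading in r["subjects"]]


keyword_titles = keyword_search(RECORDS, "Afghanistan", "history")
subject_titles = subject_search(RECORDS, "Afghanistan--History")
print(len(keyword_titles), "title(s) by keyword;",
      len(subject_titles), "title(s) by subject heading")
```

In this sample the keyword search returns nothing, because none of the five titles contains both words, while the heading lookup returns all five regardless of the language or wording of the title.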

Some inexperienced researchers seem to believe that the presence of billions of keywords (from millions of books) in a searchable database will assure that "Afghanistan" and "history" searched as terms within full texts will still retrieve all of the relevant books, even if those words don’t appear in the books’ titles. Unfortunately for scholarship, however, no such efficient retrieval is possible with Google’s search mechanisms.

It is immediately obvious that the foreign books will not be retrieved by English words; the groupings achieved by subject headings would thus be dispersed to the winds. (Disturbing parallels to the Tower of Babel come immediately to mind–this time, however, the Tower of Google will not be a myth.) The Library of Congress has always prided itself on the unmatched range and diversity of its foreign language holdings–a tradition that traces back to Mr. Jefferson himself. The Library has devoted extraordinary effort and expense, an investment of millions of taxpayers’ dollars over two centuries, in collecting, categorizing, and collocating works in more than 500 languages in order to bring the literature of the whole world to the attention of American scholars. This entire project would be vitiated in one fell swoop if Google’s search mechanism replaced LC’s own cataloging and classification. For example, I recently had to find out which book was the first one ever printed in the French language. Since my own French is not very good, I first tried English language inquiries, including the phrases "first book published in French" and "first book published in the French language" in Google, with no luck. To make a long story short, I eventually found a reference to a 1910 Catalogue of a Collection of Early French Books in the Library of C. Fairfax Murray; but when I went to retrieve it from the bookstacks at Z1023.M94, I found a shelf marker noting that it was "missing in inventory." Just to the left of it, however, at Z1023.L838, was a Manuel du Bibliographie Francais from 1927, which listed the Recuil des Histoires de Troyes (published in Cologne in 1466) as "le premier livre qui ait vu le jour dans notre langue."

The important point here is that even if this Manuel had been fully digitized in Google Print, I still would not have found it because no search of English terms would have retrieved it; and any French search I would have thought of would have included the phrase "langue francais" rather than "notre langue." I would have missed it, in other words, because I could not specify in advance the right keyword search terms. Due to the library’s classification scheme, however, I could recognize the variant French phrase that actually appears in the book, because its full text was shelved within easy browsing distance, adjacent to the English language book that I actually was looking for. Google, in contrast, cannot bring about the collocation of full texts in different languages.

Even the majority of English language books, all by themselves, will be largely lost to view in Google Print because of the third major problem with keyword searching: it cannot segregate the appearance of the right words in conceptual contexts apart from the appearance of the same words in the wrong contexts. Any experience at all with Google will show this to be true; its "relevance ranking" fails miserably in this area of displaying the right keywords in the right conceptual contexts. If a Google Web search for "Afghanistan" and "history" produces eleven million hits right now, a similar search in Google Print, with 14.5 billion pages of keywords, is very likely to produce similar results. It will become utterly impossible to "see the forest for the trees" with Google’s software; the "forest" overviews created by LC’s cataloging in OPAC browse displays, such as for Afghanistan above, will be completely lost.

The Google Print project will be hampered by a further problem: its scanned 15,000,000 books will include tens of thousands of dictionaries. Any keyword searched will thus retrieve all dictionaries in which the word appears–nor could results be "progressively refined" by adding more words because those words, too, will "hit" in the same dictionaries. (This is already a problem for researchers using a much smaller full-text database, the Evans Early American Imprints.)

Google Print, no matter how wonderful its contents, will not enable scholars to find its contents by the crucial "recognition" means of searching brought about by subject heading categorizations in library catalogs and by subject-classified shelving of books in library bookstacks. The latter is a mechanism that does indeed limit the appearance of recognizably relevant keywords within full texts that are themselves conceptually segregated from other groupings of full texts having the same words in undesired contexts. For another example, I was once asked to find information on traveling libraries that circulated among lighthouse keepers at the turn of the last century. The researcher wanted comments on, or recollections of, the use of these libraries that she could quote in her study. Since neither the online catalog nor a variety of indexes and databases worked, I went back into the stacks to browse through the 438 volumes shelved in the class area VK1000-1025 ("Lighthouse service"). Within that controlled grouping, I found fifteen books that had directly relevant sections. Any appearance of the word "libraries" was almost guaranteed to be in the right context because I was limiting my search for it to only those books that were on lighthouses to begin with. In contrast, an "Advanced" search in Google Web right now, combining "libraries" with either "lighthouse" or "lighthouses," produces 803,000 hits. Google’s software is not capable of segregating the right words into only the desired context. It provides no conceptual boundaries–searchers can never tell how far down the display list they need to go to find all of the relevant hits buried within the chaff.
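The lighthouse search amounts to applying a class boundary first and a keyword second. The following is a minimal sketch under stated assumptions: the three records, their call numbers, and their text snippets are invented for illustration, and the call-number parser handles only the simple "letters plus number" pattern seen in VK1000-1025, not the full Library of Congress call-number syntax.

```python
import re

# Invented records for illustration only; they are not real holdings.
BOOKS = [
    {"call": "VK1023.P8",
     "text": "keepers welcomed the traveling libraries brought by the tender"},
    {"call": "VK1015.A2",
     "text": "fog signals and the care of the lens"},
    {"call": "Z675.U5",
     "text": "public libraries as a lighthouse of learning"},
]


def parse_call_number(call):
    """Split a call number into its class letters and leading class number."""
    match = re.match(r"([A-Z]+)\s*(\d+(?:\.\d+)?)", call)
    if match is None:
        raise ValueError(f"unrecognized call number: {call!r}")
    return match.group(1), float(match.group(2))


def in_class_range(call, letters, low, high):
    """True when the call number falls inside letters + [low, high]."""
    found_letters, number = parse_call_number(call)
    return found_letters == letters and low <= number <= high


def keyword_hits(books, word):
    """Unbounded keyword search: every book whose text contains the word."""
    return [b["call"] for b in books if word.lower() in b["text"].lower().split()]


def scoped_hits(books, word, letters, low, high):
    """Keyword search limited to books shelved within one class range."""
    shelf = [b for b in books if in_class_range(b["call"], letters, low, high)]
    return keyword_hits(shelf, word)


print(keyword_hits(BOOKS, "libraries"))
print(scoped_hits(BOOKS, "libraries", "VK", 1000, 1025))
```

Here the unbounded search also returns the Z675 record, where "lighthouse" is only a metaphor, while the scoped search returns only the VK1023 record. The boundary does the contextual work that the keyword alone cannot.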

Google cannot provide overview maps of the many aspects of a topic; it cannot systematically retrieve conceptually relevant terms that are different from the exact words specified; and it cannot prevent the appearance of the "right" words from being buried within mountains of hits having the same keywords in irrelevant contexts. Keyword searching in any database cannot solve these problems; the difficulties are considerably exacerbated in Google because it has such poor keyword search software to begin with–i.e., it does not allow full Boolean combinations in nested parentheses, or wildcard truncation, or proximity operators. Nor will Google Print be able to allow limitation of words to particular fields, because its high-volume scanning will not be able to distinguish or segregate such fields (title, contents, notes, bibliographies, etc.) within book texts to begin with.

Let’s not be naively misled by the term "relevance ranking." All this refers to is a set of algorithms for arranging the display of those "hits" containing the exact words the searcher has specified. The results having the most appearances of those words (in proportion to the length of the texts), and those that have the most other sites linked to them, will appear at the top of the retrieval screens. "Ranking" by these criteria, however, does not accomplish any of the tasks necessary for scholarly control of book literature: standardization of search terms (authority control), systematic linkage of related concepts, creation of overview browse displays in OPACs, and creation of categories of full-texts related by concept (no matter what their keywords) in limited and manageably browsable groupings. It utterly destroys the recognition access required by scholarship in exchange for a very low level of prior-specification access appropriate only to quick information-seeking, devoid of context, connection, focus, or comprehensiveness.
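A rough sketch of ranking by the two criteria just described (frequency of the specified words in proportion to text length, plus the number of inbound links) shows why ordering is not categorization. This is emphatically not Google's actual algorithm: the scoring formula, the `link_weight` parameter, and the sample documents are all assumptions made for illustration.

```python
# Invented sample documents; names, texts, and link counts are assumptions.
DOCS = [
    {"name": "chronology",
     "text": "afghanistan history timeline afghanistan history",
     "inbound_links": 5},
    {"name": "travel blog",
     "text": "my trip notes mention afghanistan and history once",
     "inbound_links": 40},
    {"name": "istoriia",
     "text": "istoriia afganistana v sostave velikikh imperii",
     "inbound_links": 3},
]


def term_density(text, terms):
    """Occurrences of the search terms divided by the length of the text."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(words.count(term.lower()) for term in terms)
    return hits / len(words)


def rank(docs, terms, link_weight=0.01):
    """Order only those documents that contain every specified term."""
    matching = [d for d in docs
                if all(term.lower() in d["text"].lower().split() for term in terms)]
    return sorted(
        matching,
        key=lambda d: term_density(d["text"], terms) + link_weight * d["inbound_links"],
        reverse=True,
    )


for doc in rank(DOCS, ["Afghanistan", "history"]):
    print(doc["name"])
```

The Russian-language history never enters the ranking at all, since it contains neither specified word. Ranking rearranges the hits; it does not add the works that a subject heading would have gathered.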

It has been asserted that students today no longer use libraries, and that they rely on the Internet for all of their search needs. Such assertions are based on bad scholarship and questionable claims to evidence. See the "Survey of Library User Studies" mounted, along with the present paper, on AFSCME’s Web site (www.guild2910.org). The user studies consistently indicate that while students turn first to the Internet, the large majority of them also go on to use real libraries as well–a point confirmed repeatedly in the CLIR Dimensions study, the OCLC White Paper, and the Columbia EPIC survey, among others. (The Pew study, a survey of 754 underage children’s use of the Internet for grade school or high school papers, is not relevant to the concerns of research libraries.)

The bottom line is that it is extremely naive for LC management to think that it can dispense with LC subject cataloging and classified book-shelving if Google Print (or Internet Archive) provides electronic full texts of books. The "same" books are not nearly as discoverable in Google as they are through traditional search mechanisms. The larger the book collection, the more–not less–scholars are dependent on categorizations created by subject headings in catalogs and classified shelving in bookstacks. "Access" provided by the mere addition of more keywords to a database, with their "ranking" by machine algorithms, is radically different from conceptual categorization; it is only the latter that provides both overview and recognition access to books’ contents. Without those means of access, provided and maintained primarily by the Library of Congress, scholars–and scholarship itself–will be lost in a wilderness of inadequately sorted information "atomized" at keyword levels into conceptual incoherence, and retrievable only superficially via unsystematic and haphazard guesswork.

The Library of Congress has the foremost responsibility of any institution in the country to maintain the quality, accuracy, and consistency of the cataloging and classification schemes that are known throughout the world under our name. This responsibility cannot be abdicated to the collective member libraries of OCLC or to any other body of thousands of participants. "What is everyone’s responsibility is no one’s responsibility." LC itself has the professional duty to maintain the highest standards in this area, to "provide maximum access and facilitate effective use of the collections by Congress and other customers," according to our own Strategic Plan FY 2004-2008. The quality of scholarship that will even be possible in the future, in research libraries throughout our nation, is dependent on the professionalism of the Library of Congress in maintaining high standards of quality in both cataloging and classification. Without those standards, English language books will be atomized into incoherent wildernesses, and foreign books will be scattered to the winds by a Tower of Google that rivals the Tower of Babel in disrupting scholarly communication. It is our LC system, above all others, that solves the myriad problems of access to book literature that are created and exacerbated by the inadequate search mechanisms of Google and other Internet companies. The maintenance of these control systems, which are so necessary to substantive scholarship in all subject areas and all languages in research libraries throughout the world, needs to be a much higher priority in LC’s budget than the continuing digitization of old, copyright-free special collections in narrow subject niches that have limited general utility.




This page was last updated on August 15, 2005.