One of the top issues in the library automation arena in the last couple of years involves the development of a new generation of interfaces to replace online catalogs that have fallen behind the expectations of Web-savvy library users. This aspect of library automation currently attracts incredible interest—almost all libraries are giving consideration to how they can bring the search tools they offer for their collections and their overall Web presence up to the level expected on the Web today. Even at this early point in the adoption cycle of new library interfaces, it’s time to press onward toward even more effective and powerful search tools. In this month’s column I’m especially interested in exploring the expansion of metadata-based search into deep search based on the full content of digitized materials.
The Current Line of Next-Generation Interfaces
I’ve written previously about the new generation of library interfaces. Products like Encore from Innovative Interfaces, Primo from Ex Libris, ProFind from Endeca, AquaBrowser Library from Medialab Solutions, and VuFind, created at Villanova University, allow libraries to step up to interfaces appropriate for the current millennium. These products bring an expanded and consolidated scope of search, faceted navigation, sophisticated search technologies, relevancy ranking of results, enriched displays, and many other features needed to bring the user experience of the library’s Web presence up to the par established on the commercial Web.
We can expect other products to roll out in the near future, both from commercial vendors and the open source realm. The Library Corporation has announced Indigo, SirsiDynix has a product based on GlobalBrain from BrainWare in the works, and the University of Rochester River Campus Libraries is working on the eXtensible Catalog with the backing of the Andrew W. Mellon Foundation. BiblioCommons is hard at work on a next generation library interface that aims to integrate social computing into the search process.
In other recent news, R.R. Bowker has acquired AquaBrowser from Medialab Solutions, leading us to expect this next-gen interface to become integrated into the product suite offered by Serials Solutions, another Cambridge Information Group company, and to receive a huge boost in marketing as a result of its incorporation into a larger corporate environment.
While each of the existing products has important qualities that distinguish it from its competitors, they have a great deal in common. They all offer tremendous improvements over legacy library OPACs. Still, it’s important that we not become complacent with the concepts and features seen in the current generation of library interfaces. Web-time advances rapidly. Lest we fall behind again, we should already be thinking about what follows the current generation of next-generation library interfaces. I’ve been thinking lately about some added dimensions that will benefit the next generation of library interfaces and especially about how library search can expand beyond some of its current constraints.
One of the essential improvements to search that I see in the near future involves much deeper searching capabilities. Mass digitization offers the potential to transform the way that we search our book collections. Most library search environments perform searching based on metadata records that describe each item in the collection. The standard approach to book search in libraries works with MARC records. For other types of collections, such as images or other multimedia content, we make use of various flavors of Dublin Core or other appropriate metadata formats.
We are all aware of the efforts of Google Library Print to digitize millions of books in the world’s leading libraries, its partnerships with publishers for searching the full text of current publications, and the similar ambitions of Microsoft Live Book Search. Outside the commercial arena, the Open Content Alliance has digitization efforts underway on a somewhat smaller scale. Millions of books have already been digitized through these efforts, with tens of millions expected to be completed within just a few years.
In this age of mass digitization, I worry that search based solely on metadata will ultimately fail to provide the optimal level of discovery. The commercial global search companies have invested heavily in gaining access to the full text of books and stand to gain a significant advantage in the book search arena, a domain that libraries have long claimed as a key point of excellence. As deep search becomes standard fare for book content on the commercial Web, it’s important for libraries to find ways to offer similar capabilities within their own discovery environments.
The key motivation behind these digitization efforts lies in their benefits to the search process. Full online presentation of book-length works currently, and for the foreseeable future, will be impeded by copyright restrictions and by practical limitations of online reading environments. For the time being, these digitized books primarily fuel discovery environments that drive sales of print materials and library loans, and of course, increase interest in the search engines themselves.
Deep search based on the full text of digitized books complements the ways that libraries search their book collections. In the traditional library metadata model, a cataloger constructs a MARC record, transcribing titles, authors, and detailed publication data, places the book in context through the selection of an appropriate call number, and assigns subject headings that best describe what the book is about. The library cataloging process aims to provide a variety of access points for the benefit of patrons as they search library collections.
Libraries are well served by this method of cataloging, and will be into the foreseeable future. Yet, the advent of mass digitization offers, or even demands, a new dimension of search based on the availability of the actual full text of the book. In this age of mass digitization, each word or phrase of a book becomes a possible access point.
Descriptive cataloging provides a selected number of high-quality access points. Full text searching provides an enormous quantity of mostly undifferentiated access points. Search technology, however, is advancing rapidly to perform very sophisticated retrieval based on the full text, leveraging context, position, and patterns in ways that approach the efficacy of hand-crafted metadata.
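The contrast between the two approaches can be illustrated with a toy inverted index. This is a hypothetical sketch, not any vendor’s implementation: the sample records and text excerpts below are invented for illustration, showing how every word of the full text becomes an access point that a metadata description may lack.

```python
# Sketch: an inverted index maps each term to the documents containing it.
# Indexing the full text yields far more access points than metadata alone.
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Invented records: a brief metadata description vs. an excerpt of full text.
metadata = {"book1": "Privacy issues for public school and academic libraries"}
full_text = {"book1": "Privacy issues for public school and academic libraries "
                      "including a discussion of RFID tagging of collections"}

meta_index = build_index(metadata)
text_index = build_index(full_text)

# A query on a term absent from the metadata succeeds only against full text.
print("rfid" in meta_index)   # False: metadata-only search misses it
print("rfid" in text_index)   # True: deep search finds it
```

The full-text index retrieves the book for a query term that appears nowhere in its metadata record, which is exactly the kind of retrieval deep search adds.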
At present, we’re in the early days of book search based on mass digitization. The proportion of digitized books relative to the whole body of published works remains fairly small. We may fairly quickly reach a critical mass where most book content is digitized—a tipping point at which deep search of book content will more seriously challenge traditional search based solely on metadata.
Benefits of book search based on the full text of digitized books are already evident. Google Book Search already incorporates millions of books. Though I’m not aware of the total number of books digitized through the Google Library Print project, the University of Michigan recently celebrated its millionth digitized volume. If other library partners have made similar progress, Google Book Search already spans millions of volumes.
I’ve been playing with Google Book Search in the last few days, and I’m fairly impressed. I’m particularly interested in whether it turns up items that would not have turned up in a search based on the metadata. I didn’t do a rigorous analysis, but consistently found that it turns up relevant items based on phrases in the works not represented in MARC records. The search “RFID in Libraries,” for example, returns results that include Privacy in the 21st Century: Issues for Public, School, and Academic Libraries by Helen R. Adams, a book whose text includes many references to RFID and would probably be of interest to someone thinking about implementing this technology in a library. A look at the book’s corresponding record in WorldCat reveals no reference to RFID. This is just one tiny example of the kinds of retrieval possible through deep search that could be missed in systems that rely on metadata alone.
Amazon.com’s “Search Inside the Book” serves as another example of deep search on the commercial Web. For many of the books in its inventory, Amazon offers users the ability to perform full-text searching. This feature can be very useful for determining whether a given book contains specific content of interest. As far as I can determine, search within the Amazon site itself seems to be based on metadata and does not yet use the full text. I’ve tried searching for phrases seen within a book through the “Search Inside” feature that fail to find the item when searching the site. Search inside the book is another feature that I’d love to see on library sites.
Metadata + full text = powerful search
I believe that metadata-based search and deep search based on digitized content are not mutually exclusive, but complementary approaches. Full-text search tends to return too much; search based on metadata alone can return too little. A combination of the two should result in incredibly powerful search environments. Even when we get to the point where all book content is available digitally, I think that creating high-quality metadata records will continue to be a valuable activity.
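One simple way to combine the two approaches is to weight matches on curated metadata fields more heavily than matches buried in the full text, so that precisely cataloged items rank first while full-text matches are still retrieved. The weighting scheme and sample records below are illustrative assumptions of mine, not any product’s actual relevancy algorithm.

```python
# Sketch of a blended ranking: metadata matches score higher than
# full-text matches, combining precision with deep recall.
def score(query_terms, record, meta_weight=3.0, text_weight=1.0):
    """Score a record by weighted term matches in metadata and full text."""
    meta = record["metadata"].lower()
    text = record["full_text"].lower()
    total = 0.0
    for term in query_terms:
        if term in meta:
            total += meta_weight
        if term in text:
            total += text_weight
    return total

# Invented example records for illustration.
records = [
    {"title": "RFID for Libraries",
     "metadata": "rfid libraries technology",
     "full_text": "a practical guide to rfid in libraries"},
    {"title": "Privacy in the 21st Century",
     "metadata": "privacy intellectual freedom libraries",
     "full_text": "chapters discuss rfid and patron privacy in libraries"},
]

query = ["rfid", "libraries"]
ranked = sorted(records, key=lambda r: score(query, r), reverse=True)
print([r["title"] for r in ranked])
```

The book cataloged under RFID ranks first, but the item that mentions RFID only in its text is still retrieved rather than missed entirely, which is the point of the blend.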
The new Book Search API
Support of OpenURL
Google has done some interesting work to connect its Book Search with library collections. It has established versions of its Book Search service for some of its library partners, including Harvard University, that provide specific hooks into the local link resolver and catalog to facilitate access to books available in their own libraries (see http://books.google.com/books/harvard). When searching this version of Google Book Search, results include a link titled “Find at Harvard University,” which is an OpenURL resolved by Harvard’s SFX server that takes the user to the record in the HOLLIS catalog so that the book can be borrowed locally. Harvard offers links from its site into this version of Google Book Search so that Harvard users enter the version of the service that gives them the best chance of getting to the books held within their own libraries.
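A “Find at” link of this kind carries an OpenURL, a citation encoded as a query string per the NISO Z39.88-2004 key/encoded-value format, which the library’s link resolver interprets against local holdings. The sketch below composes such a URL; the resolver address, title, and ISBN are placeholders, and each library substitutes its own resolver base URL.

```python
# Sketch: composing an OpenURL 1.0 (Z39.88-2004, KEV format) for a book
# citation. The resolver base URL and the citation values are placeholders.
from urllib.parse import urlencode

def build_openurl(resolver_base, title, isbn):
    """Compose an OpenURL 1.0 query string for a book citation."""
    params = {
        "url_ver": "Z39.88-2004",
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:book",  # book citation format
        "rft.btitle": title,
        "rft.isbn": isbn,
    }
    return resolver_base + "?" + urlencode(params)

# Placeholder resolver address and citation values, for illustration only.
link = build_openurl("https://resolver.example.edu/sfx",
                     "Example Book Title", "0000000000")
print(link)
```

The resolver receiving this URL looks up the cited book in local systems and directs the user to the catalog record, which is how the “Find at Harvard University” link reaches HOLLIS.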
Library versions of deep search?
I don’t think that libraries should cede deep search to the commercial search companies. Rather, I think that future generations of library search environments must find ways to incorporate this deep full-text search of books and other materials. I don’t necessarily have a path in mind on how this can be accomplished. Even collectively, libraries will be hard pressed to muster resources on the scale that Google has invested to bring a comprehensive book search service within reach. Nonetheless, without deep search in a library-built environment, we remain uncomfortably dependent on the generosity of the commercial search companies. While Google seems interested in providing services consistent with library interests today, its business strategies could evolve in ways that may not favor the position of libraries.
As libraries seek to create future generations of library interfaces that work toward providing a single point of entry into all of their content and services, I think that deep search stands as an important ingredient. I’ve written previously about the importance of creating search interfaces that consolidate print and electronic resources into a single search process. As library interfaces develop toward future generations, I think that it’s also important that they incorporate deep search to the largest extent possible.