Lately, it seems like every new concept, movement, or technology is “open”. While being open is surely a good characteristic for a computer system to embody, the term has found its way into so many names and acronyms that it's hard to keep track of them all. Several key movements have emerged in the library and information systems world: the Open Archives Initiative, OpenURL as the basis for reference linking environments, the Open Source movement for software development, and the Open Archival Information System framework for creating reliable information repositories. In this month's column we will talk about the Open Archives Initiative. I plan to cover each of the other movements in future columns.
Open Archives Initiative
One of the most interesting recent developments in the digital library arena involves the Open Archives Initiative (OAI). Although OAI was developed only within the last few years, many of the major players in digital libraries and scholarly communications have lent strong support to this new approach to information discovery, search, and retrieval. While it is still in the early phases of its deployment, the initiative is gaining significant momentum and will likely be a key piece of the future digital library landscape.
The Open Archives Initiative emerged out of the scholarly communications arena as a means to provide interoperability among multiple information sources. OAI relies on a model of metadata harvesting to support the creation of information services that span multiple sources or that offer other value-added features. The OAI universe is based on information repositories, or “data providers,” that make their metadata available, using a prescribed set of protocols, to “service providers” that build new information resources. End-users gain the benefit of OAI-based services that aggregate the metadata of multiple OAI repositories. Note that OAI operates with metadata, not complete works of digital content. In most cases the metadata include links back to the original information repositories for access to the documents or other digital objects.
As a publicly available, well-defined specification developed collaboratively among a large group of stakeholders, OAI can truly be considered open. Any public, private, or commercial entity can build systems that comply with this protocol. The use of the term “archives,” however, has proven to be somewhat controversial. For many, an “archive” refers to a physical or digital collection that meets very specific expectations regarding the management and preservation of materials. The OAI community uses “archive” in a much more general way, referring to any kind of repository of digital content.
The Open Archives Initiative makes no assumptions regarding access to information. An OAI service provider can provide free and universal access, it can restrict access to specific communities, or it can impose fees for access. OAI is neutral regarding access control and business models.
Simplicity is the key
The OAI metadata harvesting protocol operates only between data providers and service providers behind the scenes. No special software is needed to search an OAI-based information service. The users of a service need not even be aware that OAI was used as the means for collecting the metadata.
The OAI protocols were designed to be very simple and efficient. Because the protocols avoid complexity, enabling an existing information repository to function as an OAI-compliant data provider is a relatively simple process, generally requiring only a few days' worth of programming effort. Building services is somewhat more challenging.
OAI vs. Z39.50
The metadata harvesting approach of OAI stands in distinct contrast to search and retrieval protocols such as Z39.50. In essence, Z39.50 and OAI both accomplish what is often called federated searching, allowing users to gather information from multiple related resources through a single interface. One search query can return results from many information resources, increasing the comprehensiveness of the information available without forcing the user to search multiple places individually.
The library world has long relied on Z39.50 as the means for providing the ability to search multiple online catalogs or other bibliographic resources. Z39.50 relies on an online, real-time connection between the searcher's system and one or more targets using a thick and complex set of communications protocols. Sessions can be established where a query is broadcast to multiple Z39.50 targets, and the results from each are gathered, sorted, de-duplicated, and presented to the user. The advantage of Z39.50 lies in the ability to search remote resources through a common user interface and in the immediacy of accessing current information in real time. Z39.50, however, is also subject to the vagaries associated with maintaining online connections with remote servers. Especially when searching multiple remote resources, Z39.50 can be a complex and unwieldy information retrieval model.
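The broadcast model described above can be sketched in a few lines. This is only an illustration of the pattern (query every target live, then merge, de-duplicate, and sort), not a Z39.50 client: plain Python functions stand in for remote targets, and the catalog names, ISBN values, and the `make_target` and `broadcast_search` helpers are all hypothetical.

```python
def make_target(records):
    """Build a stand-in for a remote catalog that filters its records by title."""
    def search(query):
        return [r for r in records if query.lower() in r["title"].lower()]
    return search


def unreachable_target(query):
    """Stand-in for a remote server that fails to answer."""
    raise ConnectionError("remote server did not respond")


# Hypothetical targets; a real deployment would hold network connections here.
TARGETS = {
    "Catalog A": make_target([
        {"isbn": "111", "title": "Digital Libraries"},
        {"isbn": "222", "title": "Archives in Practice"},
    ]),
    "Catalog B": make_target([
        {"isbn": "111", "title": "Digital Libraries"},
        {"isbn": "333", "title": "Building Digital Archives"},
    ]),
    "Catalog C": unreachable_target,
}


def broadcast_search(targets, query):
    """Send the query to every target at search time and merge the answers."""
    merged, failed = {}, []
    for name, search_fn in targets.items():
        try:
            for record in search_fn(query):
                # De-duplicate on ISBN: keep the first copy seen.
                merged.setdefault(record["isbn"], record)
        except ConnectionError:
            # One unreachable target leaves a gap in the result set.
            failed.append(name)
    results = sorted(merged.values(), key=lambda r: r["title"])
    return results, failed
```

Note that every search depends on every target answering at that moment, which is exactly the fragility the harvesting model sets out to avoid.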
A Metadata Harvesting Approach
OAI avoids the need to maintain interactive connections among all the original repositories and the end user by creating pre-built collections of metadata. An OAI service gathers all the metadata from the data providers it aggregates in advance. This framework allows for the creation of comprehensive indexes and other components that might enhance the search and retrieval process. OAI-based services can offer rapid search and retrieval capability since the metadata are already physically present and do not have to be retrieved dynamically from remote servers.
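To make the idea of a pre-built collection concrete, here is a minimal sketch of a service provider indexing metadata it has already harvested. The records, repository names, and the `build_index` and `search` helpers are assumptions made for illustration; a real service would use a full search engine rather than a dictionary of words.

```python
from collections import defaultdict

# Hypothetical metadata already harvested from two data providers.
HARVESTED = [
    {"id": "a:1", "source": "Repository A", "title": "Quantum gravity notes"},
    {"id": "b:7", "source": "Repository B", "title": "Notes on cognitive models"},
    {"id": "b:9", "source": "Repository B", "title": "Gravity wave detection"},
]


def build_index(records):
    """Map each title word to the identifiers of the records containing it."""
    index = defaultdict(set)
    for record in records:
        for word in record["title"].lower().split():
            index[word].add(record["id"])
    return index


def search(index, query):
    """Return identifiers of records whose titles contain every query word."""
    words = query.lower().split()
    if not words:
        return set()
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return matches
```

Because the index is built ahead of time, a search touches only local data; no remote repository is contacted while the user waits.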
The communications model of OAI relies on the bulk transfer, or harvesting, of metadata between a service and all its data providers, based on a set of very simple protocols. The initial harvest can involve massive amounts of data. Fortunately, these data transfers can be scheduled to occur during off-peak hours.
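A harvesting request is, in practice, little more than a URL. The sketch below assumes the argument names defined in version 2.0 of the protocol specification (`verb`, `metadataPrefix`, `from`, `resumptionToken`); this column does not spell them out, so consult the specification before relying on them. The base URL and the `list_records_url` helper are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical address of a data provider's OAI interface.
BASE_URL = "http://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, resumption_token=None):
    """Build the URL for one harvesting request."""
    if resumption_token:
        # Assumption from the 2.0 specification: a resumption token
        # is sent on its own, without the other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # Restrict the harvest to records changed since this date.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)
```

A scheduler on the service provider's side could call this during off-peak hours and fetch the resulting URL with any HTTP client.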
While the service provider initiates each harvesting session, the data provider can set limits on the process, applying whatever throttles are necessary to avoid overload conditions. In essence, the data provider's server always has the option of telling the service provider's harvester to come back later when it's likely not to be so busy. The control mechanism for the data provider is implemented in the OAI protocol through the use of a “resumption token” which stops the current transfer, giving the harvester a ticket to come back later and begin where this session ended.
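The resumption-token handshake is easy to picture as a loop. The following sketch simulates both sides in memory rather than over a network: the record names, the chunk size, and the choice to encode the token as a simple offset are all assumptions for illustration. A real data provider is free to make its tokens opaque.

```python
from typing import Optional

# Hypothetical data provider holding ten metadata records.
RECORDS = [f"record-{n}" for n in range(1, 11)]
CHUNK_SIZE = 4  # assumed per-response limit set by the data provider


def list_records(resumption_token: Optional[str] = None):
    """Data-provider side: return one chunk plus a token, or None when done."""
    start = int(resumption_token) if resumption_token else 0
    chunk = RECORDS[start:start + CHUNK_SIZE]
    next_start = start + CHUNK_SIZE
    token = str(next_start) if next_start < len(RECORDS) else None
    return chunk, token


def harvest():
    """Service-provider side: keep asking until no token comes back."""
    harvested, token, requests = [], None, 0
    while True:
        chunk, token = list_records(token)
        harvested.extend(chunk)
        requests += 1
        if token is None:
            break
    return harvested, requests
```

The harvester never decides how much it receives per request; it simply presents the ticket it was given and picks up where the previous response ended.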
Especially when dealing with large data stores, the initial OAI harvest will be broken down into multiple chunks, dependent on the number of records that the various data providers' servers are configured to deliver at a time. Once the initial metadata transfer has taken place, the service provider's harvester must maintain a regular schedule of visiting its data providers to gather new, updated, and deleted items. The currency of an OAI-based service depends on frequent harvesting.
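After the initial load, each visit only needs to apply what changed. Here is a minimal sketch of that update step, assuming each record carries a datestamp and a deleted flag. The identifiers, dates, and the `changed_since` and `incremental_harvest` helpers are hypothetical, and the provider is simulated with a dictionary.

```python
from datetime import date

# Hypothetical provider-side records: identifier -> (datestamp, title, deleted?)
PROVIDER = {
    "oai:example:1": (date(2002, 1, 5), "First paper", False),
    "oai:example:2": (date(2002, 2, 10), "Second paper (revised)", False),
    "oai:example:3": (date(2002, 2, 12), None, True),  # withdrawn item
    "oai:example:4": (date(2002, 2, 14), "Fourth paper", False),
}


def changed_since(last_harvest):
    """Provider side: records whose datestamp is later than the last visit."""
    return {i: r for i, r in PROVIDER.items() if r[0] > last_harvest}


def incremental_harvest(local_index, last_harvest):
    """Service side: apply new, updated, and deleted items to the local copy."""
    for identifier, (_stamp, title, deleted) in changed_since(last_harvest).items():
        if deleted:
            local_index.pop(identifier, None)
        else:
            local_index[identifier] = title
    return local_index
```

The local copy is only as fresh as the most recent run of this step, which is why the harvesting schedule determines the currency of the whole service.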
Given that the information in an OAI service is typically updated weekly or at best daily, this model would not lend itself to applications that demand up-to-the-minute currency of information. For this reason, it is unlikely that OAI would replace interactive protocols like Z39.50 for online library catalogs, where it is important to see current information about the checkout status of a book.
The minimum metadata structure prescribed by OAI is unqualified Dublin Core. It is permissible, however, for data providers and service providers to agree to specific qualifications of Dublin Core or to use alternate metadata structures that are appropriate to their disciplines. OAI allows for very flexible metadata options.
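For readers who have not seen unqualified Dublin Core, the sketch below parses a small sample record. The element names and the namespace come from the Dublin Core element set; the `<record>` wrapper, the sample values, and the `parse_dc` helper are assumptions for illustration and do not reproduce the exact envelope an OAI response uses.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace

# Hypothetical record; the <record> wrapper is a simplification.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Pre-print</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Roe, Richard</dc:creator>
  <dc:date>2001-08-01</dc:date>
  <dc:identifier>http://repository.example.org/papers/42</dc:identifier>
</record>"""


def parse_dc(xml_text):
    """Collect Dublin Core elements into a dict of lists (elements may repeat)."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root:
        if element.tag.startswith("{" + DC_NS + "}"):
            name = element.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((element.text or "").strip())
    return fields
```

Every element is optional and repeatable, which is why the values are kept as lists. The identifier is typically what carries the link back to the original repository.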
Historical Background
OAI traces its beginnings to efforts to increase interoperability among pre-print servers of scientific and technical papers. In some scientific disciplines, especially physics, scholars and researchers would deposit articles and papers into pre-print servers, allowing information to spread through the scholarly community much more rapidly than through traditional printed journals. Pre-print servers did not prevent researchers from publishing in traditional journals; researchers would also submit their work for publication in the usual venues. Two of the most successful and well-known pre-print servers are the arXiv e-Print server at the Los Alamos National Laboratory (LANL) and the CogPrints electronic archive for psychology and cognitive sciences. There are dozens (if not hundreds) of other major pre-print servers, with new ones emerging all the time. Pre-print servers have made a major impact on scholarly communications in many areas of scientific research.
In 1999, a group of researchers and librarians met in Santa Fe, New Mexico, to discuss issues related to pre-print servers, focusing especially on the need to provide ways for users to search across these servers more efficiently. Out of this meeting grew the initial concepts of creating services based on harvested metadata. The Santa Fe Conventions, as they were originally known, quickly evolved into the Open Archives Initiative, with only minor revisions along the way. A series of OAI workshops were held that gave potential implementers of the protocol opportunities for participation in the development of the protocols. A steering committee was appointed, charged with overseeing the development of OAI and promoting its use (www.openarchives.org/news/oaiscpress000825.html).
While OAI began in the pre-print server arena, it is now generally regarded as a model that can work with a wide range of content types across multiple disciplines. One of the strengths of OAI is that its simple approach can be applied to many different applications.
Supporters and Current Projects
The Open Archives Initiative finds great support through organizations such as the Digital Library Federation and the Coalition for Networked Information. These organizations have provided an ongoing set of meetings that facilitated the development of the conceptual framework, the protocols themselves, and the promotion of OAI-related activities. The Andrew W. Mellon Foundation has taken an interest in the Open Archives Initiative, providing funding for related projects at seven organizations: The Research Libraries Group, University of Michigan, University of Illinois at Urbana-Champaign, Emory University, Woodrow Wilson International Center for Scholars, the University of Virginia, and Southeastern Library Network, Inc. (SOLINET) (see: www.arl.org/newsltr/217/waters.html).
Virginia Tech has been an important participant in OAI. Among its other OAI activities, Virginia Tech has applied OAI to its Networked Digital Library of Theses and Dissertations (NDLTD), allowing any organization that participates in the NDLTD program to easily become an OAI data provider. Other key institutions involved in OAI include Harvard University, Cornell University, OCLC, the Library of Congress, the Joint Information Systems Committee of the UK, and many others.
Given the current broad base of support by very influential organizations and institutions, the Open Archives Initiative is on track toward becoming a very important part of the digital library infrastructure.
Resources:
The Open Archives Initiative maintains a web site at www.openarchives.org. Go here for definitive information, including the documents defining the OAI protocol, FAQs, and links to supporters and implementers.
For a very thorough and insightful treatment of the topic, read Clifford Lynch's “Metadata harvesting and the Open Archives Initiative” in the ARL Bimonthly Report 217, August 2001, pp. 1-9, available on the web at: www.arl.org/newsltr/217/mhp.html.