Understanding the Protocol for Metadata Harvesting of the Open Archives Initiative

Computers in Libraries [September 2002]

An issue of Computers in Libraries that's dedicated to protocols wouldn't be complete without a discussion of the Open Archives Initiative Protocol for Metadata Harvesting. This protocol provides the basis for an information discovery environment that relies on transferring metadata en masse from one server to another in a network of information systems.

One of the fundamental needs that arises as digital collections proliferate is finding ways for end-users to discover what resources are available. The Open Archives Initiative (OAI) has focused on developing a framework in which metadata can be harvested from multiple digital library collections, or "repositories," and then loaded into a centralized service. The result is a database of metadata records built from multiple, related repositories. Then, as users search the new database and select metadata records that interest them, the service provides links that allow them to view the corresponding object on the original repository.

The Open Archives Initiative began only a few years ago, but has already made a major impact in the digital library arena. While OAI can still be considered a nascent protocol, most digital library projects are already working toward adding OAI capabilities to their systems, or are at least planning on how they might do so in the future. OAI has gained tremendous momentum. Organizations that have expressed strong levels of support for OAI include the Digital Library Federation, the Association of Research Libraries, The Andrew W. Mellon Foundation, and the Coalition for Networked Information.

How the OAI Model Works

In the vocabulary of Open Archives Initiative, servers (and their owners) can be considered either "data providers" or "service providers."

Data providers are the primary owners and maintainers of information; they operate a "repository" of some type. To be an OAI data provider, a repository must run a software application that allows its metadata to be requested in bulk by other participants in the Open Archives Initiative. That is, the repository operates a software module on its server that delivers metadata records upon request, following the rules specified in the Protocol for Metadata Harvesting (PMH). It does not have to make its information objects themselves available for harvesting, but only the metadata that describe them. In the case of a repository of full-text journal articles, for example, the data provider would offer up citation records that describe the articles, but not the articles themselves.

Service providers are the groups/servers that harvest the metadata (from data providers) and organize it so that users can then easily search a large database of related information that's come from many different sources.

The Protocol for Metadata Harvesting can be used with a wide range of information resources. It is not limited to any given scholarly discipline and is designed to work with any format of information. OAI can be used for physical collections and digital collections. In the digital arena, the objects can be text, still images, sound, animation, full-motion video, or anything else. Since PMH works with metadata, and since there are metadata formats for all types of information, its potential universe of applicability is very broad.

The three figures illustrate the basic approach of the OAI-PMH. In the first figure we see that the service provider harvests metadata records from multiple repositories. Figure 2 illustrates that end-users can then search the service provider's database that aggregates the metadata from multiple data providers. In Figure 3 we see that the users can then view the original documents (or other digital objects) by virtue of the links provided in the metadata records retrieved from the service provider. Now that I've explained the basic concept, I can give you more details about the steps of this process.

Data Providers: The Open Archives Initiative seeks to tap the resources of existing repositories, providing the means to help end-users discover the items of information available to them. While there may be some repositories created specifically for use within the Open Archives Initiative, most data providers already exist and use the OAI-PMH as a means to increase their impact.

Any information resource can become an OAI data provider by installing software that makes its metadata available to harvesters.

Being a data provider in OAI is entirely voluntary. Unless the data provider explicitly loads an OAI interface, no harvester can obtain its metadata using PMH. This is unlike the general Web, where software robots can extract pages without having to obtain permission from the server's operator.

A data provider will configure its OAI interface so that it will not have a negative impact on its existing services. Providers would likely allow large harvesting sessions to take place only during off-peak hours, or they might operate the OAI software on a separate server from their production systems.
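As a rough illustration of the off-peak idea, here is a minimal sketch of an admission check a data provider might run before serving a harvester. The function name, the 1:00 to 5:00 window, and the one-hour retry interval are all invented for the example. Answering with HTTP status 503 and a Retry-After header is one way a server can ask a client to come back later, and I am assuming that approach here rather than describing any particular provider's setup.

```python
from datetime import time


def harvest_admission(now, off_peak_start=time(1, 0), off_peak_end=time(5, 0),
                      retry_after_seconds=3600):
    """Return (status_code, headers) for an incoming harvest request.

    The window and retry interval are invented defaults, not protocol values.
    The window is assumed not to wrap around midnight.
    """
    if off_peak_start <= now < off_peak_end:
        return 200, {}
    # Outside the window: ask the harvester to try again later.
    return 503, {"Retry-After": str(retry_after_seconds)}


night = harvest_admission(time(2, 30))
afternoon = harvest_admission(time(14, 0))
print(night, afternoon)
```

A provider that runs its OAI software on a separate server, the other option mentioned above, would not need a check like this at all.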

Service Providers: Basically, service providers use a PMH "harvester" to acquire metadata, which they then use to create information resources or portals of interest to a community of users. A typical service provider will make arrangements with multiple data providers. The service provider's harvester will interact with each data provider's repository, using the PMH to extract its metadata records. An OAI-PMH harvester is analogous to the Web-crawling robots that systematically retrieve and index HTML documents for Web search engines.

The harvested metadata records are loaded into a database that's managed by the service provider. The service provider will then offer a search-and-retrieval interface, allowing end-users to search the metadata records that it has gathered from a variety of data providers. The metadata records will typically include links that allow the user to view the full documents that reside on the original repositories.
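To make the service provider's role concrete, here is a toy sketch of such a database, held in memory. The class names, field names, repository names, and URLs are all hypothetical. A real service would use a proper database and search engine, but the shape is the same: load records from several repositories, search them in one place, and hand back links to the originals.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """One harvested record; the field names are illustrative only."""
    repository: str   # data provider the record came from
    identifier: str   # unique within that repository
    title: str
    url: str          # link back to the object on the original repository


@dataclass
class ServiceProviderIndex:
    """Toy in-memory stand-in for the service provider's database."""
    records: dict = field(default_factory=dict)

    def load(self, harvested):
        # Key on (repository, identifier) so that re-harvesting a changed
        # record replaces the old copy instead of duplicating it.
        for rec in harvested:
            self.records[(rec.repository, rec.identifier)] = rec

    def search(self, term):
        # Naive case-insensitive match on the title.
        needle = term.lower()
        return [r for r in self.records.values() if needle in r.title.lower()]


index = ServiceProviderIndex()
index.load([
    MetadataRecord("physics-preprints", "0001", "Quark mixing revisited",
                   "http://physics.example.org/0001"),
    MetadataRecord("math-preprints", "0001", "Mixing times of Markov chains",
                   "http://math.example.org/0001"),
])
hits = index.search("mixing")
for hit in hits:
    print(hit.repository, hit.url)
```

Note that the two sample records share the identifier "0001" without colliding, because an identifier only has to be unique within its own repository.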

Information services can use the PMH in a variety of ways. Any existing service can use PMH to enhance its services. A service can receive metadata from multiple sources. It might create some metadata locally, obtain some types of data through existing file transfer protocols, and use OAI to get it from others.

Value-Added Services: The resources offered by service providers add a new dimension of value beyond that available in the individual repositories. The most obvious way to add value to repositories lies in aggregation. Individual repositories each tend to be focused on a narrow area of interest within a discipline. But a single service provider can harvest metadata from multiple repositories to create an information resource that spans many specialties within the discipline. End-users benefit in that they can now search one resource to discover information within their disciplines without having to search all the repositories individually. This relieves end-users of having to know a priori about all the individual repositories that might interest them; instead, they can take advantage of pre-built aggregations created by service providers.

Keep in mind that the service provider offers only the metadata from its contributing data providers. In most cases, the user will search the service provider's server, and will link to the original data provider's server to view the actual information.

Metadata: PMH works with metadata, but in a very flexible way. The standard specifies that at a minimum, data providers must offer metadata in unqualified Dublin Core (see http://www.dublin...). One of the few stipulations is that each metadata record must have an identifier that is unique within the repository. While the standard specifies unqualified Dublin Core, groups of data providers and service providers can agree to use other metadata formats that may be more appropriate to their disciplines or content types. A number of discipline-specific metadata formats have been used in OAI systems, including MARC, EAD (Encoded Archival Description), TEI (Text Encoding Initiative), IMS (Instructional Management System, used in educational environments), and others.
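For readers who have not seen one, here is a small sketch of reading an unqualified Dublin Core record in Python. The sample record is invented. The namespace URIs are the ones published for Dublin Core and for the OAI-PMH 2.0 "oai_dc" container, as I understand them. The function name is my own. In unqualified Dublin Core every element is optional and repeatable, so the sketch collects each element's values in a list.

```python
import xml.etree.ElementTree as ET

# Namespace URI for the Dublin Core elements (assumed from the published schema).
DC_NS = "http://purl.org/dc/elements/1.1/"

# Invented sample record; the values are illustrative only.
SAMPLE = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample article title</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Roe, Richard</dc:creator>
  <dc:date>2002-06-14</dc:date>
  <dc:identifier>http://repository.example.org/articles/42</dc:identifier>
</oai_dc:dc>
"""


def parse_dublin_core(xml_text):
    """Return a dict mapping each Dublin Core element name to a list of values."""
    root = ET.fromstring(xml_text.strip())
    fields = {}
    for child in root:
        # Tags look like '{namespace}title'; keep only Dublin Core elements.
        if child.tag.startswith("{" + DC_NS + "}"):
            name = child.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((child.text or "").strip())
    return fields


record = parse_dublin_core(SAMPLE)
print(record["title"], record["creator"])
```

The sample shows only the descriptive part of a record. The identifier that must be unique within the repository travels separately from these descriptive fields.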

The Open Language Archives Community (OLAC; http://www.language-archives.org), a group of institutions working to develop a worldwide digital library of language resources, is a typical example. These institutions find that unqualified Dublin Core is inadequate for their needs, and have created a metadata format tailored for their type of information. This group has developed the OLAC Metadata Set for the OAI Metadata Harvesting Protocol (OLAC-MS).

If you're wondering about OAI costs, no rules are specified as to whether the service providers can charge or not. "Open" is not synonymous with "free." The fact that metadata are exchanged using open standards has no bearing on whether the information is for fee or for free.

Here's Exactly How the OAI-PMH Transfers Data

The designers of PMH intended for it to be a very simple protocol that can be implemented with a minimal amount of programming. Most implementers report that an existing repository has the ability to become an OAI data provider using PMH with only a few hours of programming. Becoming a service provider is somewhat more technically challenging, but can still be accomplished with a moderate investment of technical effort.

The Open Archives Initiative provides a set of open source program modules to assist those who are interested in being data providers or service providers. Program code examples for implementing the OAI-PMH are available in Perl, Java, C++, and other programming languages. (See tools.html.)

The protocol specifies a set of "requests" and "responses" related to the process of transferring records from a data provider to a service provider. The protocol takes advantage of existing lower-level network protocols already in wide use. PMH transmits its requests and responses over HTTP, the same protocol used for the Web.

PMH is designed to move metadata records from a data provider to a service provider. It is not designed for interactive search queries. It does not support the selection of records by author, topic, or other qualifiers appropriate for online searching. Rather, it consists of requests and responses that identify and download metadata records in bulk, based on dates and simple sets. A data provider can treat its collection of metadata records as a single whole or can divide it into sets.
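Seen from the data provider's side, selection by dates and sets amounts to a simple filter. The sketch below is one way to picture it. The record structure, function name, and sample data are hypothetical, and I am assuming that both ends of the date range are inclusive.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class StoredRecord:
    identifier: str
    datestamp: date     # when the record was created or last changed
    sets: frozenset     # names of the sets the record belongs to (may be empty)


def select_records(records, from_date=None, until_date=None, set_spec=None):
    """Return the records that fall within the date range and the named set."""
    selected = []
    for rec in records:
        if from_date is not None and rec.datestamp < from_date:
            continue
        if until_date is not None and rec.datestamp > until_date:
            continue
        if set_spec is not None and set_spec not in rec.sets:
            continue
        selected.append(rec)
    return selected


store = [
    StoredRecord("rec-1", date(2002, 1, 10), frozenset({"physics"})),
    StoredRecord("rec-2", date(2002, 5, 2), frozenset({"math"})),
    StoredRecord("rec-3", date(2002, 7, 1), frozenset({"physics"})),
]
everything = select_records(store)
recent_physics = select_records(store, from_date=date(2002, 6, 1), set_spec="physics")
print([r.identifier for r in recent_physics])
```

There is no author or subject test in the filter, which is the point: anything resembling a search happens later, on the service provider's database.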

PMH specifies a set of requests that allows a harvester to interact with a data provider. The ultimate goal is to harvest records from a data provider en masse in an initial session and to request only new and changed records in subsequent visits. Some of the requests and responses deal with identifying the data provider and what sets and metadata formats it supports. Requests related to record delivery can provide optional parameters for timestamps and sets to perform selective harvesting.
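A harvester's first visit and a later, selective visit differ only in the optional parameters they send. The following sketch builds both request URLs. The base URL and set name are hypothetical. The argument names (verb, metadataPrefix, from, until, set) follow version 2.0 of the protocol as I understand it, and the function itself is my own illustration rather than part of any OAI toolkit.

```python
from datetime import date
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     until_date=None, set_spec=None):
    """Build a ListRecords URL, adding selective-harvesting arguments if given."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date.isoformat()
    if until_date is not None:
        params["until"] = until_date.isoformat()
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


BASE = "http://repository.example.org/oai"  # hypothetical data provider

# Initial session: ask for everything.
first_visit = list_records_url(BASE)

# Later session: only records added or changed since the last harvest, in one set.
later_visit = list_records_url(BASE, from_date=date(2002, 6, 14), set_spec="physics")

print(first_visit)
print(later_visit)
```

In practice the harvester would store the date of each successful harvest and pass it as the from value on its next visit.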

PMH stays true to its ideal of simplicity by specifying only six requests. Those interested in a detailed explanation of the requests and responses should study the original protocol document at http://www protocol.htm. In general terms, the requests are as follows:

  1. GetRecord: Requests that a particular metadata record be transferred from the data provider's repository to the harvester.
  2. Identify: The harvester asks the data provider to send information identifying itself.
  3. ListIdentifiers: The harvester asks the data provider to provide the list of record headers that satisfy the request.
  4. ListMetadataFormats: The harvester asks the data provider to reveal what metadata formats it supports.
  5. ListRecords: Used to harvest records from the repository. The harvester can specify the timestamp values and sets for selective harvesting.
  6. ListSets: The harvester asks the data provider what sets are defined within the repository.
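Because the six requests travel over the Web's own protocol, each one boils down to a URL with a verb parameter and a few arguments. Here is a sketch that builds such URLs and rejects requests with missing or unexpected arguments. The table of required and optional arguments reflects my reading of version 2.0 of the protocol. The base URL and record identifier are hypothetical, and the resumption-token argument is left out to keep the example short.

```python
from urllib.parse import urlencode

# verb -> (required arguments, optional arguments); resumption tokens omitted.
VERBS = {
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
    "Identify": (set(), set()),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListSets": (set(), set()),
}


def build_request(base_url, verb, args=None):
    """Return the request URL for a verb, checking its arguments first."""
    args = dict(args or {})
    if verb not in VERBS:
        raise ValueError(f"unknown verb: {verb}")
    required, optional = VERBS[verb]
    supplied = set(args)
    missing = required - supplied
    unexpected = supplied - required - optional
    if missing:
        raise ValueError(f"{verb} is missing arguments: {sorted(missing)}")
    if unexpected:
        raise ValueError(f"{verb} does not accept: {sorted(unexpected)}")
    return base_url + "?" + urlencode({"verb": verb, **args})


BASE = "http://repository.example.org/oai"  # hypothetical data provider
identify_url = build_request(BASE, "Identify")
get_record_url = build_request(BASE, "GetRecord", {
    "identifier": "oai:repository.example.org:42",  # invented identifier
    "metadataPrefix": "oai_dc",
})
print(identify_url)
print(get_record_url)
```

A harvester would typically start with Identify, ListMetadataFormats, and ListSets to learn what a repository offers, then move on to ListRecords for the bulk transfer.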

A harvester's initial visit to a data provider may involve the transfer of a massive number of records. One of the concerns with the PMH model involves how a service provider can obtain large numbers of metadata records from a data provider without overburdening the system. The way that metadata records are transferred remains under the control of the data provider. PMH takes into consideration that the data provider will have preferences regarding when it will respond to a harvester and how many records it will deliver at a time. PMH includes a control mechanism called a "resumption token." At any time, a data provider's server can return an incomplete set of records in response to a request, issuing a resumption token that the harvester presents in a later request to pick up where the previous response left off.
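From the harvester's point of view, the resumption token turns one large transfer into a loop of smaller ones. The sketch below shows that loop under a few assumptions. The fetch function is injected so the example runs without a network. The sample responses are invented and heavily simplified (real records also carry datestamps and metadata). I am following the version 2.0 convention, as I understand it, that a request carrying a resumption token carries no other arguments besides the verb and that an empty token marks the end of the list.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch, metadata_prefix="oai_dc"):
    """Yield record identifiers, following resumption tokens to the end.

    `fetch` is any callable that takes a dict of request arguments and
    returns the response XML as text.
    """
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(args))
        for header in root.iter(OAI_NS + "header"):
            yield header.findtext(OAI_NS + "identifier")
        token = root.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one: the list is complete
        # Continue with the token alone; the original arguments are not resent.
        args = {"verb": "ListRecords", "resumptionToken": token.text.strip()}


def page(identifiers, token):
    """Build a simplified, invented ListRecords response for the demo."""
    records = "".join(
        f"<record><header><identifier>{i}</identifier></header></record>"
        for i in identifiers)
    token_xml = "" if token is None else f"<resumptionToken>{token}</resumptionToken>"
    return ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
            f"<ListRecords>{records}{token_xml}</ListRecords></OAI-PMH>")


# Two canned responses: the first is incomplete, the second ends the list.
PAGES = {
    None: page(["oai:example:1", "oai:example:2"], "batch-2"),
    "batch-2": page(["oai:example:3"], ""),
}


def fake_fetch(args):
    return PAGES[args.get("resumptionToken")]


harvested = list(harvest(fake_fetch))
print(harvested)
```

The data provider decides how many records go into each batch, so the harvester's loop stays the same whether a repository sends ten records at a time or ten thousand.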

Some Historical Background

While the OAI is now considered a framework that's applicable to almost any type of information system, its origins lie in a much narrower community: that of the pre-print servers, such as those that became popular in the discipline of high-energy physics. These pre-print servers emerged as an alternative electronic publishing model that allowed researchers to read scientific papers in their areas of interest in a much more timely way than is possible with traditional printed scholarly journals. Pre-print servers such as arXiv (http://...), though run largely informally, quickly became the primary way that researchers in some scientific disciplines engaged in scholarly communication.

As pre-print servers proliferated, researchers in a given field had to keep track of which ones existed and what types of articles were available on them. To stay current, a scientist would need to check multiple pre-print servers frequently. Of course, it would be better if a researcher could rely on a single system to reveal the content of all the various pre-print servers within a discipline! This basic need is for "federated searching": the ability to enter a query once that will search multiple information systems and return a unified result set with links back to the original documents. This was the problem that interested Herbert Van de Sompel, who became one of the main architects of the Open Archives Initiative.

OAI traces its beginning to a 1999 meeting in Santa Fe, New Mexico, where a group of scholars, computer scientists, and librarians discussed the problem of federated searching in the pre-print arena. Here they conceived the basic principles that underlie the current Open Archives Initiative. While the specific implementation details specified in the "Santa Fe Convention" were revised somewhat in subsequent meetings, the basic model has remained constant. It revolves around architecture for federated searching where metadata are gathered from multiple repositories in advance and placed on a central service.

The Santa Fe Convention focused on the specific needs of the pre-print community. Those who were involved quickly realized, however, that this approach could easily be generalized for almost any digital library environment, regardless of subject discipline or format of content.

The movement adopted the name "Open Archives Initiative," displacing the name "Santa Fe Convention," in 2000. That name hasn't been without controversy. The term archives, among preservationists, refers to facilities or organizations dedicated to the long-term maintenance of materials. In the context of OAI, the term is used very loosely. Any server that offers information can be a repository, and OAI mandates no specific provisions for long-term preservation. Though the name Open Archives Initiative continues, the protocol was termed Protocol for Metadata Harvesting, avoiding the complications associated with the term "archive."

While I wasn't part of the Santa Fe meeting, I was present at the second meeting in June 2000. The initial version of the OAI protocol was formulated in that year. One of the fundamental issues in creating a new protocol involves providing a stable environment for those who want to develop applications. When OAI released version 1.0 of the protocol, the group specified that no major changes would be made for about a year. After that year, revisions would be incorporated into the protocol based on the input of implementers. Version 2.0 of the OAI protocol was just released on June 14, 2002.

The Administration of OAI

Every protocol or standard needs some type of organizational oversight. OAI is not currently under the jurisdiction of any of the major standards bodies such as the Internet Engineering Task Force (IETF) or the National Information Standards Organization (NISO). Rather, OAI has created its own organizational oversight group.

The Open Archives Initiative established a steering committee in August 2000, consisting of 12 individuals representing institutions with significant interests in the protocol. The steering committee establishes the strategic direction of the initiative. An executive committee, currently made up of Carl Lagoze and Herbert Van de Sompel, attends to the operational details of OAI. A technical committee was created in July 2001 to evaluate the current version of the protocol and to compile a new revision of the protocol.

The Open Archives Initiative, despite its short history, has already made a large impact in the realm of digital libraries and information retrieval. Those involved in maintaining information systems now have a very powerful mechanism at their disposal for acquiring metadata. Repository groups have another means of helping end-users discover their content. If your organization is involved in Web-based information systems, and you haven't already looked at OAI, you might do well to study the protocol and investigate whether you want to add it to your arsenal. In an information environment that's increasingly concerned with metadata, a tool that aids in the process of transporting and sharing metadata efficiently is a welcome asset.

Publication Year: 2002
Type of Material: Article
Language: English
Published in: Computers in Libraries
Publication Info: Volume 22 Number 08
Issue: September 2002
Publisher: Information Today
Place of Publication: Nashville, TN
Subject: Open archives