Library Technology Guides


Semantic Structure enhances Discovery of Library Resources

Computers in Libraries [October 2014] The Systems Librarian


It has been interesting to track how the online catalogs, discovery services, and other tools that libraries use to provide access have evolved in recent years. Over the course of my career, I have seen continual change. My earliest experience in this area involved the replacement of card catalogs with terminals connected to a mainframe running the NOTIS (Northwestern Online Total Integrated System) library management system. Text-based online catalogs were eventually displaced by those that offered graphical interfaces running on Microsoft Windows, which then gave way to web-based catalogs. Although the interfaces passed through many styles, their scope remained relatively static, addressing bibliographic and holdings records representing books and journal titles. A more recent phase saw the development of index-based or web-scale discovery services that expanded the scope of discovery to individual articles, book chapters, and other content objects. While the current slate of products isn't perfect, it now seems plausible that these discovery services will be able to represent the totality of a library's collection, with diminishing gaps in coverage.

Even as library discovery services become more comprehensive, the challenge of providing access to library materials remains far from solved. These discovery services have become indispensable tools, but they are effective only when library patrons know to come to the library's website to use them, as many do. Unfortunately, only a fraction of research takes place this way. Library patrons usually begin their research by searching through Google or one of the other internet search engines. I think that it is important for libraries to continue to improve their discovery tools. It's also necessary to explore and exploit any available means to make library collections more findable to those who do not initiate their research with library-provided interfaces. To the extent possible, library resources should be findable on the open web and in the broader universe of information.

I have had a long-standing interest in applying the techniques of search engine optimization (SEO) to library materials. The same practices followed by those who are promoting products to sell can be used to draw attention to cultural and scholarly resources available in libraries. These methods seem especially well-suited for the unique materials within a library's collections. In this month's column, I'll review some well-established SEO techniques and discuss some more recent approaches that improve discoverability by layering semantic structures into the presentation of library resources.

Basic SEO

A variety of standard SEO techniques can help provide better exposure to library content. These techniques can be applied to any page delivered on a library website, including collection items presented through CMSs or similar platforms. The following are some examples of basic SEO practices.

Each resource should be associated with a stable and simple URL. Many CMSs may use complex URLs that include unpredictable session keys that are not persistent. These complex URLs are confusing to humans and likewise can confound the bots that collect content for search engines. In most systems that use these session keys, it is also possible to reference pages through a cleanly structured URL.
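As a rough illustration of the idea of a clean URL, the following Python sketch strips session-style parameters from a page address. The parameter names in SESSION_PARAMS and the example.org address are assumptions made for illustration; a real CMS will use its own session keys.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical session parameter names; substitute whatever your CMS emits.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}


def clean_url(url: str) -> str:
    """Return the URL with session-style query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # The fragment is dropped so that one resource maps to one address.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(clean_url("https://library.example.org/record?id=19938&sid=abc123"))
```

A function along these lines could be used when generating links in page templates, so that the address a search engine sees is the persistent one.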

Another basic technique involves providing the search engines with a definitive listing of the unique resources available. This listing can be accomplished by producing a simple XML document structured according to the sitemap protocol. Search engines periodically visit a website to harvest pages to be indexed. These search engine bots traverse the site, attempting to ferret out all of the pages. Especially for sites based on a CMS or another dynamic process, search bots may not be able to easily identify all the unique pages available. Some sites use sequences of numbers or text to tie all the activities a user performs on a site into a defined session. These session keys can result in the harvesting bots thrashing through thousands of copies of the same page. The sitemap protocol provides a mechanism for a website manager to provide a systematic listing of all the unique pages available, including optional information regarding when each page was last updated and its relative importance. Some CMSs can automatically generate a sitemap.xml document, or a separate script can be developed to produce and periodically update this critical file. The name and location of the sitemap file can be placed in the robots.txt document of a web server, and it can be manually submitted to Google through its webmaster's console.
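The separate script mentioned above might look something like the following Python sketch. It assumes the list of pages comes from the CMS database; the page addresses, dates, and priorities shown are hypothetical, and the namespace is the one defined by the sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from dicts with 'loc' and optional 'lastmod', 'priority'."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        # lastmod and priority are optional in the protocol.
        for field in ("lastmod", "priority"):
            if field in page:
                ET.SubElement(url, field).text = str(page[field])
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical records standing in for rows pulled from a CMS database.
pages = [
    {"loc": "https://library.example.org/item/1", "lastmod": "2014-10-20", "priority": 0.8},
    {"loc": "https://library.example.org/item/2"},
]
print(build_sitemap(pages))
```

The output would typically be written to sitemap.xml at the root of the site and then advertised in robots.txt with a line such as "Sitemap: https://library.example.org/sitemap.xml" (again, a hypothetical address).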

Each resource page should carry the basic metadata tags that give the search indexes essential information. The two most critical are the [title] (which names the page as it is presented in many different contexts) and the [description] (which provides a concise summary of the resource). It is important to assign a unique [title] to each page rather than use a generic descriptor. Descriptions should include the keywords that best capture the significance of the resource. Most search engines will give these words special consideration as they index the page.
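A template might fill in these two tags from each record, as in this minimal Python sketch. It assumes that [title] refers to the HTML title element and [description] to the description meta tag; the function name and sample values are hypothetical.

```python
from html import escape


def head_metadata(title: str, description: str) -> str:
    """Render a title element and a description meta tag for one resource page."""
    return (
        f"<title>{escape(title)}</title>\n"
        f'<meta name="description" content="{escape(description, quote=True)}" />'
    )


# Values drawn from the record itself, so every page gets its own title.
print(head_metadata(
    "Semantic Structure Enhances Discovery of Library Resources",
    "Column on SEO and semantic markup for library collections.",
))
```

Escaping the values keeps stray quotation marks or ampersands in a record from breaking the page header.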

It is essential that any text provided through metadata tags matches the actual content of the page. Search engines will penalize any site that attempts to misrepresent pages using unrelated embedded metadata. Some less ethical sites may attempt to dupe search engines into indexing their sites simply to increase click counts or to present unsavory content.

For resources describing documents--such as articles, reports, or book chapters--additional metadata fields can be used to provide citation details. These metadata fields are especially useful for inclusion in Google Scholar. Such metadata tags appear in the header of the document and are not part of the body where the content of the resource is presented. Metadata tags following Dublin Core are often used to provide these citation details. In recent years, however, the guidelines for Google Scholar have recommended using those defined by HighWire Press, which offer a more precise and structured approach to providing citation details. The following example was extracted from the header of a page that presents one of my recent CIL columns:

<meta name="citation_title" content="Balancing the Management of Electronic and Print Resources" />
<meta name="citation_author" content="Breeding, Marshall" />
<meta name="citation_publication_date" content="2014/06/01" />
<meta name="citation_journal_title" content="Computers in Libraries" />
<meta name="citation_volume" content="34" />
<meta name="citation_issue" content="05" />
<meta name="citation_firstpage" content="19" />
<meta name="citation_lastpage" content="21" />
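Tags like these can be produced from a bibliographic record rather than written by hand. The following Python sketch shows one way this might be done; the record layout and the function name are assumptions, and emitting one citation_author tag per author in a list reflects my understanding of the Google Scholar guidelines rather than anything stated above.

```python
from html import escape


def citation_meta(record: dict) -> str:
    """Render citation_* meta tags from a record keyed by field name."""
    lines = []
    for field, value in record.items():
        # A list value (for example, several authors) yields one tag per entry.
        values = value if isinstance(value, list) else [value]
        for item in values:
            lines.append(
                f'<meta name="citation_{field}" content="{escape(str(item))}" />'
            )
    return "\n".join(lines)


record = {
    "title": "Balancing the Management of Electronic and Print Resources",
    "author": ["Breeding, Marshall"],
    "publication_date": "2014/06/01",
    "journal_title": "Computers in Libraries",
    "volume": 34,
}
print(citation_meta(record))
```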

Enhancing Resources With Semantic Structure

In addition to the standard SEO techniques highlighted previously, it is also possible to increase the impact of resource pages delivered on the web by adding a layer of semantic structure to the page. The general idea involves delivering webpages in a way that includes structures so that other computers can discern the meaning of objects in addition to the coding used to present the page for human consumption. This layering of semantic meaning can be accomplished through microformats or microdata--which have been around for almost a decade--or specifically through a more recently developed set of structures defined through schema.org. Google also refers to this technique as "rich snippets."

The addition of semantic structure to a page provides opportunities for the content to be reused and referenced in additional contexts. It can associate or link content in the page to related resources in relationships that would otherwise not be apparent. The use of schema.org can be considered a small step into the realm of the semantic web, where pages embody machine-actionable content as well as visual presentation.

Unlike the metadata tags mentioned previously--providing the title, description, and citation data in the header of the page--semantic structures are intermingled with the presentation of the content in the page body. The basic idea involves surrounding the presentational coding in HTML with additional tags, usually <div> or <span> envelopes that provide structured semantic meaning.

A page presented entirely for human use would use HTML tags and cascading style sheets (CSS) to present information. An address, for example, is represented as a string of text that a person would be able to recognize. Even though it might be labeled in text as "address," most computer-based processes would have difficulty understanding it as an address, much less parsing its individual components. The following is an example address presented only through HTML:

<p>615 Church Street<br /> Nashville, Tennessee<br /> 37219<br /> United States</p>

By adding a layer of coding that does not affect the presentation when viewed by a person through a web browser, a string of text such as an address is enriched with semantic meaning that can be processed by a computer. The following example, taken from a library listing on Library Technology Guides, presents a snippet of code representing the same address with the additional coding that gives it semantic structure:

<div itemscope="itemscope" itemtype="">
  <div itemprop="address" itemscope itemtype="">
    <p>Address:
      <span itemprop="streetAddress">615 Church Street<br /></span>
      <span itemprop="addressLocality">Nashville</span>
      <span itemprop="addressRegion">Tennessee</span><br />
      <span itemprop="postalCode">37219</span><br />
      <span itemprop="addressCountry">United States</span>
    </p>
  </div><!--end microdata itemprop address div-->
</div><!--End microdata itemscope Library div-->
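A display template could generate this kind of markup from database fields. The Python sketch below is one possible shape for such a template. Note that the itemtype values in the listing above are empty; the two schema.org type URLs used here are my assumption of what they would be, and the function name is hypothetical.

```python
from html import escape

# Assumed itemtype URLs; the source listing leaves both values empty.
LIBRARY_TYPE = "http://schema.org/Library"
ADDRESS_TYPE = "http://schema.org/PostalAddress"


def address_microdata(street, locality, region, postal_code, country):
    """Wrap the parts of an address in microdata spans (illustrative template)."""

    def span(prop, value):
        return f'<span itemprop="{prop}">{escape(value)}</span>'

    return "\n".join([
        f'<div itemscope itemtype="{LIBRARY_TYPE}">',
        f'  <div itemprop="address" itemscope itemtype="{ADDRESS_TYPE}">',
        f'    <p>Address: {span("streetAddress", street)}<br />',
        f'    {span("addressLocality", locality)}, {span("addressRegion", region)}<br />',
        f'    {span("postalCode", postal_code)}<br />',
        f'    {span("addressCountry", country)}</p>',
        "  </div>",
        "</div>",
    ])


print(address_microdata(
    "615 Church Street", "Nashville", "Tennessee", "37219", "United States"
))
```

Because the spans carry no styling of their own, the address looks the same in a browser as the plain HTML version.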

Although the coding is a bit more complex, it is not difficult to program a content management environment to add these structures to the template it uses to display a record from an underlying database. The result of the new coding can be seen through the Structured Data Testing Tool ( webmasters/tools/richsnippets) offered by Google, or through other utilities programmed to understand schema.org or other microdata formats.
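To give a sense of what such a utility does, the following Python sketch pulls the itemprop values out of a fragment of marked-up HTML using only the standard library. It is a deliberately simplified reader, not a full microdata parser: it ignores itemtype and nested item scopes, and the class name is hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so must not be tracked as open.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ItemPropExtractor(HTMLParser):
    """Collect the text found directly inside elements that carry itemprop."""

    def __init__(self):
        super().__init__()
        self.props = {}
        self._open = []  # one entry per open element: its itemprop name or None

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._open.append(dict(attrs).get("itemprop"))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._open and self._open[-1]:
            name = self._open[-1]
            self.props[name] = self.props.get(name, "") + text


snippet = (
    '<div itemprop="address" itemscope><p>Address: '
    '<span itemprop="streetAddress">615 Church Street<br /></span> '
    '<span itemprop="addressLocality">Nashville</span></p></div>'
)
parser = ItemPropExtractor()
parser.feed(snippet)
print(parser.props)
```

The label "Address:" is skipped because it sits outside any span with an itemprop, which is exactly the distinction between presentation text and structured values.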

Adding these semantic layers brings advantages, especially in better representing the item in search engines and in the growing realm of linked data. Pages constructed in this way are able to participate in information environments not just as an assembly of keywords, but also with a structure that exposes the meaning of their key components. Search engines that understand a string of text as an address can then automatically make use of that data in a context that benefits from localization. In the library listings, I also provide the geographic coordinates of libraries through the GeoCoordinates schema so that other resources can determine more precisely where to place the entity on a map. Schema.org includes schemas for Creative Work, Event, Organization, Person, Place, Product, Review, and Action--each of which has many subordinate schemas, providing a large range of structured vocabularies from which to draw in enriching resource pages for any type of library-related content.

Channeling Patrons to the Library

Many of the techniques to maximize content on the web emerged to optimize consumption of commercial content. Any organization that depends on finding its customers on the web survives or not by its effective use of SEO techniques. Schema.org and other microdata formats emerged to provide better structure for consumer catalogs.

Libraries likewise need to take advantage of any available technique that will help provide stronger exposure of their resources on the open web. Library resources should be discoverable to the extent possible when our patrons search on the open web. It is a major omission when those resources remain invisible to our communities except when they happen to remember to visit the library or its website directly. Through SEO and semantic web technologies, libraries can work toward better discovery in the broader information universe. Exposing library resources in this way has great potential for increasing their use overall and for funneling patrons into the library's virtual presence.

Exposing library content on the web isn't necessarily a replacement for a comprehensive discovery service addressing library resources. It seems unlikely that, even with more techniques for enhancing resources with semantic structure and providing enhanced information for general indexing, a general web search can fulfill more advanced research needs. However, these methods for creating higher exposure on the web create a funnel for channeling patrons back to the library and the specialized tools that it provides for resource discovery. Enhancing resources with semantic content also provides the opportunity for incorporating your library's resources into the information ecosystems that might later be established by the broader library community and by those in surrounding specialized disciplines.

View Citation
Publication Year:2014
Type of Material:Article
Language: English
Published in: Computers in Libraries
Publication Info:Volume 34 Number 08
Issue:October 2014
Publisher:Information Today
Series: Systems Librarian
Place of Publication:Medford, NJ
Notes:Systems Librarian Column
Record Number:19938
Last Update:2023-09-22 09:55:55
Date Created:2014-10-20 11:12:49