One of my projects that I've mentioned many times in this space is the Vanderbilt Television News Archive. In past columns, I've talked about how we use large-scale digital video technologies and described some of the techniques we've found for analyzing Web server logs to help us monitor activity. Recently, I've been looking for ways to attract new visitors to the archive's Web site. We implemented a scheme for unleashing the data held in our database of news abstracts so that it can be discovered on the open Web through search engines, especially Google. We had been a victim of what's often called the "invisible Web": information presented dynamically to users as they search a resource is hard to find from the broader Web because the search engines don't normally index it.
The Vanderbilt Television News Archive provides "TV-NewsSearch," a Web-accessible database that describes our collection of national news broadcasts. The database currently holds more than 800,000 records, most of which include a news summary describing the story: the event, the people involved, and the video shown. Our own search engine works well for finding items in the collection once you reach our site, but we also wanted to use the database's contents to attract more visitors.
Entrepreneurial Motivation
We're highly motivated to find new ways of bringing more activity to our site. The archive's mission is to preserve and provide access to national news broadcasts. The Vanderbilt Television News Archive is unique in many ways, and we work hard to ensure its continued operation. We function under a mandate to keep a balanced budget and must find ways to earn income to meet operational expenses. In this entrepreneurial environment, we know that an increase in visitors to our site can be leveraged to help keep us out of the red.
Part of the archive's income is from the service fees we charge for our videotape loan service. All items in our collection are available for loan on videotape. As users search the TV-NewsSearch database, they can select whole programs or individual items to be copied onto a videotape, which is then loaned to them for a limited time for viewing and research. (We don't sell or license material; we just charge a fee to cover our costs.) The fees work on a sliding scale, with a subsidized rate to users who are associated with Vanderbilt or one of our sponsoring institutions. The TV News Web site includes an e-commerce system that allows users to select the items they want to borrow and to pay for them online.
We hypothesized that videotape loan income would grow in some proportion to increased Web site activity. While activity on the site has always been brisk, it hasn't been overwhelming. We believed that the activity on the site would be higher (perhaps resulting in more videotape orders) if more people knew we existed. So we needed to help them find us.
Bring 'Em in Through Google
We all know that a huge number of Web users rely on Google. While other search engines should not be neglected, we focused our attention on Google first since it would have the biggest impact.
I began by exploring our Web server logs to see exactly how people find us. One of the most useful fields in a Web server's logs is the HTTP referrer (spelled "Referer" in the HTTP specification itself). It tells you which Web site and page the user was on when he or she clicked a link to get to your site. The following is a typical log entry:
2006-01-09 14:09:12 129.59.150.105 GET /index.pl - 80 - fl-71-0244-74.dyn.sprint-hsd.net Mozilla/5.0+(Macintosh;+U;+PPC+Mac+OS+X;+en)+AppleWebKit/312.1+(KHTML,+like+Gecko)+Safari/312 http://www.google.com/search?hl=en&q=television+news+search&btnG=Google+Search 200 0 0 11804
The HTTP referrer is http://www.google.com/search?hl=en&q=television+news+search&btnG=Google+Search.
We can tell that the user came to us from Google. We can even tell what search string he or she typed in by parsing the query string associated with the referring URL. The user typed "television news search," which is typical of what we found when we did an extensive analysis of the Web logs. "Vanderbilt university," "tv news," "tv archive," and "television news" are other common search terms. We interpreted this to mean that if users know they're looking for television news or already know we exist, Google works well to deliver them to us. But if they're looking for information on a particular person, topic, product, or event that might be within our collection, we were invisible to them.
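To give you a flavor of that kind of parsing, here's a minimal Perl sketch (not our production script) that pulls the search terms out of a Google referrer using the URI and URI::QueryParam modules from CPAN; the referrer shown is the one from the sample log entry above.

#!/usr/bin/perl
# Sketch: extract the search terms from a Google referrer URL.
use strict;
use warnings;
use URI;
use URI::QueryParam;

my $referrer = 'http://www.google.com/search?hl=en&q=television+news+search&btnG=Google+Search';

my $uri = URI->new($referrer);
if ( $uri->host =~ /google\./i ) {
    my $terms = $uri->query_param('q');   # decodes to "television news search"
    print "Google search terms: $terms\n";
}

Run across a month's worth of log lines, a few lines of parsing like these are what produced the lists of common search terms described here.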
Building an OpenWeb Site
To reach more potential users, we needed to expose the content of our database to the open Web and allow Google and other crawlers to harvest and index it. Yet we wanted to do this in such a way that we would retain control of our content. The goal was to seed as much information into Google as possible and to design a path that would lead users through the front door of our Web site. To achieve this, we decided to create a Web page for each of the news abstracts in our database (all 805,609 of them) and submit these pages for Google to index. We placed the static HTML pages on a separate Web site we call our OpenWeb.
We designed these derivative Web pages for two purposes. One was to carry payloads of text that would be harvested and indexed by the search engines. The other was to provide a one-way door into our Web site. We weren't interested in using these pages to replace our database. We wanted them to be funnels into it. Once users are in our TV-NewsSearch environment, they can request the item they discovered through the OpenWeb or search for others in our collection that might match their interests.
As shown in Figure 1, the static Web page that we produced for each record includes basic information about the news clip, tells users who we are, and displays a prominent button that users can click on to enter our site.
We needed a fast and easy way to create these pages and keep them up-to-date. My regular readers will know that I have a fondness for the Perl programming language. For this project, I developed a Perl script that systematically extracts records from the database to generate the Web pages. The script takes only a couple of hours to process the entire database. We refresh the pages every week so that the new records we add and the changes we make are reflected in the OpenWeb site. In addition to generating the pages themselves, the script creates an index in HTML that links them all together. While it would be possible for users to browse our site with this index, it was created primarily to guide the search engine crawlers to each of the pages on the OpenWeb.
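To give a sense of what such a generator looks like, here is a much-simplified Perl sketch. The DBI connection string, the table and column names (abstracts, id, air_date, summary), and the URLs are placeholders rather than our actual schema, and the real script does considerably more formatting.

#!/usr/bin/perl
# Sketch: generate one static HTML page per abstract plus a browse index.
# Connection string, table, column names, and URLs are hypothetical placeholders.
use strict;
use warnings;
use DBI;
use HTML::Entities;

my $dbh = DBI->connect( 'dbi:Pg:dbname=tvnews', 'user', 'password',
                        { RaiseError => 1 } );

mkdir 'openweb' unless -d 'openweb';
open my $index, '>', 'openweb/index.html' or die $!;
print {$index} "<html><body><h1>Abstract index</h1>\n";

my $sth = $dbh->prepare('SELECT id, air_date, summary FROM abstracts');
$sth->execute;

while ( my ( $id, $air_date, $summary ) = $sth->fetchrow_array ) {
    $summary = encode_entities($summary);
    open my $page, '>', "openweb/abstract$id.html" or die $!;
    print {$page} <<"HTML";
<html>
<head><title>Vanderbilt Television News Archive: item $id</title></head>
<body>
<h1>Vanderbilt Television News Archive</h1>
<p>Broadcast date: $air_date</p>
<p>$summary</p>
<p><a href="http://tvnews.example.edu/">Search the full collection</a></p>
</body>
</html>
HTML
    close $page;
    # Each page is also added to the index that guides the crawlers.
    print {$index} qq{<a href="abstract$id.html">Item $id</a><br>\n};
}

print {$index} "</body></html>\n";
close $index;
$dbh->disconnect;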
Google Sitemap Protocol
Soon we had more than 800,000 pages of rich content that were ready to be indexed. It would have been possible at this point just to submit the base URL to each of the search engines and hope for the best. But we wanted a more managed approach-especially for Google. When you have a very large number of URLs that you want the search engines to harvest, it's important to be sure that the spidering activities of the Web crawlers don't overwhelm your site. You also want to take advantage of any available means of describing your site for efficient harvesting.
Google allows Web site managers to create maps of their sites in XML. These sitemaps lay out a site's contents, indicate which pages are most important for indexing, and tell googlebot (Google's Web crawler) when each page was last updated. Google provides documentation on how to create these sitemaps and is careful to note that using the sitemap protocol will not improve a site's page rank; it simply helps ensure that all your pages are added to the Google index. And since the sitemap records when pages are updated, googlebot can focus its time on new and modified pages without performing wholesale harvesting on each pass. (See http://www.google.com/webmasters/sitemaps/docs/en/protocol.html.)
It was fairly easy to extend the script that generates the static HTML pages for the OpenWeb so that it could create all the XML sitemaps at the same time. Any given sitemap can contain no more than 50,000 URLs. Since our site greatly exceeds that number, I created a sitemap for each year's worth of abstracts. Google's protocol allows multiple sitemaps to be linked together in an index. Figure 2 shows a snippet from one of the sitemaps.
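A stripped-down sketch of the sitemap-writing half of the script might look like the following. The base URL and the page lists are placeholders that would normally come from the same database loop, and I've used the current sitemaps.org namespace, which may differ from the declaration we actually used at the time.

#!/usr/bin/perl
# Sketch: write one sitemap per year of abstracts, plus a sitemap index
# that ties them together. URLs and page lists are placeholders.
use strict;
use warnings;
use POSIX qw(strftime);

my $base    = 'http://openweb.example.edu';
my $lastmod = strftime( '%Y-%m-%d', localtime );
my %pages_by_year = (
    2004 => [ 'abstract700001.html', 'abstract700002.html' ],
    2005 => [ 'abstract800001.html' ],
);

my @sitemaps;
for my $year ( sort keys %pages_by_year ) {
    my $name = "sitemap$year.xml";        # keeps each file under the 50,000-URL limit
    push @sitemaps, $name;
    open my $fh, '>', $name or die "$name: $!";
    print {$fh} qq{<?xml version="1.0" encoding="UTF-8"?>\n},
                qq{<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n};
    for my $page ( @{ $pages_by_year{$year} } ) {
        print {$fh} "  <url>\n",
                    "    <loc>$base/$page</loc>\n",
                    "    <lastmod>$lastmod</lastmod>\n",
                    "  </url>\n";
    }
    print {$fh} "</urlset>\n";
    close $fh;
}

# The index file links the per-year sitemaps so Google can find them all.
open my $idx, '>', 'sitemap_index.xml' or die $!;
print {$idx} qq{<?xml version="1.0" encoding="UTF-8"?>\n},
             qq{<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n};
print {$idx} "  <sitemap><loc>$base/$_</loc></sitemap>\n" for @sitemaps;
print {$idx} "</sitemapindex>\n";
close $idx;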
Google provides an interface that lets you submit, manage, and monitor sitemaps; it's accessed through a Google Webmaster account. Once you log in, you can submit your sitemaps to Google, monitor how the service interacts with them, and see any errors it encounters. (See http://www.google.com/webmasters/sitemaps.)
One of the coolest features of the Webmaster account is the information it provides on how visitors get to your site. The service lists the "top search queries": the words that, when typed into Google, return pages from your site. It also shows the "top search query clicks": the words that not only caused your pages to be displayed in Google's results, but also prompted users to click through to your site.
Our Initial Findings
In order to compare the activity on the OpenWeb site to that on the regular TV News site, I also created a sitemap for TV News. It lists URLs for all the pages that describe the archive and its services, but it does not include anything for the dynamic content from TV-NewsSearch.
It's fascinating to see what kinds of words bring traffic into these sites. The regular TV News site tends to match only words related to "television," "news," and "Vanderbilt." With the OpenWeb site, almost anything can drive folks in. Today, for example, some of the top search queries were "dubai ruler death," "titanic sinking," "Hughes inheritance," and "gun violence Boston." The names of news anchors, celebrities, and others show up constantly as search words. Now we're seeing traffic on our site based on the content within our collection.
We submitted the base URL of our OpenWeb site to all the search engines in early August 2005 and shortly thereafter registered our sitemaps with Google. How long does it take to get your Web site indexed? We learned that it doesn't happen overnight. We monitored the logs of the OpenWeb and counted the number of times googlebot visited the site each day, and we saw some interesting patterns. Googlebot probed the site several times, grabbing the robots.txt file, the sitemap.xml file, and a few sample pages before it began harvesting the pages en masse. We also noticed a considerable delay before the harvested pages were available in the Google index. It took about 4 months from the original submission until most of our pages were actively available through Google.
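That daily counting doesn't take much code. A Perl sketch along these lines tallies googlebot's visits by day; the field positions are assumptions based on the sample log line shown earlier, not a description of our actual script.

#!/usr/bin/perl
# Sketch: tally googlebot requests per day from a W3C-style log.
# The date and user-agent field positions are assumed to match the
# sample log entry shown earlier.
use strict;
use warnings;

my %hits;
while ( my $line = <> ) {
    next if $line =~ /^#/;                       # skip W3C header lines
    my @fields = split ' ', $line;
    my ( $date, $agent ) = @fields[ 0, 9 ];      # assumed positions
    $hits{$date}++ if defined $agent && $agent =~ /Googlebot/i;
}

print "$_\t$hits{$_}\n" for sort keys %hits;

Fed a month's worth of log files on the command line, it prints a simple per-day tally that makes the harvesting pattern easy to see.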
The next challenge was earning page rank. Sites that have been recently submitted to Google have a low ranking and tend not to show up near the top of a results set. We continue to monitor how Google delivers users to our site and are on the watch for ways to improve how we're indexed. This page-ranking concern has spawned an industry called "search engine optimization," whose members try to increase the page rank of their clients' sites. Google works hard not to be influenced so that it can deliver search results based on the quality of the information rather than on commercial interests. Since we believe we have high-quality information, we're optimistic that our pages will work their way further toward the top over time.
Even Deeper Analysis
While the Google Webmaster account gives some interesting information on how people get to the Web site, we needed much more information to determine the level of impact produced by the OpenWeb. Although there are lots of ready-made applications for analyzing Web server logs, we had some pretty specific questions that would be hard to answer with an off-the-shelf package.
Again, Perl came to the rescue. I developed a script to analyze our Web server logs and to measure the interaction between the OpenWeb site and the regular TV News site. One of the items of interest was the number of visitors to our regular site who came from the static OpenWeb pages. We also captured Google's referrals to both sites, along with the search queries involved. To test our initial hypothesis about increasing income, the script also traced all of the videotape requests made each month to determine whether they originated from the OpenWeb.
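The heart of that kind of analysis is simply referrer classification. Here is a stripped-down sketch of the idea; the host names are placeholders, and the referrer's field position is again an assumption based on the sample log entry shown earlier.

#!/usr/bin/perl
# Sketch: classify referrals into the production site by their source.
# Host names are placeholders; the referrer field position is assumed
# to match the sample log entry shown earlier.
use strict;
use warnings;

my %count = ( openweb => 0, google => 0, other => 0 );

while ( my $line = <> ) {
    next if $line =~ /^#/;
    my @fields   = split ' ', $line;
    my $referrer = $fields[10];                  # assumed position
    next unless defined $referrer && $referrer ne '-';

    if    ( $referrer =~ m{^https?://openweb\.example\.edu}i ) { $count{openweb}++ }
    elsif ( $referrer =~ /google\./i )                         { $count{google}++ }
    else                                                       { $count{other}++ }
}

printf "From the OpenWeb pages: %d\nDirect from Google: %d\nOther referrers: %d\n",
       @count{qw(openweb google other)};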
What Did We Find Out?
As hoped, our OpenWeb project did result in additional Web site activity and an increase in videotape loan requests. For each month the OpenWeb site has been active, we have been able to document a significant level of activity in which the user begins with Google, finds a page on the OpenWeb, and enters our production Web site. The number of visits that follow this path is about three to four times the number that arrives at our site directly from Google. The analysis also revealed the number of instances in which the user begins on Google, hits our OpenWeb site, enters the production site, and subsequently places an order. From August 2005 to January 2006, roughly 30 percent of our videotape requests can be directly attributed to this path, although the number of items in these requests tends to be small.
The bottom line is that, during this period, 13 percent of our income from videotape service fees can be attributed to our OpenWeb initiative. That figure represents income we would not have taken in otherwise.
Our analysis demonstrates that the project was successful. We're seeing increases in overall Web site activity and in the number of requests made for videotapes. Although the gains so far are modest, we remain hopeful of larger improvements over time.
You can apply this approach to almost any digital library collection. If your collection remains locked into the invisible Web, some of the techniques that worked for us can be used to improve its visibility and to increase interest and use.