One of the projects I've been working on over the last few weeks involves increasing the visibility and effectiveness of one of the Web sites I maintain. I've mentioned my involvement with Vanderbilt University's Television News Archive in previous columns. We have a Web site (http://tvnews.vanderbilt.edu) that provides access to the archive's large collection of news programming through a searchable database. The archive's staff members populate this database with detailed information about each segment of news programming that is recorded, complete with a narrative abstract. Through this database, people can find items that match their personal interests or research needs. Some users find the information they seek by reading the abstracts in the database, but others need to view the video itself. For them, we offer a videotape loan service, making copies of news programs that users can borrow. The database interface includes an e-commerce function that allows users to select the items they want to view and to pay the service fees we charge to recover our costs.
The effectiveness of the TV News Archive's Web site can ultimately be measured by the number of successful searches performed and the number of videotape requests placed. While we have seen a significant increase in both activity levels over the last few years, we continue to look for ways to boost activity even more. We don't believe that we have yet reached all potential users, or that those who do visit our site always find the material they seek.
Recently, I've been looking for ways to increase the activity on the Web site and, I hope, to boost the number of videotape loan requests. This project includes two lines of investigation. One is streamlining and optimizing the way the Web site works in order to improve its usability; the other is devising strategies to improve the site's visibility and discoverability on the Web. During the past few weeks, I've been working on ways to expose more of our metadata on the open Web to increase use of the site. I'll probably talk about that in a future column; this one focuses on the methods of analysis available for studying the usage of the site, which are necessary for identifying problems and making improvements.
One of the main characteristics of our archive is that we get very few on-site visitors. Almost all use comes from remote users via the Web site, so it's essential that the site work well. I've been studying the site's usage in detail. While some of the techniques I've used may be particular to our site, most apply to analyzing any library site.
Conducting in-person usability studies and focus groups is one of the best ways to learn how users experience a Web site. While we have done that, I'm now taking a more forensic approach: analyzing logs and other system data to measure the effectiveness of the Web site's design and search engine.
Studying Web Server Logs
Web server logs provide a wealth of raw data about how users approach your site. If you host your own site, you should have easy access to them. People who run their sites on external servers may need to negotiate with the systems administrator for access to the log files. All Web servers accumulate detailed information about each page requested. Exactly which elements the server records varies according to the type of software used and the options set by the site's administrator. Let's walk through some of the most important data elements.
The page requested describes the document the server was asked to deliver. The HTTP status code indicates whether the request was successful or whether some other condition occurred. This is represented by a three-digit number: 200 indicates that the page was delivered successfully; the dreaded 404 means the page could not be found. Many other status codes have been defined; they are fully documented by the World Wide Web Consortium at http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html. The IP address of the requester provides some information about the entity seeking the page. While we're not necessarily interested in identifying individuals, it is useful to separate people using Web browsers from the software bots that search engines use to harvest pages from the site. The address of the requester also makes it possible to group together all the requests that belong to a single session.
Most Web servers have an option either to simply record the IP address in the log or to perform a DNS (Domain Name System) look-up to get a more informative rendition. It does take a bit more computing power to perform this reverse-DNS look-up, but most Web servers can handle the load easily. If your site receives hundreds of page requests per second, though, you may want to turn off reverse-DNS look-ups. That doesn't mean you can't get this information: most Web log analysis packages can perform the DNS look-ups as they process the file.
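To illustrate how that after-the-fact look-up works, here is a minimal Python sketch; the file name "access.log" and the assumption that the IP address is the first field on each line are mine, not features of any particular analysis package:

```python
import socket

def reverse_dns(ip):
    """Return the host name for an IP address, or the address itself if it can't be resolved."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip

# Resolve the addresses found in a log file; the IP is assumed to be the first field.
with open("access.log") as log:
    for line in log:
        fields = line.split()
        if fields:
            print(fields[0], reverse_dns(fields[0]))
```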
The referring URL describes the last page the user visited before requesting the current one. This data element helps you learn how visitors find your site and how they navigate through your pages. The user agent records the text string that identifies the type of Web browser that made the request. If you are concerned that some pages on your site might not display well in older or non-standards-compliant browsers, this information lets you compile statistics on what portion of your users might be affected.
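To make these elements concrete, here is a short Python sketch that pulls them out of entries in the widely used "combined" log format; the exact layout of your own server's log may differ, and the file name is an assumption:

```python
import re

# One entry in the "combined" format: IP, identity, user, timestamp,
# request line, status code, bytes sent, referring URL, user agent.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

with open("access.log") as log:
    for line in log:
        entry = COMBINED.match(line)
        if not entry:
            continue  # skip lines recorded in some other format
        print(entry["ip"], entry["status"], entry["request"],
              "referred by", entry["referrer"])
```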
Processing Web Log Files
The first and most obvious approach to analyzing your Web site is to process these log files with one of the many Web site analysis products. While the specific features vary somewhat, these packages generate statistical reports that summarize the overall use of the site. Some of the Web site analysis packages that I have used lately are Analog (http://www.analog.cx) and AWStats (http://www.awstats.org), both free, open source applications. I've also used WebTrends (http://www.webtrends.com), a commercial product. The free products provide almost all of the statistics and reports needed for most library Web sites. The commercial packages offer more sophisticated features that might be needed to optimize very complex sites and to improve e-commerce activity.
The log analysis software will reveal the main characteristics of the server's use, such as the overall volume of requests, the popularity of individual pages, and the volume of activity by time of day and day of the week. This analysis gets beyond measuring activity in terms of individual hits and shows the number of user sessions, a much more meaningful metric. Most log analysis reports will also cull out and summarize the page requests made by search engine bots; the number of raw hits greatly exceeds the level of end-user activity because of the inflation caused by search engine harvesting.
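The analysis packages handle this for you, but the underlying logic is straightforward. The sketch below groups requests into sessions by IP address and user agent and sets crawler traffic aside; the 30-minute idle timeout and the short list of bot signatures are my own assumptions, not settings taken from any particular package:

```python
import re
from datetime import datetime, timedelta

LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
                  r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"')
BOT_HINTS = ("bot", "crawl", "spider", "slurp")  # assumed signatures of harvesting agents
TIMEOUT = timedelta(minutes=30)                  # assumed idle time that ends a session

sessions, bot_hits, last_seen = 0, 0, {}
with open("access.log") as log:
    for line in log:
        entry = LINE.match(line)
        if not entry:
            continue
        agent = entry["agent"].lower()
        if any(hint in agent for hint in BOT_HINTS):
            bot_hits += 1                        # count crawler requests separately
            continue
        when = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
        visitor = (entry["ip"], agent)
        # A new session begins when a visitor has been idle longer than the timeout.
        if visitor not in last_seen or when - last_seen[visitor] > TIMEOUT:
            sessions += 1
        last_seen[visitor] = when

print(sessions, "user sessions;", bot_hits, "requests from search engine bots")
```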
While looking at a snapshot of activity can be informative, it's even more enlightening to record some of the key measures in a spreadsheet every week or month. You can use this data to plot increases or decreases in activity over time and to correlate changes with external factors. An increase in activity following changes in the design of the site might be taken as a measure of success. One would naturally expect usage levels to increase gradually over time; it's the abrupt spikes and valleys that invite further study.
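As a sketch of that kind of record keeping, the snippet below appends one summary row per week to a CSV file that any spreadsheet can open; the file name, the column names, and the placeholder figures are all assumptions to be filled in from your own reports:

```python
import csv
import os
from datetime import date

# Placeholder figures; fill these in from the week's log analysis report.
week_summary = {
    "week_of": date.today().isoformat(),
    "user_sessions": 0,
    "page_requests": 0,
    "loan_requests": 0,
}

new_file = not os.path.exists("weekly-activity.csv")
with open("weekly-activity.csv", "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=week_summary.keys())
    if new_file:
        writer.writeheader()  # write the column names only once
    writer.writerow(week_summary)
```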
It's also useful to look at the native Web server logs themselves. You can use your favorite text editor to view a log file and search for text, and a tool such as the UNIX utility "grep" (global regular expression print) is handy for pulling out lines that contain a particular string. While the log files can look a bit intimidating at first, you can learn a lot about your site's activity by studying them directly. It's helpful, for example, to select a few random users represented in the log file and to reconstruct the path they followed through your site, as in the sketch below. Look for patterns and problems. Do they seem to follow a natural path to a destination page, or do they exit the site abruptly?
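This fragment prints every request made from a single address, in order, along with the page that referred each one; the address, the file name, and the combined log format are stand-ins for whatever you find in your own file:

```python
import re

VISITOR = "192.0.2.10"  # a stand-in address picked out of the log
LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
                  r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)"')

with open("access.log") as log:
    for line in log:
        entry = LINE.match(line)
        if entry and entry["ip"] == VISITOR:
            # The visitor's trail: when, what was requested, the result, and the referring page.
            print(entry["time"], entry["status"], entry["request"], "<-", entry["referrer"])
```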
It's important to take care of any 404 (page not found) errors that show up in the logs. You can use the referrer data on log entries with 404 errors to identify the page that carries the incorrect link. These errors can be caused by broken links within the site or by outdated links on external sites. Any broken links within your own site can be repaired easily once identified. Errors caused by external sites pointing to pages that no longer exist on your server are more difficult to fix; you may need to contact that site's Webmaster to request a correction. If such an error occurs frequently in your logs, you can put a page on your server under the missing name that redirects the user to the current version of the page. This approach keeps users from concluding that the site is defunct and turns what would otherwise be an error into a successful visit.
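A short script along these lines can tally the 404 errors and the pages that point to them; the log format and file name are the same assumptions as in the earlier sketches:

```python
import re
from collections import Counter

LINE = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (?P<page>\S+)[^"]*" '
                  r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)"')

broken = Counter()
with open("access.log") as log:
    for line in log:
        entry = LINE.match(line)
        if entry and entry["status"] == "404":
            # Record the missing page and the page whose link led the visitor to it.
            broken[(entry["page"], entry["referrer"])] += 1

for (page, referrer), count in broken.most_common(20):
    print(count, page, "linked from", referrer)
```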
A Web site that provides access to content through a searchable database can be even more complex to analyze. Almost all of the activity on our Television News Archive's site takes place within the TV-NewsSearch database of more than 750,000 abstracts of news programs. We want to make sure that visitors find the database easy to use so that they locate the items they need within our collection. We're especially interested in knowing when a searcher gets no results even though the content exists in the collection. It's important for us to be sure that the terms we use to describe the items correspond to the terms that searchers are likely to try.
The Web server's native log files do not help in analyzing the performance of the search engine. The search engine we created produces its own logs, which record the search terms entered, the results returned, the display of individual records, and program listings, as well as all the steps involved in user registration and loan requests. This log file allows us to analyze each search session. How many searches does each user perform? When a search returns zero results, does the user give up or keep trying? Are there searches that return zero results when we really do have the material, just described with different terms? The problems identified in this part of the analysis tend to be more difficult to address. Some issues can be improved by adjusting the interface and the way that terms are preprocessed before being submitted to the search engine, but others may ultimately require changes in the way we describe programs.
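Our search engine's log has its own layout, so the sketch below is purely illustrative: it assumes a tab-separated file with a session identifier, the search terms, and the number of results on each line, and it simply tallies the most frequent zero-result searches, which are the first candidates for vocabulary fixes:

```python
import csv
from collections import Counter

zero_hits = Counter()
with open("search.log", newline="") as log:
    # Assumed layout: session id, search terms, result count, separated by tabs.
    for session_id, terms, result_count in csv.reader(log, delimiter="\t"):
        if int(result_count) == 0:
            zero_hits[terms.strip().lower()] += 1

for terms, count in zero_hits.most_common(25):
    print(count, terms)
```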
Maintenance Is Worthwhile
All of this goes to demonstrate that maintaining a Web site is a never-ending process. We must always be vigilant to prevent errors and problems from cropping up on our sites. It's also important to regularly assess the site and make adjustments as needed. Given how much most organizations rely on their Web sites to deliver services to their users, it's well worth the time to make sure the site is working optimally.