As my regular readers probably know by now, one of my main areas of responsibility for the last 2 years or so has been the Vanderbilt Television News Archive, which recently rejoined the university library system. The archive records, describes, and provides access to news programming of the national television networks. In the last 2 years we have been working hard on shifting from a videotape-based operation to one based on digital technologies. See tvnews.vanderbilt.edu for more information on the archive.
Modest Staffing + Budget = Build Your Own Software
While we've been fortunate to receive generous funding, including grants from the National Science Foundation and the National Endowment for the Humanities, for some major projects, the ongoing operation of the archive runs on a shoestring budget. Consistent with this issue's theme of Making the Most of What You Have, one of our key challenges has been to undertake a number of complex and large-scale technology developments with a very modest level of staffing and a tight software budget. In order to move the operation forward, we've created a new Web-based technical infrastructure and built a number of tools and utilities that provide access to the collection and automate many aspects of the operation. While we use off-the-shelf components where we can, much of the environment depends on locally developed software. Through the use of the Perl programming language and a flexible back-end database system, we have been able to create a variety of interfaces and tools that have proven to be very useful. These are some of our building blocks:
TV-NewsSearch: A Web-enabled database dubbed TV-NewsSearch represents the archive's collection, which spans 35 years of content. TV-NewsSearch currently includes about 750,000 records, most of which consist of full-text abstracts and other metadata that describe each news segment within an evening news broadcast. This interface serves as the primary means for users and staff to find items within the collection. The text of the abstracts, created by the archive's editorial staff, describes each news segment and greatly enriches the ability to find material through keyword searching. The interface also allows users to browse the collection by program date. The system includes not only the search-and-retrieval interface seen by users, but also a set of Web-based tools for editing records and ingesting abstracts created with word processing software. We have scripts that allow us to export records in XML when we need to exchange data with other systems.
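The archive's export scripts are written in Perl, and the actual record schema is not shown in this column. As a minimal illustrative sketch (in Python here, with hypothetical field names), turning database records into XML for exchange with another system might look like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical records and field names -- the archive's actual schema
# is not described in the article.
records = [
    {"id": "1968-08-05-CBS-01", "network": "CBS",
     "date": "1968-08-05", "abstract": "Report on the presidential campaign."},
]

def records_to_xml(records):
    """Serialize a list of segment records as a simple XML document."""
    root = ET.Element("records")
    for rec in records:
        node = ET.SubElement(root, "record", id=rec["id"])
        for field in ("network", "date", "abstract"):
            ET.SubElement(node, field).text = rec[field]
    return ET.tostring(root, encoding="unicode")

xml_text = records_to_xml(records)
```

The point is simply that once records live in a structured database, emitting them in XML for another system is a short script, whatever the language.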
IP-Authenticated Subscription Service: In order to help financially sustain the archive, in January 2004, we launched a subscription-based premium service targeted at other academic institutions. This service, like most other databases to which libraries subscribe, relies on IP authentication to recognize whether a user is associated with a subscribing institution. The IP address and corresponding institution code are loaded into a database that provides a quick lookup for authentication as each user logs onto the Web site. So far, we have over 12.5 million IP addresses in our authentication database. We also maintain a database to manage subscriptions, containing each institution's required information, subscription terms, payments, contact details, and the like.
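The core idea of IP authentication is mapping the client's address to an institution code, or rejecting it. The sketch below (Python, with made-up network ranges and institution codes) shows the lookup logic; the archive's actual implementation is a database lookup in Perl, which is what makes scanning over 12.5 million addresses fast:

```python
import ipaddress

# Hypothetical subscriber table: CIDR block -> institution code.
# A production system would keep this in a database, not in memory.
SUBSCRIBER_NETWORKS = {
    ipaddress.ip_network("129.59.0.0/16"): "VANDERBILT",
    ipaddress.ip_network("192.0.2.0/24"): "EXAMPLE-U",
}

def authenticate(client_ip):
    """Return the institution code for an IP, or None if unsubscribed."""
    addr = ipaddress.ip_address(client_ip)
    for network, code in SUBSCRIBER_NETWORKS.items():
        if addr in network:
            return code
    return None
```

A linear scan like this would be far too slow at 12.5 million addresses; an indexed range lookup in the database serves the same purpose efficiently.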
E-Commerce System: An online e-commerce system supports the fee-based videotape loan service the archive offers. The system allows users to select and place orders for items online and to securely enter credit card data or other payment options. We modeled the online ordering system after the shopping-cart systems common on consumer Web sites. We've also created a management system for processing payments, producing invoices, and generating financial reports. A significant concern with this component of the system involves the extra layers of security required when dealing with financial transactions. SSL (Secure Sockets Layer) encrypts the transmission of any pages that involve credit card data.
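The shopping-cart model the article mentions reduces to a small data structure: a list of selected items plus a fee lookup. As a rough sketch (the item types and fees below are hypothetical, not the archive's actual price schedule):

```python
# Hypothetical loan fees -- the archive's real fee schedule is not
# given in the article.
LOAN_FEES = {"broadcast_tape": 100.00, "compiled_tape": 50.00}

class Cart:
    """Minimal shopping cart: items are (type, description) pairs."""

    def __init__(self):
        self.items = []

    def add(self, item_type, description):
        self.items.append((item_type, description))

    def total(self):
        """Sum the loan fee for every item in the cart."""
        return sum(LOAN_FEES[item_type] for item_type, _ in self.items)

cart = Cart()
cart.add("broadcast_tape", "CBS Evening News, 1968-08-05")
cart.add("compiled_tape", "Campaign coverage compilation")
```

The payment-processing, invoicing, and reporting layers sit on top of a structure like this; the security work (SSL, careful handling of card data) is in those layers, not the cart itself.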
Automation of Digitization Work Flow: Every day, the archive digitizes a great deal of video content, including new programs recorded and older material converted from our historical collection. We're currently at the beginning of a 2-year project to digitize our entire 30,000 hours of news programming recorded from 1968 through 2003. We encode everything into MPEG-2, the digital format we deemed most suitable for preserving the collection. We send copies of each of these MPEG-2 files to the Library of Congress for permanent archiving. Since the MPEG-2 files are quite large and not necessarily optimized for online streaming, we transcode all the MPEG-2 files into RealMedia format. We learned very quickly that the logistics of tracking files, moving them from computer to computer, and transcoding from MPEG-2 to RealMedia were very time-consuming and prone to human error, and therefore a good candidate for automation. We devised an automated processing system, starting with a Web-based interface programmed in Perl with a back-end database. The database manages records that hold queue and status information, which can be used to manage the flow of each file through all the steps in processing and to allow staff to track progress. Given the high volume of material involved, a few hours devoted to programming this automated system has led to significantly better efficiency and should save many hundreds of hours over the term of our conversion project and indefinitely into the future for processing new recordings.
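The queue-and-status tracking described above amounts to a small state machine: each file carries a status that advances through a fixed sequence of processing steps. A minimal sketch of that idea (the stage names here are hypothetical, not the archive's actual status codes):

```python
# Hypothetical processing stages for one recorded program.
STAGES = ["recorded", "mpeg2_encoded", "archived_at_loc", "realmedia_transcoded"]

class FileRecord:
    """Tracks one video file's progress through the digitization pipeline."""

    def __init__(self, name):
        self.name = name
        self.stage = 0          # index into STAGES

    @property
    def status(self):
        return STAGES[self.stage]

    def advance(self):
        """Mark the current step complete and move to the next, if any."""
        if self.stage < len(STAGES) - 1:
            self.stage += 1

rec = FileRecord("cbs-1968-08-05.mpg")
rec.advance()                   # MPEG-2 encoding finished
```

With records like this in a database, a Web interface can show staff exactly where every file stands, and worker scripts can pick up whatever is queued for their step.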
Some Tips for Success
I believe that one of the main factors in building a successful customized application lies in selecting the right technical components for the project and taking a modular approach to programming. The main components we used for our applications have been the Perl programming language along with the connectivity modules that allow it to work with our Web servers and back-end database.
I discussed the Perl programming language in The Systems Librarian column (titled "Expanding the Systems Librarian's Toolkit") back when it appeared in Information Today (January 2002).
While PHP and Java have increased in popularity over the last few years, I continue to find Perl to be an excellent choice for Web- and database-oriented programming. It's one of the most flexible programming languages around. For novice programmers, it's fairly easy to produce scripts that do useful work. Yet, it's a programming language you don't outgrow. It has many nuances of functionality that satisfy the needs of even the most complex programming tasks.
One of the great advantages that Perl offers is a large degree of independence relative to operating systems, hardware platforms, and other components. There are versions of Perl for all the major operating systems. Our production environment for the Vanderbilt Television News Archive is Windows 2003 Advanced Server running on Dell PowerEdge servers. Since Perl also runs well under UNIX, it would be fairly easy to move our applications to that operating system should we ever feel the need. We use IIS (Internet Information Server) as the Web server for our production environment since it's well-integrated with the operating system and with our network. I have a demonstration version of the system that runs on a laptop computer using the open source Apache Web server. We did not have to change a single line of programming code to run under this alternate Web server, again thanks to the good support for Perl in both.
Another important aspect of flexibility is the database component of the system. We currently rely on Inmagic's DB/TextWorks as our primary database engine. Its Open Database Connectivity (ODBC) driver allows us to access the database through standard SQL queries. DB/TextWorks, given its superior handling of large amounts of text, is well-suited to our application, especially for TV-NewsSearch with its huge number of news-segment abstracts. Since we designed the application modularly and restricted all database-specific coding to a single Perl program, it would also be very easy to substitute another database, such as MySQL or Oracle, for DB/TextWorks. While I continue to be impressed with how well DB/TextWorks has scaled to our large databases, we anticipate that at some point the volume of transactions may increase enough to lead us to migrate to another database platform.
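Confining all database-specific code to a single module is what makes a back-end swap cheap. The archive does this in Perl against DB/TextWorks; purely as an illustration of the pattern, here is a Python sketch using SQLite as a stand-in back end, with all SQL kept inside one class:

```python
import sqlite3

class NewsDB:
    """All database-specific code lives here; swapping the back end
    (SQLite, MySQL, Oracle, ...) means changing only this class."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS segments (id TEXT, abstract TEXT)")

    def add_segment(self, seg_id, abstract):
        self.conn.execute(
            "INSERT INTO segments VALUES (?, ?)", (seg_id, abstract))

    def search(self, keyword):
        """Return IDs of segments whose abstract mentions the keyword."""
        cur = self.conn.execute(
            "SELECT id FROM segments WHERE abstract LIKE ?",
            ("%" + keyword + "%",))
        return [row[0] for row in cur]

db = NewsDB(sqlite3.connect(":memory:"))
db.add_segment("1969-07-20-CBS-03",
               "Apollo 11 lunar module lands on the moon.")
hits = db.search("Apollo")
```

The rest of the application calls methods like `search()` and never sees SQL or a driver, so a migration touches one file rather than every script.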
So far, our locally developed system has been successful. Since we have full control of the application, we have been able to build in the functions and features needed for the operation. The system has been reliable with virtually no downtime. Performance has been speedy despite steadily increasing transaction loads. The development costs have been low and the benefits have been high. Yet, while this do-it-yourself approach has worked well in our situation, I wouldn't necessarily recommend it for all circumstances. In our case, no off-the-shelf software was available with all the features and functions we needed. While the operation may eventually outgrow the system we've developed, it has thus far served us quite well and allowed us to make rapid progress in managing our collection and providing access to it via the Web.