Abstract: The Official Record of the Proceedings of the Irish Parliament is a document collection spanning 76 years, 600 volumes and some 125 feet of shelf space.
The project described in this paper involved capturing these volumes electronically in XML (eXtensible Markup Language) and automatically converting them for publication on CD-ROM and the Internet with a product known as Folio Views.
Folio Views is a commercial text database and search-and-retrieval tool that is particularly popular in the government, legal and financial publishing sectors. Its full-text search engine is powerful and fast, particularly for large document collections. A single Folio Views publication (known as an Infobase) can be up to 4 GB in size.
To get data into Folio Views, it must be converted to a tagged text format called Folio Flat File (FFF). FFF can be created directly from word processing documents, but it can also be generated from databases and from structured text formats such as XML, which is the approach discussed in this paper.
All the software aspects of the electronic publishing process, from quality assurance of the captured data through to the final generation of a >2 GB Folio Views Infobase, are written in Python.
This paper provides a brief overview of XML and illustrates how and why Python was used to build this production system. It presents an overview of LumberJack, a Python toolkit for XML processing developed by the authors. It includes details of some of the techniques used to integrate Python programs as first-class "documents" in the overall document hierarchy of the project. It also presents details of how Python was used as a powerful document validation and reporting tool.
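As a rough illustration of the validation and reporting idea, the sketch below checks one XML document for well-formedness and for a few structural rules, then prints a report. It uses the Python standard library rather than LumberJack, whose interface is not described here; the element names (`debate`, `heading`, `speech`, `speaker`) and the `REQUIRED_CHILDREN` rule table are assumptions made purely for this example and are not taken from the project's actual markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: each element name maps to the child
# elements it is required to contain. Not the project's real schema.
REQUIRED_CHILDREN = {
    "debate": ["heading", "speech"],
    "speech": ["speaker"],
}


def validate(xml_text):
    """Return a list of human-readable problem reports for one document."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # A document that is not well-formed cannot be checked further.
        return [f"not well-formed: {err}"]

    problems = []
    # Visit every element in the tree and apply the rules for its tag.
    for element in root.iter():
        for child_name in REQUIRED_CHILDREN.get(element.tag, []):
            if element.find(child_name) is None:
                problems.append(f"<{element.tag}> is missing <{child_name}>")
    return problems


sample = "<debate><heading>Finance Bill</heading><speech><p>text</p></speech></debate>"
for line in validate(sample):
    print(line)
```

Because the rules live in an ordinary Python data structure, a check of this kind can be extended with arbitrary program logic, which is one plausible reason a general-purpose language is attractive for quality assurance of captured data.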
|7th International Python Conference, November 10-13, 1998