Library Technology Guides

Document Repository

On DigiPaper and the dissemination of electronic documents

D-Lib magazine [January 2000]

Copyright (c) 2000 Corporation for National Research Initiatives

Abstract: Encoding electronic documents involves a tradeoff between maximizing the ease of dissemination and preserving the document appearance. For instance, a simple text file is the most easily and universally disseminated form of document, but it preserves none of the appearance. This paper proposes a new image-based document representation, called DigiPaper, which is designed to easily disseminate electronic documents with a guaranteed appearance, thus eliminating the tradeoff. DigiPaper provides fixed appearance by representing documents in image form, but uses new compression techniques to make the file size comparable to formats such as Word, PowerPoint or PDF. DigiPaper compression is based on two technologies, the Mixed Raster Content (MRC) color image model and token-compression. DigiPaper files are much smaller than current image formats used for scanning, achieving about a factor of 7 improvement in compression over TIFF Group 4 compressed images.

Electronic documents have forever changed the ways in which we share information, primarily due to the ease with which they can be disseminated compared to physical documents. Electronic documents can be more widely and cheaply disseminated because they can be transmitted across networks, replicated virtually for free, and accessed simultaneously by multiple users. Thus, as a medium for the dissemination of information, electronic documents are considerably more powerful than physical media such as paper.

Certain underlying assumptions that aid in the dissemination of electronic documents limit their usefulness, however. One such assumption is that the information in a document is carried primarily by its textual content, at the expense of the information carried by other elements such as layout and design. For instance, markup languages such as SGML and HTML focus on the textual content and make the specification of most layout, font, and other design content secondary. While this assumption is most evident in HTML, it is reflected to some degree in every electronic document representation, including "layout based" representations such as PostScript and PDF.

Because it is assumed that a document can be successfully transmitted by distributing its text, electronic documents have an intrinsic malleability in their rendering. This is manifested in common features which allow networked documents to be displayed on a multitude of platforms, such as the ability to make a font substitution when the specified font is not available or to adapt the aspect ratio of a document to that of the monitor or browser window. The most malleable (least fixed) form of a document is simple ASCII text, which specifies no information about appearance. A highly malleable document format is preferable when a document is to be presented in highly different ways, such as color versus monochrome, or on very different devices, such as a Palm Pilot and a photographic printer. In these situations, breadth of dissemination is privileged over appearance, and a malleable format that preserves text content over appearance is greatly advantageous.

While document malleability increases the potential for dissemination of the textual content, it limits the potential for dissemination of design and layout information, which are often very important in conveying information. The world of printed-paper documents has a centuries-long tradition of valuing document presentation for the information that it carries and for its communicative effect. Print publications have traditionally used the graphic elements of documents as an important aid in conveying meaning. Fonts, layout, and graphics appeal to our senses, reinforcing the connotations and emotional response the document tries to elicit. The physical characteristics of the medium allow the designer to determine to a large extent the visual experience of the reader. Graphic design has an increasing presence in today's business documents (e.g., annual reports, brochures, and product catalogs), making company literature more appealing and easier to understand for the reader.

Many authors and designers moving to the digital medium have been unwilling to abandon the practice of controlling document appearance. Despite the difficulties entailed in precisely encoding the layout of a page in HTML, the majority of well-designed web sites go to great lengths to control the page seen by the reader to the finest detail. They often use elements of HTML such as tables and images in ways not originally intended. In the minds of the designers of these web sites, the value of document appearance clearly outweighs the advantages of the malleable electronic document. Extensions to HTML such as Cascading Style Sheets (CSS) are intended to extend the ability of the designer to fix the appearance of the document. However, these extensions still do not provide the kind of control over appearance that is afforded by paper documents. Furthermore, for a malleable form of document, the document creator does not really know how the document will look, and cannot control the document's visual impact on the reader. By neglecting document presentation, the electronic document risks losing this valuable avenue of expression.

In this paper, we argue that there is an important role for electronic documents with a guaranteed fixed appearance that can be controlled by the document creator, much as with paper documents. Today, the need for networked documents with a fixed appearance is met by one of two methods. The first method is using a standardized page description language (PDL), such as PostScript or PDF. This approach has the advantages that PostScript and PDF are widely used, they are relatively compact (e.g., compared to image formats such as TIFF), and they encode much of the document structure. Nevertheless, the disadvantages of this method are substantial. To varying degrees, PDLs are not sufficiently standardized and require considerable processing power to display. Moreover, in practice they cannot guarantee the document appearance. Standard PDF and PostScript files are rendered differently on each device. This is sometimes imperceptible, but many users encounter documents they cannot view or print, or that appear in a distorted manner, especially when the correct fonts are not available.

A few niche markets, such as publishing, use digital images of a document as a means of addressing the need for a fixed electronic form of the document. Digital images provide guaranteed appearance: the placement of text and graphic art is fixed, fonts are not an issue, and text, art, and photographs can be mixed at will. In addition, the document can be viewed or printed without requiring the application that generated it (such as MS Word or PowerPoint) or a PDL renderer such as Adobe Acrobat or a PostScript viewer. This solution of transmitting electronic documents using an image representation has become standard in communities such as publishing and digital archiving. Publishers send books to press as TIFF images (often embedded in PostScript), thus avoiding problems caused by the lack of appropriate fonts in the printer. In digital archival repositories, the current practice is to use TIFF files with CCITT Group 4 compression as a preservation format. However, digital document images have one main drawback that prevents them from being used more widely: they tend to be very large. Consequently, they use a lot of storage and transmit too slowly for the majority of users. Their use is only cost effective in very specific cases where storage and bandwidth considerations are not an issue.

This paper proposes a new image-based document representation, DigiPaper, to encode document appearance efficiently and maintain the high dissemination potential characteristic of the electronic medium. DigiPaper provides guaranteed appearance, relies to a minimum on the environment in which it is rendered (e.g., does not require particular fonts or the application that created the document), and thus is readable by a varied audience. It eliminates the tradeoff between maximizing dissemination and preserving document appearance that today faces the creator of electronic documents. DigiPaper is designed to meet the need for a fixed electronic form of a document, while keeping file sizes small. It can be used successfully both with scanned and electronic source documents. Electronic source documents include those rendered to page images from page description languages such as PostScript (so-called RIPped documents), and those generated with text processors or presentation software such as Microsoft Word or PowerPoint.

DigiPaper is a structured image representation for documents. One of the main reasons document images are so large is that current formats do not take sufficient advantage of the special nature of document images. For instance, most documents are composed of different types of content. Text, photographs, graphs, tables, and business graphics often appear together on a single page. A single treatment (i.e., resolution, color depth, compression) is never suited to all these kinds of material, but conventional document image formats do not provide good support for combining multiple encoding techniques. By using a structured image representation, with different layers for different kinds of material, it is possible to obtain much better compression. DigiPaper applies to each such layer an encoding method that is appropriate to that type of material, thereby providing a good trade-off between storage efficiency and image quality. To represent the different content types in the multiple layers, DigiPaper uses the Mixed Raster Content (MRC) imaging model. For compression, DigiPaper makes heavy use of token compression.
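
The layered model described above can be sketched in a few lines of Python. This is a minimal illustration only, not DigiPaper's file format or decoder. It assumes the common three-layer form of MRC, in which a full-resolution binary mask selects, pixel by pixel, between a foreground layer and a background layer, and it assumes the two color layers may be stored at lower resolution than the mask. The names MrcPage, upsample, and composite are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # RGB triple
Layer = List[List[Pixel]]


@dataclass
class MrcPage:
    """Hypothetical container for one page in a three-layer MRC form."""
    mask: List[List[int]]  # full resolution; 1 = foreground, 0 = background
    foreground: Layer      # e.g. text and line-art colors, may be low resolution
    background: Layer      # e.g. photographs and paper color, may be low resolution


def upsample(layer: Layer, height: int, width: int) -> Layer:
    """Nearest-neighbour scaling of a layer to the page size."""
    src_h, src_w = len(layer), len(layer[0])
    return [
        [layer[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def composite(page: MrcPage) -> Layer:
    """Rebuild the page image: the mask picks a layer for every pixel."""
    height, width = len(page.mask), len(page.mask[0])
    fg = upsample(page.foreground, height, width)
    bg = upsample(page.background, height, width)
    return [
        [fg[y][x] if page.mask[y][x] else bg[y][x] for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    black, white, grey = (0, 0, 0), (255, 255, 255), (200, 200, 200)
    page = MrcPage(
        mask=[[1, 0, 0, 1], [0, 1, 0, 0]],
        foreground=[[black]],          # a single color for all text pixels
        background=[[white, grey]],    # half the horizontal resolution
    )
    for row in composite(page):
        print(row)
```

Because each layer is stored separately, each can be compressed with a method suited to its content: the binary mask with a bilevel coder, for instance, and the background with a continuous-tone coder. The sketch leaves compression out entirely and shows only how the layers recombine into a page image.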


Publication Year:2000
Type of Material:Article
Language:English
Published in: D-Lib magazine
Publication Info:Volume 6 Number 1
Issue:January 2000
Publisher:Corporation for National Research Initiatives
Subject: Electronic publishing
DigiPaper
Online access:http://www.dlib.org/dlib/january00/moll/01moll.html
ISSN:1082-9873
Record Number:7885
Last Update:2012-12-29 14:06:47
Date Created:0000-00-00 00:00:00