Cyril W. Cleverdon, a pioneer in the development of online data base services at the Cranfield Institute, gave the keynote address at the Fall meeting of the Association of Information and Dissemination Centers (ASIDIC). Cleverdon concentrated on the user connection in speaking on "Optimizing Convenient Online Access to Online Bibliographic Data Bases." His thrust was that many of the existing data base services are overly expensive and openly hostile to the users they are designed to serve. Many of the concerns about the end users of online data base services are transferable to the online patron access catalog users.
In working toward a blueprint for optimizing convenient access, Cleverdon drew on research results from a number of studies, many of them in the areas of science and technology. In relation to controlled vocabulary indexing, research has shown that when two different groups of people:
- construct a subject headings list or thesaurus in the same subject, only 60% of the terms will be common,
- index a set of documents using a controlled vocabulary, only 30% of the terms will be common,
- search for a subject in the data base, only 40% of the citations will be the same,
- assess the results of a search, there will be only 60% agreement on the relevance of the retrieved citations.
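The agreement figures above can be made concrete with a small illustration. The function and term lists below are invented for the example; the percentages Cleverdon cites come from the published studies, not from this formula, which simply measures what fraction of two indexers' combined vocabulary they share.

```python
# Illustrative only: a hypothetical measure of inter-indexer consistency,
# computed as the overlap between two indexers' term sets.

def term_overlap(terms_a, terms_b):
    """Fraction of the combined vocabulary used by both indexers."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b)

# Invented example: two indexers assign terms to the same document.
indexer_1 = {"catalogs", "online systems", "subject headings", "thesauri", "retrieval"}
indexer_2 = {"catalogs", "online systems", "indexing", "retrieval", "databases"}

print(f"{term_overlap(indexer_1, indexer_2):.0%} of terms in common")  # prints "43% of terms in common"
```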
Reliance on the use of abstracts and free text minimizes the impact of human variability in thesaurus development and indexing.
The application of a controlled vocabulary which limits the number of terms available to the searcher can be justified in a printed index but not in an online environment, in which the computer has the ability to limit the universe at the time of searching rather than at the time of input. Cleverdon was critical of the limiting approach most commonly used: Boolean searching techniques. In his opinion, "it is ironic and absurd that present online systems fail to take full advantage of computer power and continue to use Boolean searching, an inefficient search technique." The major disadvantage of Boolean techniques is that they divide the universe of available material into two distinct and separate categories: those items which match the search criteria and those which do not. Within the set which matches the search query, every retrieved citation is treated as equally likely to be relevant. The convenience of users requires that search output be ranked in order of probable relevance.
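The set behavior Cleverdon criticizes can be sketched in a few lines. The documents and terms below are invented for illustration; systems of the period worked on inverted files, but the essential point is the same: a Boolean AND returns an undifferentiated set with no ranking.

```python
# A minimal sketch of Boolean AND retrieval over hypothetical citation
# records, each represented as a set of index terms.

documents = {
    1: {"online", "catalog", "access"},
    2: {"online", "retrieval", "boolean"},
    3: {"catalog", "printed", "index"},
    4: {"online", "catalog", "retrieval", "access"},
}

def boolean_and(query_terms):
    """Return every document containing ALL query terms -- an unranked set."""
    return {doc_id for doc_id, terms in documents.items()
            if query_terms <= terms}

hits = boolean_and({"online", "catalog"})
# Every hit is treated as equally likely to be relevant; nothing in the
# result distinguishes a close match from a marginal one.
print(sorted(hits))
```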
"Quorum" function searches in which a list of terms is input and document citations are output ranked in order from those which contain all search terms through those with all but one, all but two, etc., compare well with Boolean searching. Quorum searches have the advantages of being simpler for the searcher; being output in a logical order; being flexibleto increase recall the searcher asks for the output of results down to a relatively low level of match, to increase precision the searcher limits the output to those citations which match on a large number of search terms; reducing search time, and improving performance. In short, it is user friendly.
Test results on limited samples show that the Quorum approach compares favorably with Boolean even at the fourth lowest level of a nine-level hierarchy. It performs even better with material input at a high level of exhaustivity: input, for instance, as abstracts rather than as indexed citations.
User convenience is further hindered by the number of different data bases in a particular field. This proliferation is a hangover from the pattern established for printed abstracts and indexes. In an online environment there is no need for such duplication. Cleverdon admitted that a change to a single online file for each discipline was probably a political and economic impracticality; the entrenched interests are too strong to sway.
Apart from duplication, other aspects of existing data bases contribute to the user hostility of current systems. The objective of 100 percent coverage is laudable but does not serve the best interests of users, because it entails no differentiation on the basis of the quality of the papers referenced in the data base. Research has shown that only a small percentage of published papers appear in identifiable journals and are cited in the later literature. Cleverdon ventured that it can probably be assumed that the 80-20 rule applies: 80% of users' needs can be met by 20% of the available literature.
In relation to data base coverage and the number of citations retrieved on a search, recent research by Lance at London University indicates a finite limit to the number of papers actually referred to by end users. In summary, the study indicates that if 10 retrieved citations are judged relevant, no more than 28 will be consulted. On average, even for the largest search outputs with hundreds of relevant citations, the upper limit to the number of papers consulted is 34. The maximum number of papers ever consulted varies according to the subject field of the user: for engineers it is eight, for sci-tech personnel 34, and for medical professionals 68. The amount of information gained decreases as the number of papers read increases. The 80-20 rule may be presumed to be at work again.
In Cleverdon's opinion, factors such as these present conclusive arguments against the development of comprehensive, all-inclusive data bases. Such files are being developed to meet the requirements of only a tiny proportion of users and operate to the disadvantage of the majority. In an ideal situation, not only would there be only one data base per subject, but that data base would reference the contents of only a small number of journals in that field.
In answer to a question seeking information on any operational Quorum systems, Cleverdon referred to work undertaken at the National Library of Medicine and reported by Tamas Doszkocs. The team developed a system which accepted questions typed in natural language, selected every significant word and phrase from the input, and automatically "ANDed" these terms. Output was ranked in the order of probable relevance and offered for user assessment. The system then took these assessments and used them to refine the search and the answer set, producing another list of citations. The system was inefficient in terms of its use of machine resources, and NLM never made it available.
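The feedback loop described above can be sketched with a simple term-reweighting scheme. This is not the NLM implementation, whose internals the talk did not detail; the documents, terms, and weighting rule below are invented purely to illustrate the idea of folding user relevance judgements back into the next ranking pass.

```python
# A loose sketch of a relevance-feedback loop: rank, collect the user's
# relevance judgements, boost the terms from judged-relevant documents,
# and rank again.

from collections import Counter

documents = {
    1: {"heart", "disease", "treatment", "drugs"},
    2: {"heart", "surgery", "bypass"},
    3: {"cancer", "treatment", "drugs"},
    4: {"heart", "disease", "diet"},
}

def rank(weights):
    """Score each document by the summed weights of the terms it contains."""
    scores = {d: sum(weights[t] for t in terms)
              for d, terms in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)

def feedback(weights, relevant_ids):
    """Boost every term occurring in documents the user judged relevant."""
    new = Counter(weights)
    for d in relevant_ids:
        for t in documents[d]:
            new[t] += 1
    return new

weights = Counter({"heart": 1, "disease": 1, "treatment": 1})
first_pass = rank(weights)
# Suppose the user judges the top-ranked citation relevant:
weights = feedback(weights, relevant_ids=[first_pass[0]])
second_pass = rank(weights)  # re-ranked list after one feedback round
```

The machine cost Cleverdon mentions is visible even here: each feedback round re-scores the whole file, which on 1980s hardware and full-size bibliographic files was an expensive proposition.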