The Andrew W. Mellon Foundation has awarded a research and development grant to the Internet Archive to address the critical need to preserve the "long tail" of open access scholarly communications. The project, Ensuring the Persistent Access of Long Tail Open Access Journal Literature, builds on prototype work identifying at-risk content held in web archives by using data provided by identifier services and registries. Furthermore, the project expands on work acquiring missing open access articles via customized web harvesting, improving discovery and access to this materials from within extant web archives, and developing machine learning approaches, training sets, and cost models for advancing and scaling this project's work.
The project will explore how adding automation to the already highly automated systems for archiving the web at scale can help address the need to preserve at-risk open access scholarly outputs. Instead of specialized curation and ingest systems, the project will work to identify the scholarly content already collected in general web collections, both those of the Internet Archive and collaborating partners, and implement automated systems to ensure at-risk scholarly outputs on the web are well-collected and are associated with the appropriate metadata. The proposal envisages two opposite but complementary approaches:
A top-down approach involves taking journal metadata and open data sets from identifier and registry sources such as ISSN, DOAJ, Unpaywall, CrossRef, and others and examining the content of large-scale web archives to ask "is this journal being collected and preserved and, if not, how can collection be improved?"
A bottom-up approach involves examining the content of general domain-scale and global-scale web archives to ask "is this content a journal and, if so, can it be associated with external identifier and metadata sources for enhanced discovery and access?"
The grant will fund work to use the output of these approaches to generate training sets and test them against smaller web collections in order to estimate how effective this approach would be at identifying the long-tail content, how expensive a full-scale effort would be, and what level of computing infrastructure is needed to perform such work. The project will also build a model for better understanding the costs for other web archiving institutions to do similar analysis upon their collection using the project's algorithms and tools. Lastly, the project team, in the Web Archiving and Data Services group with Director Jefferson Bailey as Principal Investigator, will undertake a planning process to determine resource requirements and work necessary to build a sustainable workflow to keep the results up-to-date incrementally as publication continues.
In combination, these approaches will both improve the current state of preservation for long-tail journal materials as well as develop models for how this work can be automated and applied to existing corpora at scale. Thanks to the Mellon Foundation for their support of this work and we look forward to sharing the project's open-source tools and outcomes with a broad community of partners.