As part of our ongoing efforts to archive and provide perpetual access to at-risk, open-access scholarship, we have released Refcat ("reference" + "catalog"), the citation index culled from the catalog that underpins our IA Scholar service for discovering the scholarly literature and research outputs within Internet Archive. This first release of the Refcat dataset contains over 1.3 billion citations extracted from over 60 million metadata records and over 120 million scholarly artifacts (articles, books, datasets, proceedings, code, etc) that IA Scholar has archived through web harvesting, digitization, integrations with other open knowledge services, and through partnerships and joint initiatives.
Refcat represents one of the larger citation graph datasets of scholarly literature, as well as uniquely containing a notable portion of citations from works that do not have a DOI or persistent identifier. We hope this dataset will be a valuable community resource alongside other critical knowledge graph projects, including those with which we are collaborating, such as OpenCitations and Wikicite.
The Refcat dataset is released under a CC0 license and is available for download from archive.org. The related software created for the extraction and matching process, including exact and fuzzy citation matching (refcat and fuzzycat), are also released as open-source tools. For those interested in technical details about the project, a white paper is available on arxiv.org authored by IA engineers, including Martin Czygan, who led work on Refcat, and is described in our catalog user guide.
What does Refcat mean for regular users of IA Scholar? Refcat results from work to ensure the interconnection between material within IA Scholar and other resources archived in Internet Archive in order to make browsing and lookups easier and to ensure overall citation integrity and persistence. For example, there are over 25 million web links in the citations in Refcat and we were able to match ~14 million of these to archived web pages in Wayback Machine and also found that ~18% of these matched web citations are no longer available on the live web. Web links in citations not in Wayback Machine have been added to ongoing web harvests. We also matched over 20 million citations to books that are available for lending in our Open Library service and matched over 1 million citations to Wikipedia entries.
Besides interconnection, Refcat will allow users to understand what works have cited a specific scholarly resource (i.e. "cited by" or "inbound citations") that will help with improved discovery features. Finally, knowing the full "knowledge graph" of IA Scholar helps us better identify important scholarly material that we have not yet archived, thus improving the overall quality and extent of the collection. This, in turn, aids scholars by ensuring their open-access work is archived and accessible forever, especially for those whose publisher may not have the resources for long-term preservation, and it ensures that related outputs like research registrations or datasets are also archived, matched to the article of record, and available into the future.
The Refcat release is a milestone of Phase Two of our project, "Ensuring the Persistent Access of Long Tail Open Access Journal Literature," first announced in 2018 and supported by funding from the Andrew W. Mellon Foundation. Current work focuses on citation integrity within the IA Scholar archive, partnerships and services, such as our role in the multi-institutional Project Jasper and our partnership with Center for Open Science, and the addition of secondary scholarly outputs to IA Scholar, including datasets, software, and other non-article/book scholarly materials. Lookout for a plethora of announcements about other IA Scholar milestones in the coming months!