Clara Fehrenbach spoke with Sebastian Hammer, Co-founder and President of Index Data, to learn more about Reservoir.
ReShare Shared Inventory
A shared inventory and consortial discovery have been foundational pieces of the Project ReShare vision since the beginning. The 2021 ReShare Returnables launches at PALCI and ConnectNY used FOLIO's Inventory module (dubbed "mod-inventory" in FOLIO-speak) as the basis for ReShare's Shared Inventory storage, which in turn feeds discovery layers such as VuFind. Mod-inventory served its purpose, but it became clear that the way ReShare needs to ingest and use bibliographic data calls for a more flexible shared inventory infrastructure, one designed to ingest data from many different sources (i.e., the individual member libraries in a consortium).
Because ReShare was intended to be modular from the start, it was possible for Project ReShare and Index Data to be responsive to the needs of the community and update the infrastructure behind the Shared Inventory.
What is Reservoir?
Originally named mod-meta-storage, Reservoir is the new underlying infrastructure of ReShare Shared Inventory. Based primarily on PostgreSQL, Reservoir was envisioned and realized in response to community need, both to address inefficiencies discovered in the live environments at PALCI and ConnectNY and to support the onboarding of IPLC onto ReShare Returnables using their Platform for Open Data (POD) infrastructure. Reservoir is designed to be both fast (quickly handling a very large number of records) and flexible (poised to reuse its contents for future purposes).
To achieve this speed and flexibility, Reservoir does not merge records as they are imported, which is how mod-inventory was designed to work. According to Sebastian, Reservoir works instead by "storing incoming bibliographic records separately and 'clustering' them using a match algorithm." The records can then be "merged" later for use in a consortial discovery layer or for other purposes. This "cluster now, merge later" approach was designed to make it much easier to experiment with different matching algorithms, since clusters can be reconfigured or rebuilt without a full data reload. Reservoir can even use more than one matching algorithm at the same time.
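The idea of clustering at ingest time and merging only on demand can be illustrated with a minimal Python sketch. This is not Reservoir's actual code or API: the record fields, the two matcher functions, and the merge rule are all hypothetical, chosen only to show how the same stored records can be clustered in more than one way without being reloaded.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical, simplified record shape; real bibliographic records are far richer.
Record = Dict[str, str]
Matcher = Callable[[Record], str]


def match_by_isbn(record: Record) -> str:
    """Hypothetical matcher: records with the same normalized ISBN cluster together."""
    return record["isbn"].replace("-", "")


def match_by_title_author(record: Record) -> str:
    """Hypothetical matcher: records with the same title and author cluster together."""
    return f'{record["title"].lower()}|{record["author"].lower()}'


def build_clusters(records: List[Record], matcher: Matcher) -> Dict[str, List[Record]]:
    """Group the stored records by match key. The records themselves are not changed."""
    clusters: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        clusters[matcher(record)].append(record)
    return dict(clusters)


def merge_cluster(cluster: List[Record]) -> Dict[str, object]:
    """Merge on demand: take title and author from the first record, list all sources."""
    return {
        "title": cluster[0]["title"],
        "author": cluster[0]["author"],
        "holdings": sorted({record["source"] for record in cluster}),
    }


# Each incoming record is stored separately, exactly as it arrived from its source.
records: List[Record] = [
    {"source": "library-a", "isbn": "978-0-14-143951-8",
     "title": "Pride and Prejudice", "author": "Austen, Jane"},
    {"source": "library-b", "isbn": "9780141439518",
     "title": "Pride and prejudice", "author": "Austen, Jane"},
    {"source": "library-c", "isbn": "978-0-19-953556-9",
     "title": "Pride and Prejudice", "author": "Austen, Jane"},
]

# Two matching algorithms applied to the same stored records, with no reload.
isbn_clusters = build_clusters(records, match_by_isbn)
title_clusters = build_clusters(records, match_by_title_author)

# Merging happens later, only when a downstream consumer needs a combined record.
merged = merge_cluster(title_clusters["pride and prejudice|austen, jane"])
```

In this sketch the ISBN matcher yields two clusters and the title/author matcher yields one, yet the three stored records are identical in both cases. Swapping or adding a matcher only rebuilds the grouping, which is the property the paragraph above describes.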
Want to know how much faster Reservoir is? Consider this: Using Reservoir, it takes less than a week to ingest, merge, and process a collection of about 80 million bibliographic records. Before Reservoir, it would have taken approximately five months to complete the same process.
A reservoir is "a large natural or artificial lake used as a source of water supply."
Taking inspiration from "data lake" terminology and imagery, Reservoir was named because it is envisioned as a data lake that ingests data from sources "upstream" and provides a supply of "clean" data to any service positioned "downstream." Currently, the primary use of this data is in consortial discovery using VuFind, but it could be adapted for many different purposes, including consortial collection analysis.