The Code4Lib Journal – Issue 56, 2023-04-21

Strategies for Digital Library Migration

A migration of the datastore and data model for the Stanford Digital Repository's digital object metadata was recently completed. This paper describes the motivations for this work and some of the strategies used to accomplish the migration. Strategies include: adopting a validatable data model, abstracting the datastore behind an API, separating concerns, testing metadata mappings against real digital objects, using reports to understand the data, templating unit tests, performing a rolling migration, and incorporating the migration into ongoing project work. These strategies may be useful to other repository or digital library application migrations.

by Justin Littman, Mike Giarlo, Peter Mangiafico, Laura Wrubel, Naomi Dushay, Aaron Collier, Arcadia Falcone

Introduction

For the past four years, a team at Stanford University Libraries (SUL) has been working on and off toward migrating to a new datastore and data model for the Stanford Digital Repository's (SDR) digital object metadata. [1] The goal of this paper is to describe the motivations for this work and some of the strategies used to accomplish the migration. These strategies may be useful to other repository or digital library application migrations.

About the Stanford Digital Repository

In support of research, teaching, and learning, the Stanford Digital Repository is an ecosystem of applications, systems, and services that houses the digital collections of Stanford University Libraries. Collections housed by SDR include:

- Google-scanned books
- Stanford dissertations and theses
- University Archives
- Allen Ginsberg papers
- Buckminster Fuller papers
- Parker Library
- Fugitive U.S. Agencies Web Archive

At the time of writing, SDR has almost 5 million digital objects composed of more than 530 million content files. [2] SDR is extremely heterogeneous along several facets, including content types (e.g., books, images, web archives, GIS datasets) and file types (e.g., XML, TIFF, WARC, MP4).

Repository Ecosystem & Data Flows

There are multiple avenues for users to deposit digital objects (metadata and content files) in SDR. These include, but are not limited to:

- A web-based self-deposit interface for single-item deposit by researchers (Stanford faculty, students, and staff)
- A web-based self-deposit interface for theses and dissertations by Stanford graduate and undergraduate students
- A web-based interface for administrative library staff that allows for bulk deposit
- API-based deposit (currently internal library staff use only)
- Customized third-party software optimized for managing digitization of physical materials by Stanford University Libraries staff

SDR interacts with other digital library applications at SUL. In particular, it deposits digital objects into the Preservation System [3] and publishes digital objects to the Access System. SDR also retrieves descriptive metadata from Symphony, SUL's integrated library system. [4] The preservation and access systems are not discussed here, though briefly: the preservation system provides redundant onsite and offsite backup, with validity and audit checks; the access system provides web-based search, discovery, viewing (e.g., via IIIF), and download. Throughout this ecosystem, varying levels of discovery, access, and embargo restrictions can be set and are enforced.
Metadata Management Store

SDR has been in operation since 2006. For at least the last 11 years [5], it relied on Fedora 3 as the datastore for its digital object metadata. SDR's digital object metadata includes description, identification, access/rights, structure, and other administrative metadata. The digital object's content files are not stored in Fedora.

While Fedora 3 has been SDR's "workhorse" over the years, it suffers from a number of critical shortcomings:

- Fedora 3 is no longer supported and was last released in 2015.
- Fedora 3 relies on Java 8, which has been end-of-life since 2019.
- The basic units of metadata storage in Fedora 3 are XML files stored on disk ("datastreams" in Fedora parlance). These datastreams are, by design, schema-less and not validated.
- The Ruby libraries supporting Fedora 3 (ActiveFedora 8.x and Rubydora) are no longer under active development, going moribund in 2018, which constrained our ability to update to the latest versions of the programming language (Ruby) and web application framework (Rails) used throughout SDR.
- Fedora 3 does not support transactions, meaning that problems like network blips can result in digital objects with incomplete or otherwise broken metadata.
- Fedora 3 does not support constraints (e.g., uniqueness), allowing problematic digital object metadata to enter the repository.
- By itself, Fedora 3 does not support advanced querying of digital object metadata. In SDR, querying is provided by an instance of the Solr full-text search service, which must be kept in sync with the underlying digital object metadata datastore.
- Fedora 3 is not designed for load balancing and thus becomes a single point of failure under heavy load, even if other parts of the system are load balanced. We have observed this behavior in SDR, where Fedora often "falls over" under load.

In addition, the SDR application and data ecosystem suffered from critical shortcomings:

- SDR applications were tightly bound to Fedora, interacting directly with Fedora to update XML datastreams.
- Many SDR applications took advantage of the flexibility of Fedora's datastreams to store application-specific metadata.
- Without the constraints of a schema or validation, the XML metadata was inconsistent due to:
  - Each application directly manipulating XML.
  - Users being empowered to manually edit XML datastreams in SDR's management application as the endorsed way to make certain types of functional changes to objects.
  - Abandoned metadata approaches and legacy applications left unremediated and divergent.
  - Low prioritization of auditing and quality-check tools for the entire corpus of SDR XML.

The totality of these shortcomings motivated the migration to a new metadata management store and a rich, validatable data model for SDR's digital object metadata. [6] In the rest of this paper, we describe some of the strategies used to accomplish this migration, focusing on the motivation and logic of these strategies rather than on code or implementation details.

Strategy #1: Adopt a Validatable Data Model

One of the crucial lessons from SDR is that the data model for digital object metadata should be validatable, meaning that an instance of digital object metadata can be checked to ensure that it conforms to the data model. The primary means of making the data model validatable was to specify a schema. In addition to providing validatability, the schema serves as documentation for the data model.
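To make the idea of a validatable data model concrete, the following is a minimal sketch in Ruby of checking a fragment of digital object metadata against a schema. The property names, the schema itself, and the use of the json_schemer gem are illustrative assumptions only; SDR's actual schema and tooling (OpenAPI and generated Ruby classes) are described below.

```ruby
require 'json_schemer' # assumes the json_schemer gem is installed

# Illustrative schema fragment for a digital object; not SDR's actual model.
schema = {
  'type' => 'object',
  'required' => %w[type label],
  'properties' => {
    'type'  => { 'type' => 'string', 'enum' => %w[book image map] },
    'label' => { 'type' => 'string' }
  }
}

schemer = JSONSchemer.schema(schema)

# A conforming instance passes validation.
schemer.valid?({ 'type' => 'book', 'label' => 'My book' }) # => true

# A non-conforming instance can be rejected, with errors enumerated.
schemer.validate({ 'type' => 'zine' }).to_a # => array of validation errors
```

The point of a validatable model is that checks like these can be applied automatically every time metadata is created or changed, rather than relying on each application to produce well-formed metadata.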
We [7] developed a home-grown data model for digital object metadata known as "Cocina". Cocina digital object metadata is serialized as JSON, and the schema is represented as an OpenAPI specification. [8] OpenAPI was selected for the schema because:

- It is widely adopted and has excellent tooling in multiple programming languages.
- The same OpenAPI specification could be used as a specification for the API applications that "speak" Cocina.
- OpenAPI provides a balance between ease of use and powerful validation.

Critical to integrating Cocina across our applications was having a library that provides Ruby classes for Cocina. Thus, Cocina digital object metadata is manipulated in code as Ruby objects, not as JSON. The Cocina classes are created by a code generator [9] that we authored to transform a subset of OpenAPI into Ruby. As changes are made to the Cocina model, the OpenAPI specification is modified and new Cocina Ruby classes are generated. The crux of this approach is that when a new Cocina object is instantiated, it is automatically validated against the OpenAPI schema. [10] Thus, digital object metadata is validated at the code level and at the API level, and in both cases the validation is performed using the same mechanism, viz., the OpenAPI specification.

There is a small subset of validations that cannot easily be represented in OpenAPI, for example, validating that the access rights at the object level are consistent with the access rights at the file level. For these, validation is performed in code that is also executed whenever a new Cocina object is instantiated. Together with the OpenAPI validation, this gives us high confidence in the consistency of Cocina digital objects. [11]

It is worth noting that there is one part of the Cocina data model where this is not strictly true. The Cocina data model for descriptive metadata was architected by Arcadia Falcone (SDR's metadata specialist) to allow for mapping to multiple metadata formats (e.g., MODS and DataCite). As such, it was structured to have a great deal more flexibility than other parts of the Cocina data model. While the basic structure of the descriptive model is validated via OpenAPI, additional semantic validation is defined elsewhere in the Ruby software library via a YAML file. [12] This file enumerates lists of terms (e.g., valid note types) against which the values of selected properties may be checked at Cocina object creation time. These lists change more frequently than the model as a whole, so storing them separately allows the OpenAPI specification to remain more stable. In addition, the YAML file is used as the basis for documentation aimed at metadata creators who do not need the full model documentation.

Strategy #2: Abstract the Datastore Behind an API

To address the tight binding between applications and the digital object metadata datastore (i.e., Fedora 3), the datastore has been abstracted behind an API. This API, known as the DOR [13] Services Application (DSA), allows applications to create, retrieve, and update Cocina digital objects, and provides a number of other repository functions such as initiating accessioning, managing versions, and recording events. Probably the greatest effort in the migration process has been to rewrite or refactor the existing applications to use Cocina objects instead of Fedora objects internally and to interact with DSA.
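To illustrate the effect of this abstraction, here is a minimal, hypothetical sketch in Ruby (standard library only) of how a client application might retrieve and update a digital object's metadata through an HTTP API rather than writing Fedora datastreams directly. The base URL, endpoint paths, property names, and identifier are invented for illustration and are not DSA's actual API.

```ruby
require 'net/http'
require 'json'
require 'uri'

DSA_BASE = 'https://dsa.example.edu' # hypothetical service URL

# Retrieve a digital object's metadata as Cocina-style JSON.
def fetch_cocina(druid)
  uri = URI("#{DSA_BASE}/v1/objects/#{druid}")
  res = Net::HTTP.get_response(uri)
  raise "Unexpected response: #{res.code}" unless res.is_a?(Net::HTTPSuccess)
  JSON.parse(res.body)
end

# Send updated metadata back; the API, not the caller, is responsible for
# validating it and persisting it to whatever datastore sits behind the API.
def update_cocina(druid, cocina_hash)
  uri = URI("#{DSA_BASE}/v1/objects/#{druid}")
  req = Net::HTTP::Patch.new(uri, 'Content-Type' => 'application/json')
  req.body = JSON.generate(cocina_hash)
  Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
end

cocina = fetch_cocina('bb123cd4567')
cocina['label'] = 'A corrected title'
update_cocina('bb123cd4567', cocina)
```

Because callers only ever see Cocina JSON over HTTP, the datastore behind the API can be changed without rewriting every application, which is what made the rolling migration described later possible.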
This has been an iterative process, as we built out parts of the Cocina model and DSA functionality as needed for each SDR application being worked on (see Strategy #8). Prior to migration, DSA internally performed a real-time two-way mapping between Cocina objects and Fedora objects. [14] Thus, the basic activities of DSA prior to migration were:

When an application requested a Cocina object from DSA:
1. A Fedora object was retrieved from the Fedora datastore.
2. The Fedora object was mapped to a Cocina object.
3. The Cocina object was returned to the application.

When an application provided a Cocina object to be created:
1. A new Fedora object was instantiated.
2. The Fedora object was updated based on the Cocina object.
3. The Fedora object was saved to the Fedora datastore.

When an application provided a Cocina object to be updated:
1. An existing Fedora object was retrieved from the Fedora datastore.
2. The Fedora object was updated based on the Cocina object.
3. The Fedora object was saved to the Fedora datastore.

Given the complexity of the Fedora and Cocina data models and the substantial heterogeneity of the existing digital objects in SDR, this mapping code was of significant complexity; strategies for managing this are described below.

With all of our applications decoupled from Fedora, DSA became the sole application interacting directly with Fedora; at this point, DSA was itself tightly coupled with Fedora. One of the final phases of the migration was to encapsulate all Fedora interaction code within DSA in a dedicated service class known as the Cocina Object Store. All other services in DSA manipulated only Cocina objects; the mapping between Cocina objects and Fedora objects happened within the Cocina Object Store. This allowed a rolling migration, as detailed in Strategy #7.

Strategy #3: Separate Concerns

One way of reducing complexity in the digital object data model and making the migration more manageable was to promote the separation of concerns, whereby services with distinct responsibilities were implemented in distinct applications. Also part of the separation of concerns was using distinct datastores for individual application-specific services, rather than storing metadata centrally in Fedora datastreams.

One such example is technical metadata. Previously, technical metadata was generated by JHOVE and stored as XML in its own Fedora datastream. This approach had a number of shortcomings [15]:

- Technical metadata was tightly bound to Fedora.
- As the syntax of the technical metadata created by JHOVE changed over time, the metadata grew inconsistent.
- There was no efficient way to query the technical metadata.

This approach was replaced by a dedicated technical metadata service. Implemented as a Rails application with an API, the Technical Metadata Service uses its own relational datastore for storing and querying technical metadata. It also obviated the need to include technical metadata (which is itself very complex and verbose) in the Cocina data model. This approach was applied to various other services (e.g., events and workflow) that previously were tightly bound to Fedora in a single monolithic application. Another benefit of this approach is that if and when these specific services or technologies change, the change can be made in isolation as long as the API is maintained.

Strategy #4: Test Mappings Against Real (Cached) Digital Objects

As mentioned previously, the two-way mapping between the Cocina data model and the Fedora data model was developed iteratively.
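To make the two directions of this mapping concrete, the following is a greatly simplified, hypothetical sketch in Ruby of what a mapper for a single property might look like. The XML shape, the property name, and the use of the Nokogiri gem are illustrative assumptions; the actual SDR mappings span many datastreams and far more properties.

```ruby
require 'nokogiri' # assumes the nokogiri gem; any XML library would do

# Hypothetical two-way mapping for a single property between a
# Fedora-style XML datastream and a Cocina-style hash.
module LabelMapper
  # Fedora -> Cocina: extract the label from a content-metadata-like fragment.
  def self.to_cocina(xml)
    doc = Nokogiri::XML(xml)
    { 'label' => doc.at_xpath('//resource/label')&.text }
  end

  # Cocina -> Fedora: regenerate the datastream XML from the Cocina hash,
  # always using the current serialization regardless of legacy variants.
  def self.to_fedora(cocina)
    Nokogiri::XML::Builder.new do |x|
      x.resource { x.label cocina['label'] }
    end.to_xml
  end
end

legacy_xml  = '<resource><label>My label</label></resource>'
cocina      = LabelMapper.to_cocina(legacy_xml)   # => {"label"=>"My label"}
regenerated = LabelMapper.to_fedora(cocina)       # regenerated datastream XML
```

Even in this toy form, the roundtrip (Fedora to Cocina and back) is the unit that needs testing: the regenerated XML must carry the same meaning as the original, even when the original used an older serialization.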
Our initial approach was to rely on unit tests to ensure the accuracy of the mappings. Almost immediately, however, we encountered problems. Despite the unit tests and testing in a staging environment, each new release of mapping code brought a cavalcade of bugs, some breaking entire SDR applications, others affecting individual digital objects. [16] A bug might be an actual unhandled software error or a mismapping. While this was in part due to the complexity of the mapping, the core problem was the heterogeneity of our digital object metadata. (The reasons for this heterogeneity are articulated above.) In essence, we did not know all of the cases that the mapping had to handle or that the unit tests had to cover.

To address this, we tested each mapping change against a large set of actual digital object metadata (for a typical code change, the metadata for 100,000 digital objects). Testing at this scale had a significant barrier: repeatedly retrieving digital object metadata from Fedora took too long, and testing against actual digital object metadata meant impacting the production Fedora instance. To overcome this, a read-only cache was created in which the datastreams for each digital object were stored in a zip file. The cache was kept up to date by a process (called "cache-o-matic") that queried Solr for recently changed digital objects and regenerated the cache files for those objects. The cache was much faster than using Fedora (in part because the cache used high-speed storage) and had a smaller impact on production.

To verify a change, a developer would run "before" and "after" tests. These tests allowed the developer to verify that a proposed mapping change would improve roundtrip mapping (or at least not make it any worse) and would not introduce an unhandled software error. For each digital object in the test:

1. A Fedora object is instantiated from the cache.
2. The Fedora object is mapped to a Cocina object.
3. The Cocina object is run through object creation, resulting in a new Fedora object.
4. The datastreams in the original Fedora object are normalized and compared against the datastreams in the new Fedora object.
5. The new Fedora object is mapped to a new Cocina object.
6. The new Cocina object is compared against the original Cocina object.

The normalization in step 4 warrants explanation. There are a number of reasons that normalization might be necessary, but in general it is to account for the heterogeneity of the data. For example, in early SDR, labels in the content metadata datastream were expressed as:

My label

Later this changed to:

The Fedora-to-Cocina mapping knew how to handle both cases. However, when mapping from Cocina to Fedora to generate content metadata, a label was always mapped to