The Code4Lib Journal – Toward element-level interoperability in bibliographic metadata
Issue 2, 2008-03-24
Toward element-level interoperability in bibliographic metadata
This paper discusses an approach and set of tools for translating bibliographic metadata from one format to another. A computational model is proposed to formalize the notion of a ‘crosswalk’. The translation process separates semantics from syntax and specifies a crosswalk as machine-executable translation files that are focused on assertions of element equivalence and are closely associated with the underlying intellectual analysis of metadata translation. A data model developed by the authors, called Morfrom, serves as an internal generic metadata format. Translation logic is written in an XML scripting language designed by the authors called the Semantic Equivalence Expression Language (Seel). These techniques have been built into the Crosswalk Web Service, an OCLC software toolkit for managing large and diverse collections of metadata records.
by Carol Jean Godby, Devon Smith and Eric Childress
1. Who uses interoperable records?
The winters are gloomy in Columbus, Ohio, so Susan decides to take an online course on memoir writing. After reviewing one of her homework assignments, her instructor emails her late one night with the recommendation that she take a close look at how Homer Hickam, in his celebrated book Rocket Boys, manages to craft gritty but lyrical descriptions of his boyhood home, a coal mining town in a remote valley of West Virginia in the 1950s. Susan needs this book tomorrow morning, so she wonders if she can get the book from a nearby library. She goes to WorldCat.org [1] and types the title into the search box. The top-ranked result shows that the book is available from Worthington Public Library, just ten miles away from her house.
This paper describes one component in a complex process that enables WorldCat.org to deliver information like this to library patrons. How does this service, which can be accessed from the Internet anywhere in the world, recognize that a library close to Susan has the book she wants?
At the start of the process, third-party jobbers obtain orders from a collection of libraries, which they fill by placing a bulk order with an online bookseller. The Worthington Library’s copy of Rocket Boys and the other books in the same order are located and shipped from the bookseller’s warehouse using the information in a record coded in ONIX [2], the relatively new but emerging de facto standard for descriptions of materials handled by the publishing industry. The jobber’s order is also sent to OCLC to determine whether it can be matched with the corresponding MARC [3] records in the WorldCat® database [4]. A software module translates some of the elements in the ONIX record to MARC and uses them to search for a matching record. Since Rocket Boys has already been cataloged, the WorldCat record is updated with a flag indicating that the book will soon be held by the Worthington Public Library. WorldCat.org now has the essential information required to connect Susan with her book.
2. Translating bibliographic metadata: a short progress report
The above scenario is an example of how the library community interacts with publishers to deliver value to patrons. The workflow isn’t particularly complex, nor is the relationship between libraries and publishers especially problematic, because the negotiations are about well-understood resources that must simply be transferred from one community to another. Nevertheless, the achievement of this goal requires that two descriptive standards be synchronized. Both are sophisticated. Both are entrenched in their respective communities of practice, and both are necessary because they annotate the outcomes of events involving the book in two different contexts. The ONIX standard originated from the need to identify and track books as they are bought and sold. The transaction typically involves a bookseller Web site such as Amazon.com, which links a physical item to a machine-readable description and associates it with ancillary materials, including Web sites, study guides, cover art, tables of contents, reviews, and digitized pages, all used to create a rich context for interacting with the book before and after a purchase. The MARC standard evolved from the need to create an authoritative description establishing the authorship, subject, and provenance of a work that can be interpreted as an archival record with many anticipated uses in the library community. The MARC record has relatively few links to commercial transaction data or to elements required to populate a given Web site, while the ONIX record is less rich in links to name and subject authorities. But the two descriptions are not incompatible.
2.1. The crosswalk.
In fact, there is considerable overlap because the ONIX and MARC descriptions of Rocket Boys refer to the same book. But it takes the effort of an expert familiar with the formal specifications as well as the normal use cases of both standards to tease out the commonalities and record them, typically in a table of correspondences called a crosswalk [5], such as a sample of the ONIX-to-MARC mapping maintained by the Library of Congress [6] shown in Table 1:
Title composite
The Title composite contains the text of a title, including a subtitle when necessary. One of its elements (b202) is used to set the second indicator of field 246: if b202 = 00, 246 I2 = #; if b202 = 01, 246 I2 = 2. The elements of the composite map to MARC fields 245 $a, 245 $b, 246 $a, and 246 $b.
Table 1. A fragment of an ONIX-to-MARC crosswalk
If the goal is to share records between libraries and publishers, clear correspondences between ONIX and MARC must be established among elements that identify authors, titles, publishers, dates of publication, numeric identifiers, editions, physical formats, summaries, tables of contents, or subjects. It is less clear what to do with descriptions of the bundled items, such as associated Web sites and user reviews that can have their own complete bibliographic descriptions. But the easy correspondences are good enough to conduct much of the day-to-day work in digital libraries and library support organizations, including record creation, normalization, and enhancement; database searches and retrievals; and the automatic generation of rough drafts for archival records that would be completed by human experts.
Nevertheless, we have discovered that even the slightest incompatibilities among record formats produce huge backlogs that translate to unfulfilled queries from users. Consider a few elements from some of the major standards for representing bibliographic information that name primary book titles. In the MARC to Dublin Core crosswalk, 245 $a maps to Dublin Core (DC) title, a relationship that persists through XML, ASN.1/BER [7], RDF [8], and various standard or non-standard human-readable displays. It also persists through multiple versions and editions of both standards, such as OAI-PMH Dublin Core [9], MARC-XML [10], MARC 2709 [11], and Versions 1.1 and 1.2 of Unqualified Dublin Core (or DC-Simple) [12], as well as Dublin Core Terms [13] (or Qualified DC). The MODS standard can also be reasonably added to this mix. MODS [14] is a slight simplification of MARC designed for the description of digital library resources; its title concept, the title element within titleInfo, is also mapped to 245 $a and DC:title, according to crosswalks available from the Library of Congress. MARC 245 $a also maps to several ONIX elements, among them the title text element of the Title composite shown in Table 1. In addition, crosswalks are being developed between Dublin Core and ONIX that map DC:title to the same ONIX title elements.
In sum, crosswalks exist between MARC and Dublin Core, ONIX and MARC, ONIX and Dublin Core, MODS and MARC, and MODS and Dublin Core. For a single concept, perhaps corresponding to an educated layman’s understanding of title, the relationships among the standards are consistent and easy to implement. But in its current form, this information is difficult to use for automated processing. It requires corresponding technical resources such as XML schemas and XSLT scripts, which are all too often unavailable, broken, or obsolete. The relative scarcity of machine-processable files implies that it is difficult to transform this content into an executable format and maintain it. But perhaps we need to acknowledge that the antecedent step of intellectual analysis is also difficult. It is a nontrivial task to determine whether existing crosswalks are sufficient to perform a translation, or how they can be modified or enhanced. If crosswalks need to be developed, they are often created from scratch even though they might be algorithmically derived. These problems imply a need for a better formal model of the crosswalk: a model that makes it easier to do the intellectual analysis, one that more closely associates the executable files with the analysis, and one that encodes a realistic proposal for managing different versions and encodings of standards.
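For example, where machine-processable crosswalks through a shared intermediary already exist, a new pairwise crosswalk can in principle be derived by composing them rather than written from scratch. The fragment below is only a sketch of that idea, assuming crosswalks are available as simple element-to-element tables; the element names and mappings are illustrative and are not drawn from any published crosswalk.

    # Sketch: deriving a MARC-to-ONIX crosswalk by composing two existing
    # crosswalks through a shared pivot (here, unqualified Dublin Core).
    # The mappings are illustrative placeholders, not authoritative.
    marc_to_dc = {"245 $a": "title", "100 $a": "creator", "260 $b": "publisher"}
    dc_to_onix = {"title": "TitleText", "creator": "PersonName", "publisher": "PublisherName"}

    def compose(first, second):
        """Chain two crosswalks, keeping only elements that survive both hops."""
        return {src: second[mid] for src, mid in first.items() if mid in second}

    marc_to_onix = compose(marc_to_dc, dc_to_onix)
    # {'245 $a': 'TitleText', '100 $a': 'PersonName', '260 $b': 'PublisherName'}

Real crosswalks are rarely this clean, of course; the point is only that the intellectual work already recorded in existing mappings can be reused algorithmically.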
2.2. Toward schema-level interoperability.
Using the terms defined in Chan and Zeng’s progress report on interoperability [15,16], the problem we have defined is in the scope of schema-level interoperability because we are trying to identify the common ground among formally defined metadata schemas independently of the short-term need to process instance records. But our goal is a conceptual framework that will make this job easier because we want a solution that can be reused for different kinds of tasks and applications.
Chan and Zeng identify several conceptual objects that have emerged from the effort to achieve this kind of interoperability. In addition to the crosswalk, which is the focus of our work, they also describe derivations, switching-across schemas, and application profiles, all of which have a place in the solution outlined in subsequent sections of this article. A derivation results when a new schema is created from an antecedent, as when MODS was derived from MARC to create a standard that is more appropriate for software systems that manage descriptions in digital library collections. A switching-across schema is a common format, which we call an interoperable core in a hub-and-spoke model [17], into which multiple schemas can be mapped to create a highly efficient processing model. And an application profile is a hybrid schema that is created by combining elements from more than one schema, observing certain restrictions on how they are used [18]. Application profiles achieve interoperability by reusing elements defined in existing schemas and creating new elements only when necessary. For example, Dublin Core Terms is an application profile that extends the Dublin Core Element Set with a rich vocabulary for describing relationships, dates, and audience levels, among other concepts.
Crosswalks, derivations, hub-and-spoke models, and application profiles respond to the need to identify common ground in the complex landscape of resource description. But these objects also imply an unresolved tension between the need to minimize the proliferation of standards and the need to create useful machine-processable descriptions of resources. We concentrate on the crosswalk, partly because of the institutional and technical context of our work, but also because we are primarily interested in devising reusable solutions by identifying and exploiting the common ground in metadata standards. But we also believe that the crosswalk is more fundamental because it underlies two of the objects that Chan and Zeng describe: the derivation, which requires a crosswalk to map elements from the parent schema to the child; and the hub-and-spoke model, which requires a crosswalk to map related schemas to an interoperable core.
But the crosswalk is built up from an even more fundamental unit of analysis: a single element defined in a standard, such as title in Dublin Core, MARC, MODS, and ONIX. Like words in a natural language, metadata elements can be extracted from one context and combined into new ones, forming application profiles. They can also enter into semantic relations, such as the equivalences defined by crosswalks, which resemble synonymy. When the focus is on the element, Chan and Zeng’s schema-level interoperability reduces to the problem of element-level interoperability, whose technical infrastructure is a baseline on which more comprehensive systems can eventually be built. And as our opening scenario shows, commercially useful work can be done here, work that is now being done redundantly and wastefully, if at all.
Here we describe a generic computational model that permits someone with an expert’s knowledge of the metadata standards used in libraries, museums, education, the publishing industry, or other contexts that have well-defined data models, to develop executable translations by creating a list of equivalences that looks like the crosswalk shown in Table 1. The model is structured so that the standards expert can concentrate on mapping equivalent elements, letting the software handle the messy details that emerge from multiple structural realizations, such as those that come up when the title concept is identified in MODS, MARC, Dublin Core, and ONIX. Because the model closely resembles the human-produced mappings, relatively little programming is required to make this data actionable—in fact, the process of converting crosswalks to executable code can be largely automated. Once the system is seeded with the most important crosswalks, it can run with minimal human intervention. Only minor tweaks are required to add new standards or accommodate new versions or translations.
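As a toy illustration of that claim, a crosswalk maintained as a plain two-column list of equivalences, roughly in the spirit of Table 1, can be compiled into an executable lookup with almost no programming. The format and element names below are invented for the example and are not the representation used by the Crosswalk Web Service.

    # Hypothetical two-column crosswalk (source element, target field),
    # compiled automatically into a mapping that software can apply.
    crosswalk_table = """
    TitleText     245 $a
    Subtitle      245 $b
    PersonName    100 $a
    """

    def compile_crosswalk(table):
        mapping = {}
        for row in table.strip().splitlines():
            source, target = row.split(None, 1)
            mapping[source] = target.strip()
        return mapping

    print(compile_crosswalk(crosswalk_table)["TitleText"])   # prints: 245 $a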
The system described below is the core logic of the Crosswalk Web Service, which represents an outcome of the research program first reported in Godby, Childress and Smith [17] and Childress, Godby and Young [19], and is currently being used to power the metadata crosswalking facility in OCLC Connexion®, as well as many offline services that process large sets of records. A publicly accessible demo will soon be available from OCLC’s ResearchWorks page [20].
The Crosswalk Web Service is part of a larger effort to re-engineer OCLC’s products as a collection of modular, configurable services for harvesting, validating, enhancing, and manipulating descriptions of resources, many of which are built from open source software and exposed through Web service interfaces. The result is an architecture that reduces internal duplication of effort and exposes critical functionality to any partner, client, or member that recognizes a common set of standards and has an Internet connection.
3. A process model for record translation
Figure 2 depicts the major steps in the Crosswalk Web Service process model. It is designed to separate the semantics from the syntax of the translation task, from which many desirable consequences follow, including reusable components and simplified translation scripts, as we argue below. But the most important consequence is that the task of creating and interpreting crosswalks remains abstract, so the metadata subject matter expert can concentrate on defining mappings without getting bogged down by complex structural detail. We will illustrate the process by walking through the translation of a simple description of Rocket Boys that starts out as an ONIX record (shown in Figure 3).
Figure 2. The translation model
3.1. A brief overview.
Input records conforming to the source metadata standard at State A are converted to a common XML structure at Step 1 and are then translated to the semantics of the target metadata standard at Step 2. The result of the translation is a set of normalized XML records at State C, which are converted to the native structure of the target metadata schema at Step 3, producing output records at State D. Only the first and last steps of the translation model have access to the syntactic details of the native standards. In the internal Step 2, the translation operates on a common, locally defined structure.
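As a rough sketch of this separation, the fragment below models the three steps as independent functions. It is not the Crosswalk Web Service code: the record representation is deliberately reduced to name-value pairs, the reader and writer are stubs, and the element names are placeholders.

    # Illustrative three-step pipeline: reader -> translation -> writer.
    # Only the middle step consults the crosswalk; only the outer steps
    # know the native syntaxes of the source and target standards.
    def read_source(native_record):
        # Step 1 (stub): parse a native source record into the internal form.
        return [("TitleText", "Rocket boys : a memoir"), ("PersonName", "Homer Hickam")]

    def translate(internal_record, crosswalk):
        # Step 2: rename elements according to a table of equivalences.
        return [(crosswalk[name], value) for name, value in internal_record if name in crosswalk]

    def write_target(internal_record):
        # Step 3 (stub): serialize the translated record in the target's native syntax.
        return "\n".join(f"{name} {value}" for name, value in internal_record)

    onix_to_marc = {"TitleText": "245 $a", "PersonName": "100 $a"}
    print(write_target(translate(read_source("<Product>...</Product>"), onix_to_marc)))

Swapping source or target standards then means swapping a reader or writer and a crosswalk, while the translation step itself is unchanged.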
3.2. Step 1: Reading the input.
A utility program, informally called a reader, converts an input metadata record to the internal structure required by the translation software in our model. This structure is an XML container we have named Morfrom, whose document type definition contains four elements: record, the top-level element; header, which identifies a default namespace for the record and links to a resource containing a definition of the elements that populate it; field, whose name attribute identifies an element in the native standard; and value, which defines a path to the data. The field elements preserve the order and hierarchical structure of the elements in the native record, while the name attributes on field preserve their names. Finally, the value elements enclose the unique data in the record.
Figure 3 shows the result of running the ONIX reader on our sample record. As shown, this record contains author, title, and subject elements that will be translated. By convention, Morfrom records are designated with the .mfm suffix.
Figure 3. The input record in native ONIX syntax and Morfrom
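Figure 3 itself is not reproduced in this text, but the element inventory above suggests roughly the following shape for the Morfrom side of the record. The markup is a guess for illustration only: the header attribute, field names, and values are placeholders and may differ from the actual figure.

    # A guessed Morfrom-style record assembled from the four elements named
    # above (record, header, field, value); attribute names and data values
    # are placeholders, not copied from Figure 3.
    import xml.etree.ElementTree as ET

    morfrom_guess = """
    <record>
      <header namespace="info:example/onix"/>
      <field name="Title">
        <field name="TitleText">
          <value>Rocket boys : a memoir</value>
        </field>
      </field>
      <field name="PersonName">
        <value>Homer Hickam</value>
      </field>
    </record>
    """

    root = ET.fromstring(morfrom_guess)
    for field in root.iter("field"):
        for value in field.findall("value"):
            print(field.get("name"), "=", value.text)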
3.3. Step 2: The translation.
The key component of the processing model is Step 2 in Figure 2, where the translation takes place. The translation process has two inputs: the translation logic, and a Morfrom-encoded record in the source metadata standard, i.e., the record containing ONIX elements shown in the bottom portion of Figure 3.
The translation logic is written in an XML scripting language we have designed and named Seel, an acronym for Semantic Equivalence Expression Language. It is processed by a custom-written interpreter, which performs the essential function of XSLT or other easily written scripting languages, but represents a computational model of the crosswalk that keeps the defining feature, the assertions of equivalence, in the foreground. The Seel language specification and interpreter bear some resemblance to the Jumper package [21], but Seel has more features for expressing fine-grained details of conditional translations, some of which we describe below.
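As a language-neutral illustration of what a conditional translation involves, the fragment below restates in Python the one conditional rule quoted in Table 1, in which the ONIX code b202 selects the second indicator of MARC field 246. It is not Seel and makes no claim about Seel's syntax, which is shown in the figures that follow.

    # A plain-Python restatement of the conditional rule quoted in Table 1:
    # the ONIX title type code (b202) sets the second indicator of MARC 246.
    def marc_246_indicator2(title_type_code):
        if title_type_code == "00":
            return "#"
        if title_type_code == "01":
            return "2"
        return None  # other codes fall outside the fragment shown in Table 1

    for code in ("00", "01"):
        print(code, "->", marc_246_indicator2(code))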
Figure 4 shows a complete Seel script that can process everything but the subject elements in the sample input, which are treated in Figure 5. A Seel script is enclosed in a single root element, which has two children. The first specifies the metadata standards in the source and target of the translation. One or more