The Code4Lib Journal
Code4Lib Issue 36, 2017-04-20

Recommendations for the application of Schema.org to aggregated Cultural Heritage metadata to increase relevance and visibility to search engines: the case of Europeana

Europeana provides access to more than 54 million cultural heritage objects through its portal Europeana Collections. It is crucial for Europeana to be recognized by search engines as a trusted authoritative repository of cultural heritage objects. Indeed, even though its portal is the main entry point, most Europeana users come to it via search engines. Europeana Collections is fueled by metadata describing cultural objects, represented in the Europeana Data Model (EDM). This paper presents the research and consequent recommendations for publishing Europeana metadata using the Schema.org vocabulary and best practices. Schema.org enables metadata embedded in HTML to be consumed by search engines to power rich services (such as Google Knowledge Graph). Schema.org is an open and widely adopted initiative (used by over 12 million domains) backed by Google, Bing, Yahoo!, and Yandex, for sharing metadata across the web. It underpins the emergence of new web techniques, such as so-called Semantic SEO. Our research addressed the representation of the embedded metadata as part of the Europeana HTML pages and sitemaps so that the re-use of this data can be optimized. The practical objective of our work is to produce a Schema.org representation of Europeana resources described in EDM that is as rich as possible and tailored to Europeana’s realities and user needs as well as to search engines and their users.
by Richard Wallis, Antoine Isaac, Valentine Charles, and Hugo Manguinhas

Introduction

This article captures the core output of a project between the authors in which Richard Wallis, as an external consultant, was engaged to advise Europeana on the potential for, and ramifications of, introducing detailed Schema.org structured data into Europeana websites.

Europeana provides access to more than 54 million cultural heritage objects through its portal Europeana Collections[1]. The portal is fueled by metadata describing digitized cultural objects from cultural institutions. This metadata results from a long curation and aggregation process and is represented in the Europeana Data Model[2] (EDM). It is therefore crucial for Europeana to be recognized as a trusted repository of cultural heritage on the web and as an authoritative resource by search engines. Indeed, even though the portal is considered the main entry point for searching for cultural heritage, most Europeana users come to it via search engines, especially Google.

The Schema.org vocabulary[3] provides a way to embed metadata into web pages for direct consumption by search engines such as Google to power rich user services (such as the Google Knowledge Graph). Schema.org is an open initiative, backed by Google, Bing, Yahoo!, and Yandex, that has established a widely adopted practice (used by over 12 million domains) and vocabulary for sharing metadata across the web. Its broad adoption underpins the emergence of new techniques, such as so-called Semantic SEO.

Europeana started to research the benefits of Schema.org for its services, primarily the ability of the Schema.org vocabulary to enable external organizations in general, and search engines in particular, to crawl that data and add it to their Knowledge Graphs, thus enhancing the discoverability of cultural resources.
Our research led us to investigate, review, and propose ways forward for the utilization of the Schema.org vocabulary and associated technologies in Europeana data services. We first addressed the representation of Europeana resources using the Schema.org vocabulary. This work resulted in a set of data mapping recommendations taking into account various concerns such as the representation of literals versus URIs. We also addressed the publication of the embedded Schema.org data as part of the Europeana dynamic html pages and sitemaps using JSON-LD serialization. Our recommendations detail the practical techniques that need to be considered to optimize the sharing of this data via the Europeana websites and services. We propose several approaches to identify the effects of adding Schema.org to the sites and to produce analytics on fine-grained visiting patterns, which can be relevant for both Europeana and its data providers.

For these different areas, recommendations are made that correspond to the current situation as well as to a more ideal situation where data for concepts, persons and other associated entities will be available as part of Europeana data services. Even though these recommendations have yet to be implemented in Europeana, we think they can be useful to institutions interested in using Schema.org.

1. Data semantics and structure

The practical objective of our work is to produce a Schema.org representation of Europeana resources described in EDM that is as rich as possible and tailored to Europeana’s realities and user needs as well as to search engines and their users. Our mapping recommendations (see Appendix 1) identify Schema.org types and properties and the source EDM terms that could be used to derive data values or to suggest appropriate Schema.org types. For example, schema:name would be used for data derived from dc:title.
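A property mapping of this kind can be sketched as a simple lookup table. The snippet below is a minimal illustration and not the Appendix 1 mapping itself: only the dc:title to schema:name pair comes from the text above, while the dc:description and dc:creator pairs, the EDM_TO_SCHEMA table and the map_record helper are assumptions made for the example.

```python
# Hypothetical lookup table. Only "dc:title" -> "name" is taken from the article;
# the other two pairs are assumptions added for illustration.
EDM_TO_SCHEMA = {
    "dc:title": "name",
    "dc:description": "description",
    "dc:creator": "creator",
}


def map_record(edm_record):
    """Return a Schema.org-style dict holding only the EDM properties we can map."""
    mapped = {"@type": "CreativeWork"}
    for edm_property, value in edm_record.items():
        schema_property = EDM_TO_SCHEMA.get(edm_property)
        if schema_property is not None:
            mapped[schema_property] = value
    return mapped


example = map_record({
    "dc:title": "Mona Lisa | Leonardo da Vinci",
    "dc:rights": "not covered by the table, so it is dropped",
})
```

Properties without an entry in the table are silently dropped here; a production mapping would more likely log them so that gaps in coverage stay visible.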
Instead of elaborating on each individual mapping decision, the remainder of this section discusses the more general challenges, design decisions and recommendations for extending the coverage of available data. At first sight there are three major mapping challenges between EDM, as used by Europeana, and the more generic Schema.org vocabulary.

Note: In the examples of this article the output format for Schema.org data is assumed to be JSON-LD[4]. The reasoning for this choice is detailed in the JSON-LD output section below.

Challenge 1: Flatter Hierarchy

The first challenge is to map the EDM metadata structure to more generic (and flatter) Schema.org classes. EDM metadata is organized as a set of resources describing a Cultural Heritage Object (i.e., edm:ProvidedCHO and ore:Proxy, which represent different data sources for objects), one or more digital representations (i.e., edm:WebResource) and “contextual” resources (places, persons, concepts, timespans), in compliance with the one-to-one principle[5]. These resources are grouped together by an edm:Aggregation into one bundle that also records some information on the aggregation process (e.g., the providers of the metadata). This package of data contains many internal relationships linking resources to each other.

The Schema.org approach is to define a schema:CreativeWork (equivalent to the CHO) plus links to access the resource. The inevitable result of mapping to Schema.org will be a flatter representation of the data. In both cases, however, there are still relationships with associated contextual entities (in the EDM sense, schema:Person, schema:Organization, schema:Place, etc. are contextual entities). Most of the ‘flattening effect’ would be apparent for proxies and aggregations.

Figure 1. The four main classes of the EDM metadata structure and their relations.
Challenge 2: Identifying the Type of Thing

The core entity in EDM is the edm:ProvidedCHO, broadly equivalent to the Schema.org class schema:CreativeWork, with the type of object being defined via properties such as dc:type. However, using Schema.org it is preferable wherever possible to identify the specific type of ‘thing’ being described (schema:Thing being the base class in Schema.org from which all others inherit): schema:Book, schema:Painting, schema:Sculpture, schema:ImageObject, etc. Often this data is available within EDM, via properties such as dc:type and references to concepts from, e.g., the Getty Art and Architecture Thesaurus[6].

Web Resources represented as schema:MediaObject should also ideally be identified by more specific subtypes of schema:MediaObject (schema:ImageObject, schema:AudioObject, schema:VideoObject). Again, this will require mapping between MIME types, file extensions, etc. to ascertain the correct type. These mapping rules, the basis for which is referenced in the Data Mapping Recommendations Appendix below, will need to be evolved and established over time.

Challenge 3: (identified) Things vs Strings

Record vs. Entity view

EDM is a step towards a semantic approach to data modeling, but the Europeana data currently available provides a record-oriented view. References to entities, and their data, are included in a single output, and not all entities (organizations, persons, concepts, etc.) have separate URI identifiers. Ideally for Schema.org, output should be in an entity-based form, where each entity is described as such with its own URI, and references to other entities (the creator of an object, for example) are provided only as a URI. This approach is guided by Linked Data principles.[7] Satisfying this need poses both a data and a technical challenge.
Firstly, the data for each entity should be available in a form resembling an RDF Concise Bounded Description (CBD)[8], providing a graph-based view of the data for an individual entity and its relationship(s) with other entities. In simple terms this restricts the properties output to those directly associated with the entity being described (formed from the RDF triples having the entity’s URI as their subject). EDM object records, currently served by Europeana’s API, which seeks to reflect the type of records usually produced and consumed in the cultural sector[9], are a good base. However, further refinement may be needed, focusing on the separate description of individual entities (person, place, creative work, etc.) as opposed to aggregating them into a single record. EDM Proxies, which are essentially “derived” resources, should receive specific attention while assembling the data to be published.

Secondly, the systems need to be able to resolve and display each entity based upon its URI, supplying the embedded Schema.org data for it.

It is recognized that moving to fully entity-based data output will be an evolutionary process for Europeana (especially in terms of web identifiers and web pages associated with resources). We recommend in the interim that standard RDF blank node techniques (cf. point on blank nodes below) be used to separate out the notional entities within the combined data payload.

Expanding String Literals into Entities

Wherever possible, literals should be expanded to an entity description of the appropriate type. For example, the following in JSON-LD:

    "creator": "J. M. Smith"

should be expanded to:

    "creator": {
      "@type": "Person",
      "name": "J. M. Smith"
    }

This will explicitly inform those consuming the data of related entity types.
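The expansion described above can be sketched as a small helper. This is an illustrative sketch only: the expand_literal function and its rule for recognising URIs and blank node identifiers (a simple prefix check) are assumptions, not part of the recommendations.

```python
def expand_literal(value, schema_type):
    """Wrap a plain string in a typed entity description.

    Values that are already entity descriptions (dicts), or that look like
    URIs or blank node identifiers, are returned unchanged.
    """
    if isinstance(value, str) and not value.startswith(("http://", "https://", "_:")):
        return {"@type": schema_type, "name": value}
    return value


record = {"@type": "CreativeWork", "creator": "J. M. Smith"}
record["creator"] = expand_literal(record["creator"], "Person")
```

After the call, record["creator"] holds the typed Person description, while a creator that is already given as a URI would be left as it is.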
(See note below about blank nodes for preferred JSON-LD syntax.)

Entity Data, Blank Nodes, URIs

In JSON-LD output (as in some other RDF syntaxes) there are three formatting options in a combined payload for describing entities related to a resource, such as the creator in the examples below.

1. Implicit Blank Nodes (nested output): the data for the associated entity is provided inline as part of the resource description data:

    {
      "@id": "http://data.europeana.eu/item/2021618/internetserver_Details_kunst_25027",
      "@type": ["CreativeWork", "VisualArtwork"],
      "name": "Mona Lisa | Leonardo da Vinci",
      "creator": {
        "@type": "Person",
        "name": "Luigi Calamatta"
      }
    }

2. Explicit Blank Nodes: a blank node reference to the associated entity is provided in the resource description, with the entity description provided in the same data payload:

    {
      "@graph": [
        {
          "@id": "http://data.europeana.eu/item/2021618/internetserver_Details_kunst_25027",
          "@type": ["CreativeWork", "VisualArtwork"],
          "name": "Mona Lisa | Leonardo da Vinci",
          "creator": "_:p0"
        },
        {
          "@id": "_:p0",
          "@type": "Person",
          "name": "Luigi Calamatta"
        }
      ]
    }

3. Entity Reference: a resolvable URI is provided for the associated entity. It is for a consuming application or service to access that URI to obtain the entity description:

    {
      "@id": "http://data.europeana.eu/item/2021618/internetserver_Details_kunst_25027",
      "@type": ["CreativeWork", "VisualArtwork"],
      "name": "Mona Lisa | Leonardo da Vinci",
      "creator": "http://data.europeana.eu/item/p12345"
    }

The recommended ideal is option 3 (Entity Reference), but it is achievable only when entity descriptions with resolvable URIs have been created. Europeana is currently working on a “semantic Entity Collection” that acts as a centralised point of reference and access to data about contextual entities. This Entity Collection will provide resolvable URIs (at http://data.europeana.eu) for all the entities described in Europeana.
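The relationship between the nested form (option 1) and the explicit blank node form (option 2) can be sketched as a small transformation. The to_explicit_blank_nodes helper and its properties parameter are hypothetical names introduced for this example. The sketch follows the bare string reference style of the examples above; whether a JSON-LD processor reads a bare string such as "_:p0" as a node reference rather than a literal depends on the context in use, and the object form {"@id": "_:p0"} is the context-independent alternative.

```python
def to_explicit_blank_nodes(resource, properties=("creator",)):
    """Move nested entity descriptions into sibling nodes referenced by blank node ids.

    Only the listed properties are inspected; nested values that already carry
    an "@id" are left in place. The input dict is not modified.
    """
    main = dict(resource)  # shallow copy so the caller's data stays untouched
    graph = [main]
    counter = 0
    for prop in properties:
        value = main.get(prop)
        if isinstance(value, dict) and "@id" not in value:
            blank_id = f"_:p{counter}"
            counter += 1
            graph.append({"@id": blank_id, **value})
            # Bare string reference, as in the article's examples;
            # {"@id": blank_id} would be the object form.
            main[prop] = blank_id
    return {"@graph": graph}


nested = {
    "@id": "http://data.europeana.eu/item/2021618/internetserver_Details_kunst_25027",
    "@type": ["CreativeWork", "VisualArtwork"],
    "name": "Mona Lisa | Leonardo da Vinci",
    "creator": {"@type": "Person", "name": "Luigi Calamatta"},
}
explicit = to_explicit_blank_nodes(nested)
```

Once resolvable entity URIs exist, the same structure makes a later move to option 3 straightforward: the blank node identifier is replaced by the entity URI and the sibling node is dropped from the payload.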
In the meantime, Option 2 (Explicit Blank Nodes) is used. This will facilitate easier adoption of option 3 at a later date.

“Duplicating” statement objects both as URIs and Labels

When separate entity URIs are provided as values for properties such as dc:creator, the question arises as to whether the label associated with these entities should be represented next to the URI of the separate entity. This would provide “more readable” data, also accommodating machine clients that cannot ‘follow their nose’ from the creator URI to the creator name in the separate (Linked Data) description, as in the following:

    {
      "@id": "http://data.europeana.eu/item/2021618/internetserver_Details_kunst_25027",
      "@type": ["CreativeWork", "VisualArtwork"],
      "name": "Mona Lisa | Leonardo da Vinci",
      "creator": ["http://data.europeana.eu/item/p12345", "Luigi Calamatta"]
    }

This approach is not recommended as it can introduce maintenance problems: changes to the (remotely accessible) Linked Data available for the entity being referenced may invalidate the name/label that has been reproduced locally.

Mapping to external resources

It is preferable wherever possible to map internal Europeana entities, and concept identifiers, to external equivalents, for example by identifying an object, organization, person, concept, etc. as schema:sameAs a Wikidata or similar identifier. To a certain extent the ability to identify such references is constrained by the amount and quality of data contributed to Europeana by its partners. However, especially when processing (enriching) data to identify entities for persons, organizations, etc., this need should be given high consideration.

1.4 URI design decisions

Choice of Identifying URIs

The identifying URI for a Europeana resource should be of the form:

    http://data.europeana.eu/..........

This differs from the current URL form for web pages:

    http://www.europeana.eu/portal/..........
This approach separates the provision of canonical Europeana identifiers from the current and future structure and implementation of the web site, which allows unambiguous knowledge to be published about each of them. Note that requests for human-readable html representations of a resource with a data URI are already redirected by Europeana services to the appropriate www-addressed page.

Data references between Web page and Resource

In line with the above decision, it is recommended that consideration be given to outputting a representation of the relationship between a resource and the web page displaying it. For example:

    {
      "@graph": [
        {
          "@id": "http://www.europeana.eu/portal/record/2021618/internetserver_Details_kunst_25027.html",
          "@type": "WebPage",
          "mainEntity": "http://data.europeana.eu/item/2021618/internetserver_Details_kunst_25027"
        },
        {
          "@id": "http://data.europeana.eu/item/2021618/internetserver_Details_kunst_25027",
          "@type": ["CreativeWork", "VisualArtwork"],
          "name": "Mona Lisa | Leonardo da Vinci",
          "creator": ["http://data.europeana.eu/item/p12345", "Luigi Calamatta"],
          "mainEntityOfPage": "http://www.europeana.eu/portal/record/2021618/internetserver_Details_kunst_25027.html"
        }
      ]
    }

Note: In this example the property schema:mainEntityOfPage is used to reference the page on which the resource is displayed. This is optional as it may be difficult to maintain, especially when a resource is displayed on more than one page, in themed portals, etc.

1.5 Data coverage recommendations

Annotations

Europeana represents annotations (such as users’ tags and links to related objects) with an extension of EDM based on the recommendations of the W3C Web Annotation Model[10]. When entities are processed for mapping to Schema.org, these annotations may provide valuable information to be shared using Schema.org properties.
For example, annotations with a ‘motivation’ of oa:commenting could be mapped to schema:comment; oa:tagging to schema:about or possibly schema:keywords; oa:linking to schema:sameAs or schema:exampleOfWork, etc. It was beyond the scope of the project to analyze the current, and potential future, coverage of annotations for comments, provenance, etc. in Europeana data. It was recommended, however, that this coverage be taken into account when building the mapping rules and processes. This recommendation carries the caveat that the processing required to do this may impact performance for on-the-fly data creation (as described in the following Technology section of this report) and therefore may have to be deferred until batch mode capabilities are available.

Description of Europeana and providers as organizations

To place Europeana resources in context on the web, it is important that they are related to rich Schema.org descriptions of associated entities, such as Europeana as an organization and its data-providing partners as organizations. To this end, it is recommended to embed a rich Schema.org description of Europeana as an organization in the http://europeana.eu page. Equivalent efforts should also be made for pages about each of the providing partners, when possible, and appropriate references to the URIs of Europeana and providers should be made in the (object) data.

Note: in the longer term, to boost the ‘Schema.org reputation’ of the Europeana organization resource, snippets of mark-up “powered by Europeana” should have Schema.org mark-up in them, referring to http://europeana.eu.

3. Technology

The purpose of defining Europeana resources using the Schema.org vocabulary is to enable external organizations in general, and search engines in particular, to consume the data into their Knowledge Graphs of resources on the web. Search engines consume such data from sites on the web as part of their standard web crawl processes.
They parse the full contents of crawled html pages, identifying structured data in Schema.org form, marked up using either Microdata, RDFa, or JSON-LD syntax[11]. The technical requirement is to provide this Schema.org data embedded within the html pages of Europeana websites without detrimentally impacting the primary purpose of those pages in supporting human interaction.

3.1 Separation of Interface Concerns

Over time, it is highly likely that the user interface design requirements of Europeana websites will need to change independently of the underlying data structures. This could be because of aesthetic changes or the introduction of themed views. Equally, as the Schema.org vocabulary and associated industry practices evolve, and as the modeling and quality of data stored by Europeana evolve, there will be a need to change the structured data embedded in a page without changing the visual representation.

A standard, and the recommended, approach to separating these needs is to ‘bolt on’ the structured data to the page construction. This technique, as implemented by OCLC on WorldCat.org, involves inserting a section containing the structured data into the page source code in a way that does not affect its visual output.

3.2 JSON-LD Output

If Schema.org output is to be ‘bolted on’ to page source, the recommended approach is to use JSON-LD format inserted into an html script tag. This format provides the smallest payload, compared with RDFa and Microdata; JSON-LD is supported by many tools, and is already used in some Europeana API services.

3.3 JSON-LD usage

Significant consumers of Schema.org, especially those consuming JSON-LD, do not share which tools they are using to parse the data. Therefore the implementation of data format within the JSON-LD should be as simple as possible and follow established guidelines (see the section on Multilingual Strings in Appendix 1).
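The ‘bolt-on’ approach of section 3.2 can be sketched as a helper that serializes a description and wraps it in a script element of type application/ld+json. This is a minimal sketch under stated assumptions: the jsonld_script_tag function is a hypothetical name, and the "@context" value is an assumption, since the examples in this article omit the context.

```python
import json


def jsonld_script_tag(data):
    """Serialize a dict as JSON-LD wrapped in an HTML script element."""
    payload = json.dumps(data, ensure_ascii=False, indent=2)
    # Escape "</" so that a string value cannot close the script element early;
    # "\/" is a valid JSON escape for "/", so the parsed data is unchanged.
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


snippet = jsonld_script_tag({
    "@context": "http://schema.org",  # assumed; not shown in the article's examples
    "@id": "http://data.europeana.eu/item/2021618/internetserver_Details_kunst_25027",
    "@type": ["CreativeWork", "VisualArtwork"],
    "name": "Mona Lisa | Leonardo da Vinci",
})
```

Because the element has no visual rendering, the returned string can be added to a page template without touching the markup that produces the human-readable view, which is the separation of concerns argued for in section 3.1.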
3.4 Dynamic Loading of JSON-LD

One technically attractive option for embedding JSON-LD is to enable the dynamic loading of the JSON-LD contents into the page. This is achieved by calling a script, potentially hosted on a different server from the one serving the web page. This script either will contain the JSON-LD code or will use JavaScript code to insert it into the <head> of the document. For example: