The Code4Lib Journal – Building a Better Book in the Browser (Using Semantic Web technologies and HTML5)

Issue 29, 2015-07-15

The library as place and service continues to be shaped by the legacy of the book. The book itself has evolved in recent years, with various technologies vying to become the next dominant book form. In this article, we discuss the design and development of our prototype software from Montana State University (MSU) Library for presenting books inside of web browsers. The article outlines the contextual background and technological potential for publishing traditional book content through the web using open standards. Our prototype demonstrates the application of HTML5, structured data with RDFa and Schema.org markup, linked data components using JSON-LD, and an API-driven data model. We examine how this open web model impacts discovery, reading analytics, eBook production, and machine-readability for libraries considering how to unite software development and publishing.

by Jason A. Clark and Scott W. H. Young

Introduction and Background

The library as place and service continues to be shaped by the legacy of the book. The book itself has evolved quickly in recent years, with various technologies vying to become the next dominant book form. Identifying an effective and popular next step for the book has been an ongoing challenge for publishers, libraries, and content creators, all of whom have focused significant resources on developing new models for creating, publishing, and accessing book content. An overall view of this landscape is provided by the Pew Research Center, whose survey data shows that e-reading continues to rise [1], e-reading device ownership continues to rise [2], and mobile device and tablet ownership continues to rise [3].
Within this context, many examples from across book publishing attest to the level of attention and resources now dedicated to developing new models for publishing. Representing the efforts of content creators are services such as Editorially, a collaborative web-based platform for both creation and publication (since acquired by Vox Media) [4]. PressBooks is a similar publication service, with the added ability to publish to various formats, including PDF, Mobi, EPUB, and a fully web-based format that PressBooks calls a “webbook” [5]. Leanpub is a self-publishing tool that blurs the line between writer and publisher [6]. Other more grassroots efforts are seen in The People’s E-Book, a successful Kickstarter project that promises to “make e-books better” [7], and in FuturePress, an open source HTML5 and JavaScript EPUB 3 reader for the web [8]. Project GITenberg is a new and fast-growing collection of over 43,000 free, open, collaborative, and trackable ebooks built and distributed through the Git version control system [9].

In addition to e-book tools and services, recent conferences, library training, and specific employment opportunities have explored the publishability and accessibility of book content. The international “If Book Then” conference brings together writers, publishers, technologists, scholars, journalists, and others to imagine what happens when “we really jump away from the boundaries of the physical (and even digital) book” [10].
The New York Public Library hosted the Open Book Hack Week in December 2013, an event designed “to help us imagine the future of digital books, and advance the open source and open API building blocks needed for the diverse ecosystem of authors, designers, developers, publishers, libraries, booksellers, and readers.” The NYPL has positioned itself as a leader in developing ebooks for libraries, led by its NYPL Labs digital division (a recent NYPL Labs job advertisement for the position of Lead Systems Architect/Engineer included, “If you’re game to help re-imagine the public library eBook experience, then make that vision real (and scaleable), then we want you”) [11]. The Digital Public Library of America (DPLA) has announced a partnership with President Obama to provide children with greater access to eBooks [12], and the second annual DPLAfest conference this year included a multi-day workshop to explore the future of ebook publishing [13]. This group identified a 10-part “ebook stack” to help frame discussion around possible shared services at a national network scale. The ebook stack includes reader interfaces, discovery systems, and assessment feedback mechanisms. Wake Forest University Libraries is pursuing a related initiative with its mini-MOOC series, “ZSRx: Digital Publishing.” With this online learning program Wake Forest promises to explore the “past, present, and future perfect tenses of e-books, self-publishing, and the digital publishing landscape” [14]. Other universities are experimenting with alternative modes of open publishing, such as SUNY’s Open Textbook initiatives [15].

Questions of information access, ease-of-use, and content quality are of vital importance not only to libraries and publishers, but to the public as well. Discussion in public forums continues to emphasize, and in some corners even agonize over, publishing and the state of the book.
A recent posting on the Chronicle blog network warns of “celebrating booklessness” in libraries, as such a development will damage libraries and librarians over the long term [16]. Other analyses identify the evolution of technology and book publishing not as a sign of the end times, but simply as the driver of a new form of reading. A feature published on the website of Random House of Canada remarks, “People aren’t reading less, just differently. Technology has innovated” [17]. The feature offers a quote from artist Tan Lin: “People forget that a book or codex is a technology.” Indeed, discussion within publishing and libraries regarding this topic too often focuses on the loss of our primary artifact, the book, while neglecting new forms of publishing enabled by today’s primary and most powerful information technology, the web. For more than two decades the web has enabled alternative forms of publishing, and now newer web technologies and publishing services have been introduced that promise even further advancement of content publishing. Widespread support of digital publishing, however, is not a given. Criticism of book publishing in a digital environment has been recently expressed with some force by UCLA professor Johanna Drucker, who approaches digital publishing with marked apprehension in an essay titled, “Pixel Dust: Illusions of Innovation in Scholarly Publishing” [18]. Drucker maintains a strong skepticism of digital platforms as a cure for the crisis of publishing. She notes that most phases of the complex publishing cycle—acquisition, editing, reviewing, fact-checking, design, promotion, and distribution—remain in place in the digital environment, and that only the final stage—the form of production—experiences a foundational shift. In the context of the web, how viable is book publishing? How is content acquired and vetted? What is the sustainability model for production? What technologies underpin open web publishing? 
These questions are fundamental to our own project at the Montana State University (MSU) Library. Responding to these questions and to the need for a new book publishing model, we at the MSU Library are exploring new possibilities for publishing through the web with open standards. We were particularly interested in how open standards could move book content outside of DRM restrictions and how open markup standards could broaden access to book content for anyone with a web browser. We also speculated that open standards could simplify the archiving of digital books: text and simple images can be folded into existing web archiving routines, which would let us move away from the emulation and software dependencies that PDF and proprietary ebook formats impose on long-term preservation of book content.

The initial “book objects” for our MSU Library prototypes represent two distinct categories of non-fiction and fiction: (1) a cookbook with essays for an MSU undergraduate history course, and (2) a student literary journal featuring images, prose, and poetry created by MSU students. In the pipeline are two additional academic press formats: a textbook for an undergraduate statistics course and an academic journal featuring the mountain science of the region [19]. One of our goals was to test the software across multiple textual formats and see how broadly our book model could be applied.

Our overall approach to this project is outlined within a wider framework of machine-readable, structured, semantic data (Arlitsch et al. 2014). Our project utilizes a new method for publishing within a web browser using HTML5, structured data with RDFa and Schema.org markup, linked data components using JSON-LD, and an API-driven data model [20]. These technologies together unlock the book by transforming its content into a semantic, machine-readable, and extensible platform.
Drucker in fact reserves her limited praise of digital publishing for this particular aspect: “the most exciting and innovative aspects of digital presentation are the ways in which structured data—texts with humanly-embedded organization—can be searched and analyzed” [21]. We agree with this viewpoint, and have begun this work partly to demonstrate the capabilities enabled by publishing semantically structured book data within a modern web browser.

In this article, we will outline the application and benefits of RDFa, Schema.org, and linked data models for book production. We will detail the structured data model that can turn book content into API-enabled webpages. We will also analyze the effects of this web publishing practice for machine-understanding, Search Engine Optimization (SEO), and User Experience (UX). Finally, we will discuss the advantages and disadvantages of this model. This web book model can function as a demonstration of a real-world application of semantic structured data. With this project we aim to tie together three threads within the context of the evolving library: open web publishing, software development, and the book.

Structured Data

At first thought, the idea that publishing would be tied to forms of markup seems at odds with the present. Modern writing software relies on WYSIWYG interfaces (think of Microsoft Word or the Google Docs interface). However, early word processors required special tags to create text that was bold, paragraphed, paginated, or styled in countless other ways. Even more recent textual markup standards, such as SGML and eventually DocBook, worked by requiring strict declarative markup that described a document’s structure and attributes. It is interesting to think how these previous markup structures informed the semantics of the written word. These early notions of structured data were an inspiration as we started to think about the value of structured data for the book.
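As a rough illustration of what machine-readable book data can look like, the following Python sketch assembles a Schema.org Book description and serializes it as JSON-LD. It is not the prototype's code: the function name book_jsonld, the book title, the author name, and the example.org URL are hypothetical placeholders chosen for illustration. Only the Schema.org types and properties used (Book, Person, Organization, name, author, publisher, url) come from the public vocabulary.

```python
import json


def book_jsonld(name, authors, publisher, url):
    """Return a Schema.org Book description serialized as a JSON-LD string.

    Hypothetical helper for illustration; not taken from the MSU prototype.
    """
    doc = {
        "@context": "https://schema.org",
        "@type": "Book",
        "name": name,
        # Each author becomes a nested Person node.
        "author": [{"@type": "Person", "name": a} for a in authors],
        "publisher": {"@type": "Organization", "name": publisher},
        "url": url,
    }
    return json.dumps(doc, indent=2)


# All values below are placeholders, not real catalog data.
example = book_jsonld(
    name="An Example Cookbook",
    authors=["A. Student"],
    publisher="Montana State University Library",
    url="https://example.org/books/cookbook/",
)
print(example)
```

A string like this could be embedded in a page inside a script element of type application/ld+json, which is one common way to expose the same facts to crawlers that the visible text presents to human readers.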
For our web book prototypes we chose to build with HTML5 and embedded RDFa, coupled with the web-scale controlled vocabulary of Schema.org. Our first goal was to create markup that was rich and nuanced so that it might have its own semantic value evident to any user who viewed the source code on the page. We began by building the components of the book using EPUB 3.0 HTML [22]. This allowed us to break the book into pieces and assign common divisions that you might see for book content. For example, we were able to give our book cover (index.html) some common types such as: