The Code4Lib Journal
Issue 25, 2014-07-21

Getting What We Paid for: a Script to Verify Full Access to E-Resources

Libraries regularly pay for packages of e-resources containing hundreds to thousands of individual titles. Ideally, library patrons could access the full content of all titles in such packages. In reality, library staff and patrons inevitably stumble across inaccessible titles, but no library has the resources to manually verify full access to all titles, and basic URL checkers cannot check for access. This article describes the E-Resource Access Checker, a script that automates the verification of full access. With the Access Checker, library staff can identify all inaccessible titles in a package and bring these problems to content providers’ attention to ensure we get what we pay for.

By Kristina M. Spurgin, E-Resources Cataloger, University of North Carolina at Chapel Hill

Introduction

Libraries have historically exerted significant control over their print (and other tangible format) collections. Traditionally, libraries have had print workflows involving checking in the individual physical items purchased to ensure all ordered materials were received; cataloging each item to provide intellectual access to the material via the library catalog; physical processing (affixing spine labels, barcodes, security devices) to prepare the material for library use; [1] and shelving each item somewhere in the library to provide physical access to the material.

Our workflows for processing electronic (e-) resources tend in part to mirror the ones we use for processing print, but the scales are vastly different. Most libraries that provide access to e-resources obtain many of them by purchasing or subscribing to e-resource packages.
Each package may contain hundreds (and sometimes hundreds of thousands) of titles: the individual e-books, online journals, streaming media, online datasets, and other electronic resources we hope our patrons want. [2] When purchasing or subscribing to e-resource packages, libraries cede much of their traditional control to content providers, mainly due to the size of e-resources packages and the growing need for libraries to do more with less.

To give a sense of the scale we are talking about, here is a snapshot of what we are dealing with at my library. As the E-Resources Cataloger at the University of North Carolina at Chapel Hill Libraries, [3] I keep track of about 700 separate e-resources packages. [4] These packages were represented by about 1,250,000 records in our catalog in our last yearly catalog record count-by-package, conducted in July 2013. The smallest packages had only two associated records; the largest package had 312,464. In the first four months of 2014, I loaded 112 separate batches of catalog records for 36 different packages of e-books, e-journals, streaming media, and web resources, for a total of 102,252 new catalog records added. This means my two-person section has been responsible for acquiring, checking, batch editing, and adding to our catalog over 25,000 MARC records per month on average this year.

Smaller libraries may have smaller numbers of e-resources to handle, but they will also likely have fewer people to work on e-resources management. Perhaps one person will be tasked with handling the purchasing, licensing, cataloging, and troubleshooting of e-resources on top of other job duties. In no library does anyone have the ability to manage e-resource packages at the individual title level. And yet, our patrons are looking for those specific titles relevant to their needs and interests.
Workflows for processing electronic resources

Checking in the contents of an e-resource package entails verifying the receipt of every individual title included in the package based on a title list made available by the content provider. In the context of e-resources, “to receive” a title means that the content provider has given us (and our patrons) full access to the online content of that title: we can read the whole book, watch the entire film, etc. Unlike with print resources, we cannot manually check that we have “received” each of hundreds or thousands of titles included in a package. Usually we look at a few of the titles we know are part of a package to verify that the provider has activated our access to that package and then move on. We trust that the content provider is indeed giving us full access to all the titles we have paid for.

Providing intellectual access to the contents of e-resources packages still happens via the library catalog. [5] We typically rely on content providers or other third-party vendors to provide batch files of catalog metadata for the individual contents of e-resource packages. Usually we deal with multiple batches of metadata for a given package as metadata are gradually created and/or new titles are added to a package. Unlike with print materials, we do not have the resources to compare each metadata record with the item it describes. We trust that we will eventually receive records for all of the titles included in a package and that we are not receiving records for titles our patrons won’t be able to access.

Shelving e-resources obviously does not happen literally, but we do rely on content providers to host the individual titles within packages so they are accessible to our patrons. The content providers maintain the shelves, as it were. We trust they will arrange the titles in some sensible manner and maintain consistent arrangement of those titles.
Whatever link resolving and redirecting content providers implement, we trust that the URLs we are given in title lists and metadata records will continue to point our patrons to the full content of titles, much as call numbers point patrons to specific books on shelves.

Finding e-resource access problems

The scale at which libraries routinely purchase e-resource packages and the very nature of electronic resources require that we shift much of the responsibility for processing resources onto content providers. This also largely precludes us from checking over the results as thoroughly as we do with print resources. For our own sanity, we often have to assume that content providers have processed things correctly and that the resources we’re paying for work. However, we know from experience that this is not always the case. Sometimes we stumble over problems ourselves; other times we find out about access problems from patrons. When a properly authenticated patron reports that they have hit a paywall or other “You do not have access to this content” barrier within a resource we’ve paid for, it puts us in a reactive position. Content providers usually resolve known-and-reported access problems quickly, but we are still nagged by the suspicion that what we’re aware of is only the tip of the iceberg.

Trust alone is clearly not a viable option: we need a way to verify that we actually have access to everything we’ve purchased. This is why I’ve developed the E-Resource Access Checker, a Ruby script that automates checking availability of full access to electronic resources. It allows us to take a more proactive approach to finding access problems at each stage of our e-resource workflow.

First, the Access Checker provides an automated method for verifying full access to titles within a package based on a title list containing the URL for each individual title. In this sense, it can be conceptualized as a tool for checking in the contents of e-resource packages.
Note that it is called an “Access Checker” because it goes beyond simply checking URLs, which would not reveal access problems per se. This difference is described more fully below.

Second, it can help resolve some metadata problems. While it can’t find missing metadata records, it can identify individual records within a batch that link to resources lacking full access. It can also flag records containing invalid URLs or DOI-based URLs that fail to resolve.

Third, it can help find problems for resources that are already “on the shelf,” so to speak. The nature of e-resources means that access problems may arise at any time, even if we verify that we have full access to all titles in a package at the time of purchase. The content provider may have server trouble. Errors may be introduced in our proxy or IP configurations. Titles may be removed from packages, even if we have purchased “perpetual access,” and some providers are more proactive than others about notifying libraries about such withdrawals. If we have catalog records linking to titles in such situations, these titles are effectively missing from the shelves. Via a kind of virtual shelf-reading of a particular package’s contents, the Access Checker can help determine how widespread access problems are for that package. It can also help identify which existing catalog records may need to be suppressed from view or deleted from the catalog because the “volumes” are now missing.

Verifying full access vs. verifying URLs

Having full access to an e-resource means you can use that resource in its entirety: e.g., you can read the whole book or watch the whole video. If you can only read the table of contents and the first chapter of an e-book, you don’t have full access to that e-book.
If the URL in a catalog record takes you to a DOI-resolution error or a “No item found” message within the content provider’s site, you effectively do not have access to that e-book unless you think to search the provider’s site for the title (and know how to do so).

Note that all of the problematic situations described above involve the patron being pointed to valid Web resources. For this reason, integrated library system-based or stand-alone URL checkers do not solve the problem addressed by the Access Checker. These URL checkers mainly report whether a URL points to a valid Web resource or not based on the HTTP status code the server returns (404, 502, etc.). But e-resource content providers rarely (if ever) use HTTP status codes to indicate whether or not an e-resource is accessible. When they provide restricted-access messages about their e-resources, they generally send them with a 200 (OK) status, so traditional URL checkers fail to report them as problematic.

Furthermore, a URL checker might report that a URL redirects to another URL, but it won’t actually follow the redirect. With e-resources, we expect that most URLs from title lists or in MARC records will redirect to a different URL on the content provider’s platform, so telling us that a URL redirects is uninformative. We need to know what content is found at the end of the redirect chain.

In 2013, the Code4Lib Journal published an article describing a tool called Normac, which includes an access checker that works on the same principles as the one described here. [6] This was the first I had heard of any other solution for automated access verification, and I had developed the first versions of my Access Checker in late 2010. Normac’s access checker is part of a more complex tool to automate e-resource metadata batch processing, and requires users to configure their own access checking logic profiles.
The Access Checker I’ve developed is a simpler solution that staff at any library can use to verify access to their e-resources (on the supported platforms) without any configuration. The downside to this is that users of this Access Checker who want to check access on platforms not currently supported need either a) to edit their version of the script locally; b) to contribute changes to the script via GitHub; or c) to ask me to add support for the desired platform (and wait for me to make that happen).

How the E-Resource Access Checker works

On many e-resource platforms, the descriptive landing page for an e-resource to which a patron has full access includes a clear indication of its accessibility: a green check mark, for instance, or a text string such as “Access to this e-book is provided by…” or “Full text.” This access signifier is different on each content-provider platform, and sometimes even differs between individual packages offered by the same content provider. The presence of the access signifier (or the string representing it in the HTML source) is what the Access Checker pays attention to.

However, verifying access isn’t always a binary decision, and it isn’t always as clear as affirming the presence of a positive indicator of full access. On some platforms, there is no positive access signifier; in these cases, the absence of an error or “You do not have access” message is assumed to indicate full access.

In some packages, we see multiple types of access problems. SpringerLink is a good example of this. Typical access problems on this platform fall into four categories:

1. A URL points to the e-book title on SpringerLink, but access is restricted (only the table of contents and front matter are available).
2. A URL points to a “not found” page within SpringerLink (which returns a 200 (OK) HTTP status code).
3. A DOI-based URL does not resolve and points to a “DOI Not Found” page.
4. A DOI-based URL resolves to the e-book on a partner publisher’s site, where our successfully authenticated users are asked to purchase the book in order to view its contents.

Resolving each of these types of access problem requires different steps, so we need to know which category a given problem falls into.

The E-Resource Access Checker is a simple JRuby script [7] that automates checking the access status of individual e-resource titles. You run it from the command line, passing the input and output files as parameters:

    > jruby access_checker.rb urls_to_check.csv access_results.csv

The Access Checker takes input from a .csv file, which may contain any number of columns (title and record ID are commonly included). The last column must contain one URL associated with the title, [8] and all the URLs in the file should be on the same content-provider platform. The title lists obtained from content providers are usually spreadsheets that are easily saved as .csv files for use as Access Checker input. Bibliographic data and URLs can also be extracted or exported from MARC records, using MarcEdit or some integrated library systems, in a spreadsheet-like format that is easily converted to .csv.

When a user starts the script, she is presented with a list of platforms supported by the Access Checker and must type which platform she wants to check for access. The script checks the HTML source code that each URL points to for the appropriate access-signifier string(s) for the specified platform. If a string matches, the corresponding access result is returned. If no match is found for any of that platform’s access signifiers, the script returns the result: “Check access manually.” Details on the access signifiers used for the platforms currently supported by the Access Checker are found in the table below.

While the Access Checker is running, the result for each URL is printed to the screen so the user can get a general sense of the progress and the severity of the access problems.
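The string-matching approach described above can be sketched in a few lines of Ruby. This is a minimal illustration, not the Access Checker's actual source: the `SIGNIFIERS` structure and the `access_result` and `check_file` names are hypothetical, the order in which strings are tested is an assumption, and page retrieval is left to a caller-supplied `fetch_html` callable (a real run would need one that authenticates and follows redirects to the final landing page). The signifier strings and result labels are the ones the article lists for Ebrary and EBSCOhost.

```ruby
require 'csv'

# Signifier strings and result labels as listed in the article's table.
# The hash layout and the test order (problem strings first) are assumptions.
SIGNIFIERS = {
  'ebrary' => [
    ['Document Unavailable.', 'No access'],
    ['Date Published', 'Full access']
  ],
  'ebscohost' => [
    ['class="std-warning-text">No results', 'No access'],
    ['eBook Full Text', 'Full access']
  ]
}.freeze

# Return the result for the first signifier string found in the page source,
# or the fallback result when none of the platform's strings match.
def access_result(html, platform)
  SIGNIFIERS.fetch(platform).each do |string, result|
    return result if html.include?(string)
  end
  'Check access manually'
end

# Read each input row, take the URL from the last column, print the result,
# and append the row plus a new final "access" value to the output file.
# fetch_html is any callable that maps a URL to the page's HTML source.
def check_file(input_csv, output_csv, platform, fetch_html)
  CSV.open(output_csv, 'a') do |out|
    CSV.foreach(input_csv) do |row|
      url = row.last
      result = access_result(fetch_html.call(url), platform)
      puts "#{url}: #{result}"
      out << (row + [result])
    end
  end
end
```

Keeping the signifiers in a per-platform table means that supporting a new platform, in a sketch like this, amounts to adding one more entry rather than changing the matching logic.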
The access result for each URL is also appended to the output .csv file. The output file is equivalent to the input file, with a new “access” column added to the end.

We have been using the Access Checker heavily at UNC Chapel Hill since late 2010. It has been used successfully by at least three other institutions. Access checking for EBSCOhost ebooks was added at the request of a colleague at one of these institutions just before UNC began purchasing titles on that platform.

Platform | String to Match | Access Result Returned if Matched
Apabi | type="onlineread" | Access probably ok
Alexander Street Press | Page Not Found | Page not found
Alexander Street Press | error | Error returned
Alexander Street Press | Browse | Full access
Duke University Press (Highwire) | DOI Not Found | DOI error
Duke University Press (Highwire) | Log in to the e-Duke Books Scholarly Collection site | No access
Duke University Press (Highwire) | t-page-nav-arrows | Full access
Ebrary | Document Unavailable. | No access
Ebrary | Date Published | Full access
EBSCOhost | class="std-warning-text">No results | No access
EBSCOhost | eBook Full Text | Full access
ScienceDirect