The Code4Lib Journal, Issue 58, 2023-12-04

Extra Editorial: On the Release of Patron Data in Issue 58 of Code4Lib Journal

We, the editors of the Code4Lib Journal, sincerely apologize for the recent incident in which personally identifiable information (PII) was released through the publication of an article in Issue 58. The article "Bringing It All Together: Data from Everywhere to Build Dashboards" linked to two Microsoft Power BI files containing circulation data.

Timeline

This is a summary of events:

On Monday, Dec 4, 2023, 11:28 pm EST, the article was published as part of Issue 58 of the Code4Lib Journal.

On Tuesday, Dec 5, 2023, 02:02 pm EST, a concerned reader emailed the Code4Lib Journal to flag the inclusion of the files, using the email address that appears on the journal's website. Unfortunately, unknown to the current editors, that email account was not operational and, hence, no action was taken.

On Tuesday, Dec 5, 2023, 02:40 pm EST, the author of the paper contacted its editor about the files; the editor removed the links to the files from the article at 02:49 pm EST.

On Tuesday, Dec 5, 2023, 03:09 pm EST, the author of the paper asked its editor to remove the files from the server; the editor, who did not have email access at that time, removed them at 04:15 pm EST.

On Friday, Dec 8, 2023, around 11:40 am EST, another editor informed the editorial board about an open letter about the incident, which was announced on Mastodon. At the time of writing, the letter has 161 signatories.

Statistics

During the time the files were online, they were accessed from 7 different IP addresses, with several accesses coming from the same IP addresses:

AlmaBibliographic Holdings Data.pbix
2023-12-04: 4 successful accesses (GETs with return code 200, and no failed access attempts)
2023-12-05: 6 successful accesses (GETs with return code 200), 3 failed attempts (GETs with return code 404)

AlmaCirculation Statistics.pbix
2023-12-04: 5 successful accesses (GETs with return code 200, and no failed access attempts)
2023-12-05: 19 successful accesses (18 GETs, 1 HEAD, with return code 200), 3 failed attempts (2 HEADs, 1 GET, with return code 404)

The files were found not to have been cached by Google, Bing, Yandex, Yahoo, Ecosia, DuckDuckGo, or the Internet Archive.

Context

The released files were in a proprietary file format, Microsoft Power BI, with which none of the editors have experience. Since the article did not describe the actual content of the files, there was no immediate reason to believe they would contain PII. This was an erroneous assumption that the Code4Lib editors take full responsibility for.

The current editors were also unaware that the email account listed on the Code4Lib Journal website was not operational, which slowed the notification of the editorial board and caused the files to remain online for a longer period of time. This is another error that the editors take full responsibility for. The editorial board has since re-established access to this email address.

Going Forward

We are determined to take greater measures to prevent similar incidents from occurring in future journal issues by improving the editorial process. Code4Lib operates without a budget, and with volunteer editors. This means that fully addressing the feedback we have received in a responsible and sustainable way will take time.
Effective immediately, and until we can establish policies and procedures that better safeguard personal information, the Code4Lib Journal will not accept or publish papers that utilize individuals' personal data. We will describe this change on the journal's website and in the call for papers.

We invite colleagues who are knowledgeable in establishing relevant policies and procedures to support the Code4Lib Journal by using their expertise to recommend sustainable guidelines that are informed by existing best practice, either independently or in the form of a journal article. We are grateful to all of those who worked to raise this important issue and look forward to collaborating with the community on best practices going forward.

Monday, February 5, 2024

The Code4Lib Journal, Issue 43, 2019-02-14

Editorial: Just Enough of a Shared Vision

What makes a vibrant community? A shared vision! When we live into a shared vision, we can accomplish big goals even when our motivations are not completely aligned.

By Peter Murray, Issue 43 coordinating editor

I love my job, and I love this profession. That is, I get excited about my job as the open source community advocate at Index Data, and I am an eager participant in the library technology profession because we freely share our knowledge and experiences. I will take as a given that I have both a myopic view and rose-colored glasses. [1] (Myopic because there may be other professions that have similar inclinations to share expertise and experiences; if so, I'd love to compare notes! Rose-colored... well, you can be the judge.) Let me explain.

In my day job working on the FOLIO project, I alternate between astonishment at how well the community works towards the same goal and unease at not knowing how and why it works the way it does. (Or worse: that one misstep could bring the project's community crashing down.) This isn't to say that the project doesn't have rough edges; there have been disagreements over priorities, technical approaches, and other concerns. Instead, the community seems to get stronger as it works through those disagreements and adds more partners.

The same can be said about the Code4Lib community. Sixteen years ago some colleagues got together to share their expertise and experiences, first on a mailing list, then an IRC channel. Out of that sprang national meetings (America and Japan that I know of), numerous regional meetings, this journal, and a Slack team. (Did I miss anything?) We've been through our struggles, such as figuring out whether we've grown into needing a fiscal agent and who that fiscal agent would be. People have left the community, and new people have joined.

Why does this work? Can preconditions be set to strengthen the chance that a community will succeed? Much has been written and talked about best practices for healthy communities, and I want to add to that body of thought. So here is my take; let's call it the "just enough of a shared vision" theory.
The Just-Enough-of-a-Shared-Vision Theory

I think there is a crucial need for a common understanding of what a community is about. This common understanding needs to be ingrained so deeply in the community that the participants are guided by this shared vision when they are not consciously thinking about it. Further, there needs to be a close alignment of one's personal goals, one's organization's goals, and the community's goals. Or, put another way, all three (the person, the organization, and the community) are getting benefits for the effort.

What might this look like? One year's conference organizers generously share their knowledge with the next year's organizers. A beginner's question on a mailing list is answered by half a dozen people with personal stories of hard-won wisdom. And yes, a new author is inspired by previous writings in the Code4Lib Journal and wants to share their own experiences. (Or become a volunteer editor for the journal!)

Viewed this way, we might be able to work out how to create a shared vision for a vibrant community. I think it comes in three parts.

Openness to sharing the vision. This openness takes the form of community members being willing to live into the community's vision and an innate acceptance of those who say they are living into the same vision.

Openness to being wrong. While sharing the vision, the community members know that the community is not perfect; that there are misunderstandings, blind spots, and inadequacies.

Openness to new ideas. The building of the community is never done; it is a journey of experiments in doing things better and learning from each other.

The community's vision is attractive to people and organizations. People grow in experience and personal connections. Library patrons are better off through services improved by that experience and those personal connections. Organizations take from the community in proportion to what they give to the community. And the community moves forward.

In a community consisting of libraries and non-profits, I think much of this comes naturally. When commercial ventures are added to the mix, community members can wonder about motivations. Is the company going to put into the community in proportion to the benefit it receives? It is here that we lean on the "just enough" part of the theory. The goals of the community and of the organizations involved do not need to align perfectly, and they probably never will. But there needs to be close enough alignment, and openness in communication, so that the rest of the community members understand the alignment. This means that decisions are made in such a way that the goals of a participating organization are met and the goals of the community are met. And when decisions are out of balance between the organization and the community, the community has an instinctive reaction to guide the process back to balance. I get that while this is easy to say, it is hard in practice and in specific circumstances. If the shared vision is strong enough and inclusive enough, though: wow, that is a place I'd love to be.

Introduction to Issue 43

Issue 43 has seven articles.

Developing Weeding Protocols for Born Digital Collections, authored by Athina Livanos-Propst and edited by Junior Tidal: when you have 100,000 resources, how do you construct a sensible way to evaluate the quality of your collection? This article describes one way to approach the challenge.
Content Dissemination from Small-Scale Museum and Archival Collections, authored by Avgoustinos Avgousti, Georgios Papaioannou, and Feliz Ribeiro Gouveia, and edited by Sara Amato: extending the description of specialized collections; historic coins, in this case.

Never Best Practices: Born-Digital Audiovisual Preservation, authored by Julia Kim, Rebecca Fraimow, and Erica Titkemeyer, and edited by Rebecca Hirsch: three case studies of how three libraries with different needs and goals approach digital preservation.

SCOPE: A Digital Archives Access Interface, authored by Kelly Stewart and Stefana Breitwieser, and edited by Eric Hanson: this collaboration extends the description of disparate digital objects.

Making the Move to Open Journal Systems 3, authored by Mariya Maistrovskaya and Kaitlin Newson, and edited by Ron Peterson: the University of Toronto Libraries had a tall task to update their hosted OJS installations, and in this article they describe how they accomplished it.

Improving the Discoverability and Web Impact of Open Repositories, authored by George Macgregor and edited by Péter Király: the University of Strathclyde at Glasgow tests changes to their EPrints site to improve search engine optimization and offers suggestions for any site looking to improve repository visibility.

A Systematic Approach to Collecting Student Work, authored by Janina Mueller and edited by Ron Peterson: the Harvard University Graduate School of Design describes the technical and social issues behind their efforts to archive student works out of their learning management system.

Inspired by something you see here? Please consider submitting an idea for an article to the Code4Lib Journal.

Thank You, Carol

In the course of putting together this issue, Code4Lib Journal editor Carol Bean finished retiring from the journal. (Andrew said goodbye in his editorial in the last issue, but it has taken this long to complete the process!) Carol volunteered as an editor for the journal's first issue, and over the course of the next 42 issues she eagerly offered her insights and keen copyeditor's eye. Over the last few months she has transferred her knowledge and her responsibilities to other members of the editorial committee, and with a full-throated voice we say "thank you, Carol!"

Endnotes

[1] "Rose-colored glasses": an optimistic perception of something; a positive opinion; seeing something in a positive way, often thinking of it as better than it actually is. (See the Wiktionary definition.) I'm also grateful for colleagues like Filip who make me think of broader geographic and cultural communities, and so I now think about including explanations of idioms when I use them.

The Code4Lib Journal, Issue 1, 2007-12-17

Column: 700 Dollars and a Dream: Take a Chance on Koha, There's Very Little to Lose

I truly believe that the meekest amongst us has a special duty and a special circumstance that fosters innovation.
Ours is not the culture of red-tape-entrenched tradition, but rather the atmosphere of the pioneer. No one will notice a failed experiment in the middle of nowhere, but they'll certainly notice a cataloguer someplace in Edema making a dent in backwards standards.

By BWS Johnson

One of the sentiments that I try to infect the library science populace with is the notion that innovation can come from anywhere at all. This is especially true of rural libraries. I truly believe that the meekest amongst us has a special duty and a special circumstance that fosters innovation. Ours is not the culture of red-tape-entrenched tradition, but rather the atmosphere of the pioneer. No one will notice a failed experiment in the middle of nowhere, but they'll certainly notice a cataloguer someplace in Edema making a dent in backwards standards.

So much of this field is not about the money, and technology is definitely in that basket, particularly with open source making a fierce showing of things. This was the argument I took to my board: let me take a small portion of our state aid to public libraries money and try out this newfangled thing. The software's free. Yes, really, we don't pay anything for it; I just go out and grab it. Nope, it won't be more than $1,000, so we'll still have emergency money. If it doesn't work out, I can just reformat the drive, make it a regular old public access terminal, and we're good to go. Nope, it's not stealing. Everyone's got something to give back to the community, and when I set things up and figure out what does what, I'll write about it, and that'll be our share. It's like a barn raising.

We had nothing to lose: our library wasn't automated yet. Ours was the perfect test environment. My board was receptive, my patrons weren't attached to any system whatsoever, and my staff were behind the move. My loving husband was willing to install this for us, as well as physically assemble a custom server, although that's not necessary; Koha will run on just about everything. As with any other product, the more you put into your hardware, the better the result, until you top out at a certain point. On Koha, that point is frighteningly low, making my $700 box a Ferrari Testarossa. With the proliferation of Linux user groups out there, it oughtn't be too hard for just about anyone to approach the geeks that be and walk away with a functional server in a matter of a few hours.

We're mostly done with bibliographic input now; we've got just over 7,000 items catalogued of about 8,500. Our patrons are in the database. We could circulate now if we wanted to. We've tested the basic features that we need, and they work well enough for our purposes. When we started out, we were looking for the barest minimum of functionality. We got a whole lot more than we bargained for.

Koha is far more reliable than many commercial ILS options. This was certainly a factor with me. It seemed as though things would be down every other month for a few days of unscheduled time with a few of the commercial products I've had the displeasure of experiencing. Our server has been down twice in about 3 years of testing, with the box running 24/7. Once was when my roommate inadvertently unplugged the server to charge his mobile phone. The second time was a catastrophic hardware failure: the power supply essentially caught fire. I was terribly worried my data was toast. It wasn't. I had backups, but I didn't need to use them.

Koha is far better at keyword searching than anything I've ever seen.
Something in the way it ranks search results really ends up giving you highly relevant items first. It also loots and pillages its way through a MARC record, so that those notes fields everyone tires over are searched through, too.

The support is astounding. I have yet to pay money for support, yet I've had developers bend over backwards to program in a feature I've wanted, in a remarkably short span of time. A basic reports module came to me free of charge inside of a couple of days from across the ocean in France. An IRC channel dedicated to Koha tends to have someone on it most of the time. With heavily involved developers in the United States, France, and New Zealand, the project doesn't sleep. I can't imagine the results you would see if you had a few thousand dollars to give to a developer for your feature. At a recent demo, I was eating lunch and chatting with the other librarians about which developer was responsible for what feature. One of the other librarians stopped me and said of their product, "Wow, this is so great. You know the names of the developers. We're lucky to even get through to support on the telephone!"

With Koha you get something you don't get with any other product. You have compleat control over what your catalogue looks like, you don't have to wrestle with a vendor to get your data to do what you want it to do, and if the product doesn't have a feature you need, you can programme it or pay someone to add it.

The rate of development and improvement over the past few years has been nothing short of astounding. When I started using Koha, it was very wooden and very ugly. It's come a long way since then. The current out-of-the-box release is on par with at least a handful of commercial products. When the templates are customised for a given library, the product can meld seamlessly and aesthetically with a library's website. The Horowhenua Library Trust catalogue can give you a taste of the aesthetics (http://www.library.org.nz/cgi-bin/koha/opac-main.pl). The upcoming version 3 looks quite like the Athens County catalogue (http://search.athenscounty.lib.oh.us/).

Since Koha was developed in New Zealand, connectivity issues caused the developers to make a product that would be very easy to access regardless of the speed of a person's connection. I was able to access the catalogue, which resided at home in Albany, New York, from my library in western Massachusetts with no noticeable wait time for searching and data input over an incredibly crummy connection. (It was allegedly a 56k connection, but plain old dial-up telephone line connections routinely ran faster.)

It's not for everyone, however. Installation is still difficult. Unless you've someone in the area who is very comfortable with Linux administration, this project will be a difficult setup. On the other hand, one can pay for a preinstalled box. Cataloguing for a large institution would be tough. Holdings information is a bit bodged at the moment. The cataloguing module is certainly clunky to use. The interface is tabbed, with each MARC field getting its own text box. As a result, either a librarian ends up sticking all of the fields in one tab for a really long screen of many, many boxes, or fields are missed by sloppy cataloguers who don't switch tabs. It is possible to set up frameworks that anticipate necessary fields for a given material type, but this entails a good deal of planning during setup.
The good news in this department is that, thanks to Google Summer of Code, a powerful new tool is being worked on to make things much nicer for cataloguers everywhere, and functionality should be vastly improved with version 3. Reports are also getting a massive workover thanks to sponsorship from the British National Health Service. These can be tricky from a programmer's perspective, since each client wants a different data set. The new module will guide a user through the process of selecting which sets they'd like in order to produce the table or chart they'd like to pull from the raw data.

Because Koha came from the mind of a computer programmer, there are creature comforts that librarians take for granted that could be absent or less fleshed out than one might like. Increasingly, this is less true as the developers address new feature requests and the project gathers fans, and thus steam, along the way. The positive side of this is that it rapidly assimilates neat new Web 2.0 innovations; for instance, tag clouds are going to be featured in the new OPAC.

Like everything out there, there are bugs. Developers do work to keep these to a minimum, but I don't want anyone to think I promised perfection. Users are encouraged, and yea, even thanked when they submit problems to the project's bug tracker, Bugzilla (http://bugs.koha.org/cgi-bin/bugzilla/index.cgi). It's far from perfect, but I can't name a commercial product that has it all.

Ask yourself: what does my library have to lose? Why not run an open source catalogue redundantly to your current system to discover the differences for yourself? If you do like Koha, imagine how much you do have to lose in terms of that nasty annual license fee. You can choose to either have the product supported at an affordable rate, or you can just set everything up yourself and never pay a thing except for the cost of the hardware.

The model that Koha is based on is very similar to National Public Radio or the Corporation for Public Broadcasting. Open source is out there waiting to be enjoyed by everyone, regardless of financial status. Just as local programming is developed in your backyard and contributed back to the national efforts, individual libraries can customise their installation. When some flavours are contributed back, like the Nelsonville templates, they prove to be very popular and are widely accepted in turn, like Fresh Air or NOVA. Not everyone supports their local affiliate in a fund drive, and not everyone can afford to financially support the Koha project. When libraries choose to pay for support or new features, everyone benefits, since good, reliable features can be selected out and then rolled into the product. Even small contributions of time and labour end up making large differences in making the product better through collective effort.

There is further information and a demo on the Koha web site: http://www.koha.org. Nicole Engard has a blog entry about Koha: http://www.web2learning.net/archives/1165.

About the Author: BWS Johnson is a graduate of the Graduate School of Library and Information Science at the University of Illinois Urbana-Champaign and was the director of the Hinsdale Public Library in Hinsdale, MA. She was recently honoured to serve as president of the Western Massachusetts Regional Library System.
The Code4Lib Journal, Issue 8, 2009-11-23

Infomaki: An Open Source, Lightweight Usability Testing Tool

Infomaki is an open source "lightweight" usability testing tool developed by the New York Public Library to evaluate new designs for the NYPL.org web site and uncover insights about our patrons. Designed from the ground up to be as respectful of the respondents' time as possible, it presents respondents with a single question at a time from a pool of active questions. In just over seven months of use, it has fielded over 100,000 responses from over 10,000 respondents.

By Michael Lascarides

Introduction

In November 2008, in anticipation of an upcoming home page redesign, the New York Public Library's Digital Experience Group ran a traditional online survey using the popular web-based tool SurveyMonkey.com, which we linked to from a one-line text banner at the top of our NYPL.org homepage. It was a regular, please-answer-these-questions pitch comprised of 19 questions about web usage habits spread across 8 pages. Over 14 days, that survey received 7,341 individual answers to questions from 520 respondents, just 60% of whom completed the whole survey.

About the same time, we discovered the Five Second Test (fivesecondtest.com), a web service built by an Australian design firm based on an idea proposed by usability expert Jared Spool. The five second test (as the name implies) involves showing a visual design to a user for five seconds and then asking them to recall specific features, or asking them which of two designs (each shown for two and a half seconds) they liked better. We even considered using the fivesecondtest.com service to evaluate new NYPL.org designs, but it lacked an easy way to redirect users back to our site once they were finished.

The contrast between the two tools got us wondering if there wasn't a way to make surveys and usability testing more painless (and dare we say fun?) for our users in order to maximize the number of responses received. To be sure, the library has strategic questions that require a lot of setup and deep knowledge about the respondents. We have a lot of those questions, and we are asking them in all their properly-sampled, audience-segmented glory, often with the assistance of consultants and our strategy department. But during the day-to-day process of designing a web site, what is often needed is just a reassurance that our team is on the right track.

We contacted the fivesecondtest.com developers, who graciously gave us their blessing to adapt and expand on their ideas, and set about coding a prototype. Ruby on Rails was chosen as the development platform due to its flexibility and rapid prototyping capabilities, its fully open source codebase, and the fact that our team had already successfully built several Rails-based sites. In February 2009, we launched our solution: Infomaki, an open source web application that incorporates ideas from both the five second test and traditional surveys.
In its first 48 hours of public use, Infomaki collected over 6,900 responses from 840 respondents, almost exceeding the entire total from the two-week traditional survey.

Design and Implementation

Infomaki is a "lightweight" usability testing tool developed to evaluate new designs for the NYPL.org web site and uncover insights about our patrons. Designed to act as a "one question" survey, it presents respondents with a single question randomly selected from a pool of active questions. Initially, two types of questions were supported: multiple choice and "Where would you click to...?" (attached to a screenshot or other image). Recently, we have added five-second tests for comparing two designs and for testing recall of a design's features. Response times for each answer are captured as well.

Infomaki was designed from the ground up to be as respectful of the respondent's time as possible. To this end, all of the language used in the project is geared towards lowering the cognitive load on the respondent. The link from our main web site to the tool reads, "Answer a single question and help us improve our web site!" The "sales pitch" makes it clear that even if you only answer one question, it will be welcomed.

Figure 1. Infomaki public home page

Figure 2. Infomaki sample public page

Responding to a question only takes one click, and the "thank you" page that follows immediately (but politely) asks the respondent to answer another. As such, we're finding that even with the "one question" pitch, an astonishing 90% of respondents answered more than one question, and the average number of questions answered per respondent is almost 11. It seems to be the potato chip of surveys: no one can eat just one.

This has made an appreciable difference in our approach to surveys: rapid feedback leads to rapid turnover. We're mining the vast middle ground between putting a full survey in the field with full protocols and methodologies, and asking people in the office "Does this look right to you?" Infomaki is not intended to be a formal research tool; rather, its strength lies in lowering the turnaround time between formulating a question and getting a response to that question from the general public. To this end, care has been taken to make it as easy as possible for staff to add questions to the system. Designers here have already gotten into the habit of adding questions on Friday night and returning Monday morning to several hundred responses on their latest designs from weekend visitors.

Internally, the application is optimized to store all results from varied types of questions in a single common database table, which makes it extremely easy to analyze response statistics and ensure that no respondent sees the same question more than once. Response data is displayed in tables and histograms (for multiple choice-type questions) and heat maps (for "click on this"-style questions). Heat maps can show up as individual clicks or a percentage grid overlay, and colors are adjustable for contrast with different designs. We welcome those from outside the NYPL who would like to analyze the collected data; feel free to contact us.

Figure 3. Infomaki sample heat map results page

Figure 4. Infomaki sample histogram results page
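The single-table design described above is straightforward to sketch. The following is a minimal illustration, not Infomaki's actual Rails schema or code: it shows how one shared responses table can collect answers from different question types and keep a respondent from seeing the same question twice. The table and column names here are hypothetical.

```python
import random
import sqlite3

# Minimal, hypothetical schema: one row per answer, regardless of question type.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questions (
    id INTEGER PRIMARY KEY,
    kind TEXT,          -- 'multiple_choice', 'click', 'five_second', ...
    prompt TEXT,
    active INTEGER DEFAULT 1
);
CREATE TABLE responses (
    id INTEGER PRIMARY KEY,
    question_id INTEGER REFERENCES questions(id),
    respondent_id INTEGER,
    answer TEXT,        -- choice label, click coordinates, recalled text, ...
    elapsed_ms INTEGER  -- response time, captured for every answer
);
""")

def next_question(respondent_id):
    """Pick a random active question this respondent has not answered yet."""
    rows = conn.execute(
        """SELECT id, kind, prompt FROM questions
           WHERE active = 1
             AND id NOT IN (SELECT question_id FROM responses
                            WHERE respondent_id = ?)""",
        (respondent_id,),
    ).fetchall()
    return random.choice(rows) if rows else None

def record_answer(respondent_id, question_id, answer, elapsed_ms):
    """Store any question type's answer in the one shared table."""
    conn.execute(
        "INSERT INTO responses (question_id, respondent_id, answer, elapsed_ms) "
        "VALUES (?, ?, ?, ?)",
        (question_id, respondent_id, answer, elapsed_ms),
    )
    conn.commit()
```

Because every answer lands in the same table, per-question response counts and average response times reduce to simple GROUP BY queries.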
Results

The "lightweight" level of engagement on the part of the user has led to stellar response rates. In its first seven months of intermittent use, Infomaki has captured 111,823 responses to 231 design, language, and demographic questions from 10,203 individual respondents. That's an average of 484 responses per question posted and 10.96 questions answered per person. When the banner is posted on NYPL.org, roughly 1% of visitors click through from the main site, and over 90% of respondents answer more than the "one question" we asked them for.

Surprisingly, given that it's essentially just a survey tool, users have called Infomaki "fun," "like a video game," and "addictive." More than one person has reported wanting to "find the end" by answering all of the active questions. In fact, the first improvement implemented as a result of user feedback was a way to skip the thank-you page and keep answering questions without interruption.

By testing designs with Infomaki and in-person usability tests in tandem, we have been able to uncover a number of insights and potential pitfalls. Ambiguities in navigation language were especially plentiful; for example, shortening the link to our fundraising page from "Support the Library" to "Support" led to confusion with "technical support" (we reverted to "Support the Library"), and using the label "Community" for a page with links to social networking tools had the unforeseen effect of siphoning clicks away from users seeking information on their local branch library (as much as 40% in one test; we changed the link to "Interact with us").

More broadly, the iterative testing process has made abundantly clear the degree to which changes in a single element have an effect on other parts of the page. On one recent web page design, the main navigation was working acceptably, but when we added a search bar, response times went up precipitously. Analysis showed that the search bar components looked too much like the other navigation links, increasing the (apparent) number of choices that users were required to cognitively parse. Adding a background tint to the navigation and other design cues to create a distinction between the navigation and the search bar returned the response times to acceptable levels.

Figure 5. Navigation design changes

Figure 6. Example of recall test

Drawbacks and Criticism

We have identified a few problems with the Infomaki approach. First, by linking to the survey from the main web site, we don't get a rounded profile of all library users. It's safe to assume that Infomaki respondents are among our more web-savvy patrons. But as long as we're aware of that limitation, we've determined that it's not a detriment to have a bias towards web users, since most questions posted directly relate to the web site.

A more pressing issue is the identification of particular "user segments" or "personas": groups of users who are deemed to have the same general behavior patterns (such as researchers, recent immigrants, and so on). When we recently announced a new round of tests on Twitter, usability expert Craig Tomlin tweeted back, "Ouch, they don't know the persona of the tester!" We had not deemed segmentation as critical for this particular test, since we were a) mainly testing global navigation, which needed to apply to everyone, and b) testing so iteratively that problem areas would become apparent even though we might not know who was having the problem. But Tomlin's comment spurred us to add new tools to mark the referral source of the respondent. We also plan to add cross-segmenting based on the subset of users who answered a demographic question during the same session (for example, show only clicks from respondents over 50 years old).

There are definite order biases that can creep in by presenting questions randomly.
Sometimes, when asking for user feedback in a text field, we will find users responding with the exact language that was used in a previous question. This may be mitigated in the future by presenting questions in a preferred order rather than a truly random one (for example, the "five-second recall" test works best when the user hasn't already seen the same design in a different kind of question). However, we feel strongly that frequent, high-volume iterations of testing, combined with smaller volumes of more formal, segmented testing, should give us a well-rounded view of potential problem areas with web designs.

Open Source Release

The Infomaki source code was released to the community under the GNU General Public License on May 5, 2009. The current release is a "throw it over the side and see if it swims" release. To get it running, one needs to be familiar with the Ruby on Rails programming framework. It has spotty-to-nonexistent test coverage, a bit of vestigial code, and possibly some dependencies on a RubyGem or two that we forgot to package. Some non-user-friendly features remain in the administrative interface, such as the fact that it's possible to delete screenshots that are in use, causing errors. The NYPL plans to update the code to include more user-friendly administrative features by the end of the year. For more information on deploying Infomaki from the current codebase, including step-by-step "quick start" instructions, see this blog post: http://labs.nypl.org/2009/05/06/infomaki-goes-open-source/

Since the open source release, we have been alerted to a couple of similar projects (see Usabilla and Chalkmark in the References section), the developers of which have been gracious in sharing ideas with us. To the best of our knowledge, Infomaki is the only fully open source tool in its class.

Roadmap

We've added a number of generic demographic questions to the mix (how old are you, where do you live, etc.), and the hope is that in future versions we will be able to segment responses to one question based on the answers to another question. For example, we can test familiarity with certain terms in one question and segment out those responses by age (for any respondents who answered both questions).

Behind the scenes, there are definitely some improvements that need to be made. It's becoming clear that a frequent pattern of use is to test the same question ("Where would you click to...?") over screenshots of several variant designs. Right now, one must enter the same question repeatedly to get these comparisons. A future redesign of the administrative interface may allow us to build a suite of questions and simply upload a new screenshot to that suite to launch a new battery of tests and compare it to previous versions.

We have also been thinking about ways to score accuracy, perhaps by adding values to particular click locations. Since the tool is already capturing response times, a scatterplot chart with time on one axis and accuracy on the other would be a compelling illustration of which designs are performing the best.

Ideally, we'd like to work out a way that this tool can be "baked in" to the new NYPL.org redesign so that user feedback becomes an ongoing, always-on process. We are considering ways of displaying the feedback banner based on context, such as only displaying the banner to a small, random percentage of visitors, or only to those visiting certain pages or searching for certain terms.
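As a rough illustration of the time-versus-accuracy scatterplot idea mentioned above, the sketch below plots hypothetical per-design averages with matplotlib. The data points, the accuracy scoring rule, and the design names are all invented for the example; they are not Infomaki results.

```python
import matplotlib.pyplot as plt

# Hypothetical aggregates: one point per candidate design.
# "accuracy" = share of clicks landing in the intended target region (0-1),
# a scoring rule Infomaki does not yet implement.
designs = {
    "Design A": {"mean_seconds": 4.2, "accuracy": 0.81},
    "Design B": {"mean_seconds": 6.9, "accuracy": 0.74},
    "Design C": {"mean_seconds": 3.5, "accuracy": 0.62},
}

times = [d["mean_seconds"] for d in designs.values()]
scores = [d["accuracy"] for d in designs.values()]

fig, ax = plt.subplots()
ax.scatter(times, scores)
for name, d in designs.items():
    ax.annotate(name, (d["mean_seconds"], d["accuracy"]))

ax.set_xlabel("Mean response time (seconds)")
ax.set_ylabel("Click accuracy")
ax.set_title("Which designs are fast and accurate?")
plt.show()
```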
As of this writing, we already have prototype versions of some of these features running locally and will be folding them into the public source code within a few weeks. We encourage everyone to download the Infomaki source and let us know how your experience goes!

References

Infomaki project page at SourceForge: http://sourceforge.net/projects/infomaki/
Infomaki on Twitter: http://twitter.com/infomaki
Infomaki's launch announcement: http://labs.nypl.org/2009/02/16/introducing-infomaki-bite-sized-usability-testing/
Infomaki's open source release announcement: http://labs.nypl.org/2009/05/06/infomaki-goes-open-source/
Jared Spool's original post on the five second test concept: http://www.uie.com/articles/five_second_test/
Five Second Test web service, from the Australian firm Angry Monkeys: http://fivesecondtest.com/
Usabilla, a similar web service (free beta): http://usabilla.com/
Chalkmark, another similar web service (fee-based): http://www.optimalworkshop.com/chalkmark_alt.htm
The design firm Volkside was one of the first users of Infomaki after its public release: http://www.volkside.com/2009/07/usability-test-with-infomaki/

About the Author

Michael Lascarides is the user analyst for the Digital Experience Group of the New York Public Library and a member of the MFA Computer Art faculty at the School of Visual Arts.

2 responses to "Infomaki: An Open Source, Lightweight Usability Testing Tool"

Dennis van der Heijden, 2010-10-29: Besides the NY library, is someone else using this? I'd like to see some working example; the NY lib one is not working anymore.

Jesus Tramullas, 2011-01-16: A reference to this work (in Spanish): http://tramullas.com/2010/11/02/analisis-de-usabilidad-con-infomaki/

The Code4Lib Journal, Issue 35, 2017-01-30

Developing an Online Platform for Gamified Library Instruction

Gamification is a concept that has been catching fire for a while now in education, particularly in libraries. This article describes a pilot effort to create an online gamified platform for use in the Woodbury University Library's information literacy course. The objectives of this project were both to increase student engagement and learning, and to serve as an opportunity for myself to further develop my web development skills. The platform was developed using the CodeIgniter web framework and consisted of several homework exercises ranging from a top-down two-dimensional library exploration game to a tutorial on cleaning up machine-generated APA citations. This article details the project's planning and development process, the gamification concepts that helped guide the conceptualization of each exercise, reflections on the platform's implementation in four course sections, and aspirations for the future of the project. It is hoped that this article will serve as an example of the opportunities, and challenges, that await both librarians and instructors who wish to add coding to their existing skill set.
By Jared Cowing

Introduction

Undergraduate students at Woodbury University are required to take a 1-unit information literacy class, Information Theory and Practice, which is taught by librarians. Each librarian is allowed flexibility in customizing their course sections to suit their teaching style and to try new ideas. A librarian wishing to make such customizations might find themselves frustrated with functional limitations presented by Moodle, Woodbury's course management system. Those possible frustrations include the rigidity in how online activities may be structured: instructors can create a multiple-choice quiz or prompts for the student to write an answer to a question, but it is not possible to create deeper interaction through the recall and manipulation of previous answers, or through more advanced logic to determine custom reactions to student responses. Another frustration might be the visual experience of using Moodle; assignment content competes with other visual information that lines all four sides of the browser window, increasing the cognitive load of users. This visual information ranges from layer upon layer of navigation bars to Moodle announcements to footer disclaimers. While Moodle provides us with deep functionality in many areas that is invaluable, the overall user experience of Moodle could be described by few as "fun" or "inspiring curiosity." Upon encountering these frustrations, I began a project to develop a web platform that could host more customized and interactive class assignments.

Initial Experiment and Obstacles

The goal of this project started out simple: to create an interface that could accommodate multiple-choice questions or written answers, and that allowed for more flexible responses to user input. Just as importantly, the interface needed to be as visually clean as possible, containing the absolute minimum of noise necessary. The hope with this visual approach was that it might reduce cognitive load and help users focus more on the single question being presented to them at any given time. Such an interface was relatively simple to create with basic HTML/CSS, JavaScript, and PHP. It was indeed a cleaner visual experience. However, it became quickly apparent that despite some visual benefits, the functionality was not unique enough to justify a departure from Moodle. Editing the questions required directly entering them in the PHP code, which was certainly no faster than editing questions through a graphical user interface in Moodle. Additionally, there was no authentication or database system in place to securely identify and store student answers. To justify any time spent pushing this idea further, I would need to stop and think more clearly about my goals and the technology needed to achieve them. As someone with novice coding skills, these initial efforts represented an effort to have some fun coding a basic interactive tool. To progress any further, the project's more personal goals would need to take a back seat to the realistic needs of a web platform intended for a real classroom environment.

Figure 1. This might look cleaner than Moodle, but is this quiz that much more engaging, or is it just reinventing the wheel?
Getting Serious About Having Fun

Taking the time to stop and think about where this project was going revealed a clearer set of goals and requirements. A platform or framework was required that allowed for basic functions like user authentication and sessions, the storage and retrieval of information from a database, and usage of a templating system. This platform could not be so overly complex that I would need to spend my time learning how to do things "the Drupal way," for example, and in so doing lose some ability to be flexible and improvise. From a pedagogical standpoint, I required the ability to create assignments that were interactive, engaging, personalized, and that could vary widely in structure from week to week. The answer to these technological needs, CodeIgniter, came through the suggestion of a colleague. The answer to the pedagogical needs came through the concept of gamification.

CodeIgniter

CodeIgniter is a PHP-based web framework that is lean, efficient, and highly extensible. It is built on MVC architecture (model-view-controller) and represented the ideal balance of functionality with the simplicity that allowed for rapid development. The documentation is up to date and extremely clear, and is written in a way that allows for the flexibility to either take advantage of CodeIgniter's MVC structure or ignore it completely. For a beginner, it represents the ideal sandbox to work in, with just the right amount of foolproofing to keep one out of trouble.

Gamification

Gamification has only recently gotten a name but has been a trend in education for some time. At its core, it is "using game-based mechanics, aesthetics, and game thinking to engage people, motivate action, promote learning, and solve problems" (Kapp 2012). It does not need to be the literal use of games in the classroom; rather, it can be any effort that utilizes motivational mechanisms that also exist in games. Done well, it should not come off as a gimmick or as an excuse to play games. Rather, gamification in the classroom should be used with clear learning objectives in mind. To dig deeply into how or why it might work, it is necessary to delve into game theory and the psychology of motivation. Many such theories attempt to break down the individual mechanisms present in most games (Kapp 2012). Other theories detail the various personality types that gravitate to different game mechanisms (Bartle 1996). Still more theories attempt to define the very concept of "fun" or the conditions necessary for a person to experience motivation (Deci and Ryan 2000).

Working Result

Utilizing CodeIgniter and the principles of gamification in a new round of development, the result was a platform containing five separate homework assignments. To access each assignment, the student would click on a link in Moodle that took them to a login screen specific to that assignment. Each student's login credentials were set manually in code at the onset of the course. After logging in, the student would get an introductory screen letting them know what they could expect. After that point, the nature of each assignment diverged considerably.

The first two homework assignments were largely multiple-choice and text-answer based, and utilized much of the pre-CodeIgniter code. Questions were mostly stored and supplied through a PHP script that existed outside of CodeIgniter's MVC structure and were dynamically inserted into the page through jQuery animations.
Figure 2. The beginning of homework #2, which includes custom paths based on the student's choice of major.

The third homework was the first one conceptualized after CodeIgniter and gamification became integrated into the project. The objective of this assignment was to prepare students for a physical library tour that would be taking place the following class. The intent was to have students start the tour already curious about various shelving locations and materials in the library. To achieve that end, the homework assignment became a top-down game in which students had to move around a representation of the library's floor plan to discover all our shelf locations, along with several Easter egg locations. At the end of the game, students receive a prompt to go to the library in person, find some of these shelf locations, and write some observations in a Moodle forum.

The game was built mostly using JavaScript and CSS, and consists of several HTML tables used to simulate the floor plan of each library space. When a player presses an arrow key to move, their current location and desired movement direction are checked against an array of 1's and 0's that dictates whether a player can or can't move in that direction. This array also helps determine whether an event will be triggered upon entering the destination "tile." Curious readers can try the game here using "code4lib" as both the username and password.

The objective of the fourth assignment grew out of my own classroom metaphor that research is like assembling a team of superheroes, in that one must think about how each source complements the strengths and weaknesses of the others, and how each helps to accomplish the "mission" at hand. This metaphor is meant to counter a common student temptation to find several similar sources that all provide the same information. To encourage students to practice this "team building" mentality, the objective of this assignment was to have students assemble a team of "heroes" (sources) to help answer a previously identified research question. The homework consists of a workspace in which they can fill in basic information on each hero to demonstrate what role it plays in their research. Students can also see how far along they are using a progress bar. When they select a type of source to categorize their hero under, a corresponding insignia is revealed to show that they have "recruited" a new source to their team.

Figure 3. The fourth homework assignment on this site, labeled here as #7 because assignments 4-6 were separate activities completed outside of this site.

Making this assignment work required much more constant communication with the MySQL database of student responses, and so the MVC architecture of CodeIgniter was leveraged much more in order to access its native database functions. While there are advantages to the visual continuity provided when using jQuery to dynamically insert new content on a page, properly taking advantage of CodeIgniter's database driver tools required sending information to a controller PHP script and, in so doing, loading a new page using a coded URL. Receiving information from the PHP script and displaying it on the page also necessitated the reloading of pages. I took from this assignment a much better appreciation of the power available through more fully utilizing CodeIgniter's core architecture.
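To make the movement check in the floor-plan game described above concrete, here is a small sketch of the same idea. The original game is written in JavaScript against HTML tables; this Python version, with an invented grid and event map, only illustrates the walkability-array logic and is not the article's code.

```python
# 1 = walkable floor, 0 = wall or shelving the player cannot enter.
# The grid and the event coordinates are invented for illustration.
FLOOR_PLAN = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
]

# Events fire when the player enters certain tiles (e.g. a shelf location).
EVENTS = {(0, 2): "You found the oversize folios!",
          (2, 1): "Easter egg: the staff coffee stash."}

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def try_move(position, direction):
    """Return the new (row, col) if the destination tile is walkable,
    otherwise stay put; also return any event triggered on arrival."""
    row, col = position
    d_row, d_col = MOVES[direction]
    new_row, new_col = row + d_row, col + d_col
    in_bounds = 0 <= new_row < len(FLOOR_PLAN) and 0 <= new_col < len(FLOOR_PLAN[0])
    if not (in_bounds and FLOOR_PLAN[new_row][new_col] == 1):
        return position, None          # blocked: ignore the keypress
    return (new_row, new_col), EVENTS.get((new_row, new_col))

# Example: start at the top-left corner and press the right arrow twice.
pos, event = try_move((0, 0), "right")
pos, event = try_move(pos, "right")
print(pos, event)   # (0, 2) "You found the oversize folios!"
```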
The final homework assignment on this site involved citations and APA formatting. Few people would call this their favorite topic, and so the challenge in developing this assignment was to make it engaging and to prompt students to pay closer attention to how and why citations are used. The result was an assignment containing a mix of small exercises, including a "seek the citation errors" mini-game. It also would recall and display students' answers from previous assignments to prompt them into thinking about what information they would need to look the source up again.

Reception, Aspirations, and Advice

This platform was developed mostly through the winter of 2015/16 and has so far been used by me while teaching four sections of Information Theory and Practice: one in spring 2016, one in the summer, and two this fall. Some takeaways became clear early on, while others have taken some time to think over.

Student Reactions

The most positive reaction seemed to come from the top-down game. Students would begin the following physical library tour asking about parts of the library, such as our loft space, that they did not previously know existed. My speculation is that the positive reaction came in part because this assignment is the most game-like of them all. Additionally, the fact that the game's setting was our own library may have made the experience more personal and relatable.

The version of this platform that was used in the spring and summer did not contain the functionality to recall a student's previous answers. If they restarted a homework assignment, any answers already submitted would not populate in text boxes (except for the "hero recruiting" assignment, which required the recall and display of stored answers to work properly). While this shortcoming was disclaimed in the assignment instructions, it understandably led several students to think that their assignments had been erased or never received for grading. This was a frustrating user experience resulting from my having omitted a critical feature. In the time between the summer and fall semesters, the functionality necessary to display and modify a student's prior answers was added.

Aspirations for the Project

While the result of this project, a gamification of my homework assignments, was worthwhile, the next goal is to think about ways that the platform could be extended to further infuse those concepts into class lectures and in-class activities. Those gamified in-class activities that have been used, such as an online drag-and-drop call number sorting exercise I developed, appear fairly successful in engaging students.

Another aspiration is to develop assessment measures to better gauge the platform's effectiveness. It is difficult to gauge a student's reactions while they complete a homework assignment beyond looking at their homework answers and their overall class performance. In addition, student responses in the standard university course evaluations were usually more general in nature and made few distinctions between the gamified class elements being tested and the core elements that were shared by all sections of Information Theory and Practice. To better measure the success of these efforts, a new assessment method will be needed that prompts more specific feedback from students.

Researching the theories behind gamification has provided new lenses through which to analyze each class assignment and look for ways to engage different types of learners.
Delving deeper into the literature on gamification may reveal new ways that different personality types could be motivated when completing these assignments. These efforts will help to ensure that my own attempts at gamification are not limited by personal preferences for what features are the most engaging and instructive. One possible step might be to empower students who are more motivated by social or competitive game mechanisms; this could be done through features like allowing students in the top-down library exploration game to drop notes and items on the library floor for their other classmates to find.

In teaching a graded course, more control is offered in that students become partly responsible for observing the technological requirements of each assignment. This represents a departure from the realities experienced by professional web developers, who must be prepared to serve users of any device, operating system, or internet browser. This platform made heavy use of CSS3 and jQuery animations that rendered older versions of Internet Explorer useless, and it came as a great surprise just how many students were still using these older browsers. Looking forward, one final goal is to explore making this web platform more accessible for users of different browsers and devices, and also for users with visual disabilities.

Advice for Others

In addition to the other stated goals of this project, it also served as one possible example of how an instructor or librarian might, or might not, be able to use basic coding skills as a tool in their pedagogical toolkit. A project of this nature could be rewarding for both the instructor and students, but with major cautions. One caution is that for any experimental product created by a novice coder, students may find bugs or encounter the occasional user experience difficulty. Expect some inquiries on weekends and late at night, and be prepared to offer quick responses to address any problems right away. Students may need to be offered the benefit of the doubt, and some flexibility, if they say that they ran into trouble. No student wants to be penalized for a technical issue.

Beginning coders should be encouraged to let their imaginations run wild and not be afraid to build things that are flawed in conception and wildly inefficient in execution. It is easy to allow the rigid and complex structures of professional coding practice to dim the light of inspiration that will drive a novice to learn. It is that messy sort of trial and error that can nurture a long-lasting love for coding, and it may produce the occasional innovation worth keeping. Nevertheless, if the intent is to finish with a polished product used by others, one must be prepared to scrap much of what has been done in order to build again with a clear strategy and outcomes in mind. For learning purposes, there should be no shame in having to do something twice if unique insight was gained each time, provided that no deadlines were missed along the way.

Conclusion

I plan to continue using this platform for the near future, with a stronger focus on assessment and a willingness to continually make strategic improvements. I'm especially interested in getting feedback from more seasoned developers and instructors alike about what they think could be improved or done differently.
while this project was far from groundbreaking in the technical or pedagogical sense, the hope is that it represents a realistic example of what can be produced by a librarian or instructor who is interested in trying their hand at coding. references bartle r. 1996. hearts, clubs, diamonds, spades: players who suit muds. journal of mud research. deci el, ryan rm. 2000. self-determination theory and the facilitation of intrinsic motivation, social development, and well-being. american psychologist. 55(1). kapp km. 2012. the gamification of learning and instruction: game-based methods and strategies for training and education. san francisco: pfeiffer. further reading de jong t. 2010. cognitive load theory, educational research, and instructional design: some food for thought. instructional science [internet]. [cited 2016 nov 18];38(2):105-134. available from: http://link.springer.com/article/10.1007/s11251-009-9110-0 ke f. 2009. a qualitative meta-analysis of computer games as learning tools. in: ferdig re, editor. effective electronic gaming in education. hershey (pa): information science reference. p. 1-32. madigan j. 2016. getting gamers: the psychology of video games and their impact on the people who play them. lanham (md): rowman & littlefield. sitzmann t. 2011. a meta-analytic examination of the instructional effectiveness of computer-based simulation games. personnel psychology. 64(2): 489-528. about the author jared cowing is the systems librarian and an assistant professor at woodbury university in burbank, ca. his professional interests include the gamification of libraries and the development of more interactive, intuitive library discovery interfaces using rich library metadata. subscribe to comments: for this article | for all articles leave a reply name (required) mail (will not be published) (required) website δ issn 1940-5758 current issue issue 60, 2025-04-14 previous issues issue 59, 2024-10-07 issue 58, 2023-12-04 issue 57, 2023-08-29 issue 56, 2023-04-21 older issues for authors call for submissions article guidelines log in this work is licensed under a creative commons attribution 3.0 united states license. the code4lib journal – developing an academic image collection with flickr mission editorial committee process and structure code4lib issue 3, 2008-06-23 developing an academic image collection with flickr a group at lewis & clark college in portland are in the process of developing an educational collection of contemporary ceramics images using the photo sharing site flickr as a back end. this article discusses the evolution of the project, flickr machine tags, and the concept of flickr as an application database layer. the article includes code samples for creating and querying machine tags using the flickr api. by jeremy mcwilliams introduction academic visual resources are in the midst of a shift from traditional slide libraries to reliance upon digital collections. rather than loading the slide tray for a class, instructors are turning to digital image collections like artstor, james madison university’s mdid, and collection software like insight, for teaching. such resources tend to have higher quality and better-described images than what one might get from a google image search. yet resources like mdid and artstor are closed data silos and can be difficult to work with due to proprietary presentation software and copyright restrictions on the images themselves. 
and despite typically lower quality images found via google image search, instructors often use those images because they’re easy to find and use (if not necessarily legal). in the summer of 2007, a group at lewis & clark college in portland, oregon, decided to create an alternative image resource collection for education. specifically, the goal was to develop a collection of contemporary ceramics images, as no such resource existed. but rather than gathering images in a closed platform like mdid or artstor, we wanted to develop a collection that had high quality images, was open to anyone, included a distributed model for adding and cataloging images, and was mobile/remixable in the spirit of web 2.0. it became clear that the photo sharing site flickr was an intriguing, if somewhat experimental, solution to achieve these goals. flickr already has a ‘group’ model, in which users can contribute images toward a shared collection. a flickr group can also be moderated, so a curatorial board of designated administrators can accept/reject images submitted to the group. flickr also allows users to assign a creative commons license to images they own, which permits the images to be used with fewer copyright restrictions. in addition, flickr’s impressive application programming interface (api) lets developers easily create web sites with flickr images and data. with these ideas in mind, we decided to take the plunge and attempt to build a contemporary ceramics image collection for education with flickr as the primary back-end. we figured it could end up as a failed r&d experiment, or it could provide a revolutionary model for academic image resource collections. our results thus far are at accessceramics.org. this article will discuss the site design and evolution of accessceramics, flickr machine tags, and the concept of flickr as the database layer for an academic image collection. accessceramics.org: initial design our working group consisted of ted vogel (assistant professor of art, program head in ceramics, department of art, lewis & clark college), margo ballantyne (visual resources curator), mark dahl (assistant director for systems and technical services), and myself, jeremy mcwilliams (digital services coordinator). we didn’t really have any extra time, resources, or additional staff to devote to the creation of the image collection, though our expertise in different areas helped to distribute the workload fairly evenly. ted and margo developed the metadata schema and worked directly with artists, while mark and i handled the technological aspects, including plenty of testing within flickr, and the development on the code base. we hoped to rely as much as possible on the existing flickr infrastructure for collection organization, metadata storage, and user authentication. the idea was to create a lightweight, mobile site that was little more than a thin technological layer on top of flickr. the initial site consisted of php, css, and the jquery javascript library, and handled all data storage within flickr via the api (figure 1). essentially, we wanted flickr to be the database. figure 1: accessceramics initial model [view full-size image] during our initial development, i created a test flickr group, and wrote php code to create an interface that interacted with flickr using its api. the site was designed to work with basic flickr api functions, including flickr authentication, viewing flickr image sets on our interface, adding tags to images, and submitting images to flickr groups. 
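to give a flavor of what these rest interactions look like at the http level, here is a hedged sketch using curl from the command line rather than the php code the project actually used; the flickr.test.echo method simply echoes the request parameters back as xml, and the api key value is a placeholder, not anything from the accessceramics project:

# hedged sketch: a minimal flickr rest call issued with curl.
# flickr.test.echo echoes the request parameters back as xml, which makes
# it a convenient way to confirm that an api key and the rest endpoint are
# wired up correctly before trying write methods. YOUR_API_KEY is a placeholder.
curl -s -G "http://api.flickr.com/services/rest/" \
  --data-urlencode "method=flickr.test.echo" \
  --data-urlencode "api_key=YOUR_API_KEY"

write methods such as flickr.photos.addtags additionally require authentication parameters and generally need to be sent as http posts; those calls are walked through with the project's own examples below.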
once development and test phases were completed, we invited local ceramicists to create free flickr accounts, upload images of their works, and join our flickr group. the artists then used our interface to catalog and submit their images to the collection. to do this, an artist would log in to flickr through our site, select an image from their personal flickr collection, and then add metadata to a cataloging form to describe the image (figure 2). upon submission, php scripts converted the metadata to tags, added them to the image on flickr, and placed it in the flickr group queue, where it awaited approval or denial by a group administrator. figure 2: accessceramics cataloging form[view full-size image] yet our code did more than just convert metadata to tags. in order to create useful metadata for images, the cataloged data was converted to machine tags, a relatively new type of tag structure that we hoped could make the ‘flickr as database’ concept a reality. flickr machine tags one of flickr’s most popular features is the ability for users to describe images with tags. this creates an environment for social photo sharing, as flickr users can easily view sets of images tagged with common keywords. but tags alone don’t provide the depth of metadata description that some image collections might require. users of such collections should be able to search and browse by different fields, and keyword tagging simply doesn’t provide that framework. recognizing that need, flickr launched machine tags in january of 2007. machine tags have the format namespace:field=value, enabling complex image descriptions. not only can machine tags allow field-value relationships, but similar relationships can be grouped together with a common namespace. geotagging is perhaps the most common use of machine tags, as a simple keyword tag of ‘45.12234’ won’t have the same meaning as geo:lat=45.12234. and while there was some initial discussion to regulate the use of machine tag namespaces, the selection of a machine tag namespace in practice is largely arbitrary. while flickr users can create and view machine tags in the flickr interface, they are intended primarily for use in the flickr api. since a user on flickr is not likely to add a tag like ‘image:color=red’, it makes more sense for code in an external application to take user inputted metadata and convert them to machine tag syntax. in the case of accessceramics, we developed a cataloging interface for artists that takes form values, converts them to machine tags, and uses the api to tag the targeted image on flickr. similarly, if a user wanted to browse images in which the glaze is electric oxidation, the application should convert the query to machine tag syntax, perform the query through the api, and return a formatted results set. in other words, the existence and use of machine tags should be invisible to the user, just as sql queries are in common lamp applications. flickr api machine tag code samples flickr’s api has over 100 methods for a variety of purposes, utilizes a rest-style format, and requires an api key. each method has a thorough demonstration page in which users can enter sample queries, and receive xml responses (here’s an example). developers have also created a number of api kits in a variety of languages to further simplify the api interaction process (we use dan coulter’s phpflickr). for the purpose of these examples, we’ll use language-independent rest url queries, with xml responses. 
we hope they will provide some insight into our attempts to store and retrieve image metadata in flickr. adding a machine tag to add a tag or machine tag to a photo in flickr, flickr.photos.addtags is the appropriate method. in this example, we'll add a machine tag to an image for a fictional image collection of dogs. to designate a 'breed' field as 'cocker spaniel', the machine tag could read dogs:breed='cocker spaniel'. below is the structure of the rest url to add the machine tag to the flickr image through the api: http://api.flickr.com/services/rest/?method=flickr.photos.addtags&api_key=xxx&photo_id=2411163173&tags=dogs:breed='cocker+spaniel'&auth_token=xxxx&api_sig=xxx this particular action requires an auth_token and api_sig to verify the write permission to add the tag (more information about flickr api authentication can be found on the authentication api page for flickr services). also, quotes are required around cocker spaniel, as it is a multi-word tag separated by spaces. quotes aren't required if the value is a single word. the response xml simply confirms the addition of the machine tag. the screenshot below shows the addition of the machine tag in flickr. notice that machine tags are grouped separately from normal tags. retrieve results via machine tag an effective method to perform a machine tag search is flickr.photos.search (for more information on machine tag query syntax visit the flickr.photos.search page). this method has an optional machine tag parameter, and can also be narrowed to a group. here is a query of the accessceramics flickr group for images in which the object type is a wall piece: http://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=xxx&machine_tags="machine_tags"+=>+"ceramics:object_type=wall_piece"&group_id=511711%40n24 the machine_tags value in the url is slightly different from the rest of the parameter / value pairs. also, notice the underscore between wall and piece. it is important to note that this is not machine tag query syntax, but rather a decision we made for accessceramics in handling multiple word machine tag values. this practice seemed to yield better success, and we used php to convert underscores to spaces when preparing the html view. the quotes around machine_tags and ceramics:object_type=wall_piece are part of the rest syntax for machine tags used in flickr.photos.search. this yields an xml response that, when parsed, reformatted, and augmented with additional data using the flickr api method flickr.photos.getinfo, can produce a results screen like this (figure 3): figure 3: formatted results screen from machine tag query [view full-size image] 'flickr as database' flickr has an excellent infrastructure for developing image collections, both on the site itself, and with external applications using the api. however, in some ways, it may not be quite ready as an exclusive database layer for an academic image resource. by default, anyone can add tags to images in flickr, and changing tagging permissions is not entirely a straightforward process. this is analogous to permitting anyone to have 'write access' to the database. perhaps in the context of sites like wikipedia, this isn't necessarily a deal breaker. nonetheless, it's difficult to ignore potential cases of sabotage. for example, if a user didn't like a particular piece of art in our collection, he/she could find the image on flickr, and tag it with ceramics:title='i_am_not_fond_of_this'.
while machine tags have potentially expanded flickr as a resource, they still lack a couple important features. perhaps most important is the inability to perform truncated machine tag searches. for example, a search of ceramics:material=clay will return only exact matches; machine tags disallow wild card variables in the value portion of the tag. creating a search interface without this feature would likely create a frustrating experience for the user. also, as mentioned previously, there is currently no authority to regulate machine tag use. this is probably fine for now, but could become an issue if more developers use machine tags. the ‘flickr as database’ model also lacks a degree of centralized control to handle tedious details, like spelling variations on metadata tags. by exclusively using flickr to store metadata, artists would be required to make corrections to their own tags in order to adhere to bibliographic control. this isn’t very practical, as tasks like this should be performed centrally. in our case, we’re indebted to our artists for taking the time to upload and catalog their images; we don’t want to hassle them with nitpicky metadata problems. while the flickr api is quite possibly the web services standard by which other apis should be judged, it lacks certain methods that would be useful for our project. if we wanted to create a ‘browse by field’ screen, for example, there currently isn’t an api method to gather all possible machine tag values in an image collection for a given field. it would require an api call to retrieve a list of images in the collection, another api call per image to retrieve the machine tags, some code to select the desired tags via regular expressions and place them in an array, and finally some processing of that array prior to displaying on the interface. this sequence of events just isn’t practical. a comparable action to retrieve the same result set from a mysql database would require just a single query and some basic conversion of the results to html. because of these various issues, we ultimately decided to abandon using flickr as the exclusive database layer, and began storing metadata in a mysql database (figure 4). not only does this give us more control over the data, but it will make the development of site tools easier, as we won’t necessarily have to depend upon the existence of certain flickr api methods to add functionality. in our new model, artist-entered metadata is stored in the mysql database, and machine tags are generated by an accessceramics ‘super user’ flickr account. we’re still creating machine tags on flickr images, with the hope that functionality will improve in the near future, or that flickr will build a true collection feature to fit cases like ours. the shift to a mysql database will give the site administrator more control of the metadata, make the site run faster, and will catalyze the development of site tools and functionality. figure 4: accessceramics current model [view full-size image] future directions as of spring 2008, accessceramics has a little over one hundred images contributed by about a dozen artists. we’re hoping that volume might be a typical weekly yield in the very near future. we are also attempting to procure grant funding to accelerate the development of the project; up to this point, all work on the project has been squeezed in amongst our other various duties. 
with additional funding, we would hire a coordinator to help artists with the image submission process, and enlist technical expertise to better design and develop the site. aside from increasing the volume of the collection, we plan to develop tools to facilitate educational use of the images. our short term laundry list includes the creation of better searching and browsing capabilities, and facilitating the use of the images and metadata in slideshow and presentation software. we also hope ceramics educators will use the accessceramics collection in the flickr interface, as flickr has a track record of unveiling new tools for users, some of which may be useful in an educational setting. while flickr has ruffled some feathers by now supporting video, the development has added some intriguing possibilities for enhancing accessceramics and similar flickr-based collections. our site could eventually include videos showing comprehensive views of a given piece of art, tours of ceramics exhibits, interviews with contributing artists, and actual lectures or teaching tips that could be useful for education. and videos work seamlessly with the flickr api; this post by flickr’s kellan elliott-mccrea includes further description and code samples for embedding flickr videos within a web application. we hope other developers will continue to discover ways to use flickr for education and digital collections. while some in the academic community might view flickr as little more than a variation of myspace, perhaps the recently added library of congress collection will help change perceptions and encourage more experimentation and development with flickr. peter brantley of the digital library federation and mark dahl from our accessceramics group have discussed the notion of an ‘academic flickr’, theoretically provided as a sub-service of flickr itself. while this may or may not come to fruition, we should take advantage of what flickr has already offered: a free set of wonderful tools to help us redefine what a visual resources collection can be. acknowledgments: thanks to ted, margo, and mark for your hard work, to watzek library director jim kopp for letting us library people work on this, to the contributing artists, and to flickr for being flickr. note: accessceramics is not an open source project, though we would be happy to share our code. email jeremy (jeremy2443@gmail.com) if you’re interested. about the author jeremy mcwilliams is the digital services coordinator at lewis & clark college’s watzek library in portland, or. he has been at lewis & clark for 10 years, and enjoys creating public and staff-side library web applications. subscribe to comments: for this article | for all articles leave a reply name (required) mail (will not be published) (required) website δ issn 1940-5758 current issue issue 60, 2025-04-14 previous issues issue 59, 2024-10-07 issue 58, 2023-12-04 issue 57, 2023-08-29 issue 56, 2023-04-21 older issues for authors call for submissions article guidelines log in this work is licensed under a creative commons attribution 3.0 united states license. the code4lib journal – lantern: a pandoc template for oer publishing mission editorial committee process and structure code4lib issue 53, 2022-05-09 lantern: a pandoc template for oer publishing lantern is a template and workflow for using pandoc and github to create and host multi-format open educational resources (oer) online. it applies minimal computing methods to oer publishing practices. 
the purpose is to minimize the technical footprint for digital publishing while maximizing control over the form, content, and distribution of oer texts. lantern uses markdown and yaml to capture an oer’s source content and metadata and pandoc to transform it into html, pdf, epub, and docx formats. pandoc’s options and arguments are pre-configured in a bash script to simplify the process for users. lantern is available as a template repository on github. the template repository is set up to run pandoc with github actions and serve output files on github pages for convenience; however, github is not a required dependency. lantern can be used on any modern computer to produce oer files that can be uploaded to any modern web server. by chris diaz motivations open educational resources (oer) are free teaching and learning materials that are available online for unlimited use, consultation, adaptation, and distribution. typically, oer are distributed under a creative commons license [1]. while they can be downloaded and used for free, maintaining an oer support infrastructure is an expensive endeavor. for example, academic libraries provide services to faculty focused on oer awareness, adoption, and creation. these services require staffing, training, coordination, technology, and marketing. institutional oer grants and faculty stipends are a popular method for incentivizing and supporting the creation of new oer (santiago & ray 2020). however, in order for the public to reap the benefits, the oer needs to be published. libraries also support the publication of oer by making the oer content discoverable, accessible, preservable, and reusable. the supporting infrastructure that libraries provide for oer raises questions about sustainability. lantern was developed with these questions in mind. the sustainability of an oer depends on the oer’s ongoing ability to meet its educational purpose. for oer initiatives in libraries, there are two primary sustainability concerns to keep in mind: (1) the production and access of oer and (2) the use and reuse of oer by the public (wiley 2006). both of these parts require people, workflows, and technologies. by minimizing the costs of digital infrastructure and maintenance, libraries can increase investments in people and workflows for oer production and access. lantern was designed to reduce the technical complexity of technology stacks necessary for the production, sharing, use, and re-use of oer by meeting these sustainability criteria, adapted from wiley (2006): oer is created in a format that operates equally well across hardware and operating systems oer is available to the public in such a way that edits and adaptations can be made for teaching and learning in a variety of contexts these criteria for the sustainability of oer can be aligned with the principles underlying minimal computing, a framework developed by digital humanists for designing systems that only use the hardware and software resources that are necessary for the task (gil 2015). this thought exercise helped reduce the technology stack lantern uses to create, host, and archive oer. lantern’s stack is focused on structured plain text, static web technologies, version control, and open source software. lantern figure 1. lantern workflow overview. lantern’s workflow begins with common word processing software and ends with a multiformat oer publication. 
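since the workflow figure is easier to picture with commands in hand, here is a hedged approximation of that pipeline in shell terms; the filenames are placeholders, and the real build is driven by lantern.sh and pandoc defaults files, as described below:

# hedged sketch of the workflow in figure 1; lantern's actual commands are
# configured in lantern.sh and its defaults files, so the flags and
# filenames here are illustrative placeholders.
pandoc chapter-one.docx --to markdown --output text/010-chapter-one.md
pandoc text/*.md --metadata-file=metadata.yml --standalone --output index.html
pandoc text/*.md --metadata-file=metadata.yml --output book.epub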
lantern provides a folder template and bash script for using pandoc to convert between manuscript (.docx or .odt), source (markdown, yaml, bibtex), and publication formats (html, pdf, epub) [2]. it is intended to make pandoc, a comprehensive document conversion tool, easier to use for oer creators and publishers who are generally new to command line programs. lantern aims to teach fundamental digital skills in plain text editing, static web technologies, and open source software in order to encourage the use of minimal technology stacks in digital library projects. "minimal technology stacks" refers to the intentional constraints around the technology components required for a computing process. pandoc is a command line tool that converts an input file of one file format into an output file of another file format. both the input and output files need to be represented in a structured markup language. at the time of this writing, pandoc (version 2.17.1) can read from 39 input file formats and write to 62 output formats, with varying levels of accuracy. each conversion can take anywhere from zero to dozens of options from the 96 that are available. the lantern.sh bash script, files, and folders within the lantern template repository simplify the level of customization available to pandoc users for oer production use cases. for example, the bash function responsible for generating the pdf looks like this:

pdf() {
  # combine all markdown files into one
  pandoc text/*.md -o _temp/chapters.md
  # convert markdown to html to pdf
  pandoc _temp/chapters.md \
    --defaults settings/pdf.yml \
    --output $output_directory/$output_filename.pdf
  echo "the pdf edition is now available."
}

the pdf function first calls pandoc to combine all of the markdown files within the /text/ subfolder to make up the body of the oer, ordering them by filename. the function then calls pandoc again to convert the concatenated markdown file into a pdf using the settings specified in a pandoc defaults file [3]. the defaults file specifies a selection of pandoc options, including the metadata, templates, and pdf settings we want to apply [4]. each output format has its own function within the script following the basic workflow but referencing different defaults files and options. ideally, the lantern template repository serves as an approachable foundation from which users can build their own customizations and features for their projects. structured plain text most people write and edit text using a rich text editor, like those found in microsoft word, google docs, wordpress, and email programs. rich text editors display the style elements of the document, but obscure the semantic elements of the document's underlying structure. this leads people to use alignment and bold fonts to signal that a specific text element is a heading, which sighted people may understand visually, but machines (like screen-readers) might miss. to avoid this pitfall, lantern provides tips on tagging manuscript files in .docx format with proper headings and styles using word processing software, like microsoft word. document structure is essential for accessibility, formatting, and portability. lantern uses markdown and yaml as the structured plain text representations of an oer's content and metadata. plain text offers numerous advantages for library-based oer publishers, as tenen and wythoff (2014) explain: plain text both ensures transparency and answers the standards of long-term preservation.
ms word may go the way of wordperfect in the future, but plain text will always remain easy to read, catalog, mine, and transform. furthermore, plain text enables easy and powerful versioning of the document, which is useful in collaboration and organizing drafts… plain text is backwards compatible and future-proof. lantern also provides a file structure for an oer project. an example project folder contains a lantern.sh script file, a metadata.yml file, a subfolder (/text/) for markdown files, and several other subfolders containing the templates, styles, and configurations. this structure enforces a separation between content and style; lantern users only need to use the metadata.yml and /text/ subfolder. markdown provides many advantages for academic writing, oer production, and preservation. john gruber, one of its inventors, explains the philosophy of markdown this way (2004): markdown is intended to be as easy-to-read and easy-to-write as is feasible. readability, however, is emphasized above all else. a markdown-formatted document should be publishable as-is, as plain text, without looking like it's been marked up with tags or formatting instructions. here's an example of markdown syntax:

# chapter title

introductory paragraph text with **bold** and *italic* text.

## section heading

introductory paragraph for the section.

another paragraph, but with a [link to a website](https://example.com).

### subsection heading

more content, but in list form:

- list item
- list item
- list item

markdown is useful for digital publishers and preservationists because it is human-readable in its raw form and machine-readable for converting to html and dozens of other markup formats. lantern mostly follows pandoc's markdown syntax for textual elements, with additional support for numbered references (for equations, figures, and tables), callout boxes, and exercise questions. lantern organizes the content of an oer project as one or more markdown files in a `/text/` subfolder. each file is named according to its numerical order within the larger project:

001-preface.md
010-introduction.md
020-theory.md

lantern uses yaml as its primary metadata format. like markdown, yaml was selected for its readability and ability to be transformed into json and several other structured data formats. the metadata file contains bibliographic metadata fields represented in yaml syntax, for example:

title: lantern
subtitle: an oer publishing toolkit
author:
  - name: chris diaz
    affiliation: northwestern university
  - name: lauren mckeen mcdonald
    affiliation: northwestern university
keywords:
  - textbooks
  - oer
  - digital publishing

github for oer publishing perhaps one of the most powerful advantages for using plain text to organize and produce oer is the ability to use the git version control system and the github ecosystem of collaboration and automation tools. the management, collaboration, and preservation benefits of git and github for library technology projects are well documented (davis 2015, giorgio et al. 2019; becker et al. 2020). lantern demonstrates the benefits of git and github for oer projects. lantern is a template repository on github. it is intended to make it easy for anyone to generate their own oer projects using the same repository structure and files. in practice, a user would login to github, visit the lantern repository, and generate a new repository for their oer project.
they would then add their own project's content and metadata and use lantern's preconfigured settings to produce their multi-format oer for free. lantern's pre-configurations take advantage of github actions [5] and github pages [6]. github actions is a programmable workflow automation tool and github pages is a static website hosting service. these features are especially useful for oer publishing. lantern provides documentation to help users prepare manuscripts in common file formats (.docx) and github actions to convert them to markdown using pandoc. the basic workflow involves the following steps:

user generates a github repository using lantern as a template repository
user uploads .docx files to the /original/ subfolder
user makes a commit using the github web interface: "add files via upload"

this triggers a github actions workflow that performs the following tasks (a local sketch of the conversion step appears below):

provision a hosted virtual machine running ubuntu 20.04+ lts
install pandoc 2.17+
checkout the main branch of the github repository
convert each .docx file to a markdown file using pandoc
move the markdown files to the /text/ subfolder
commit this change back to the main branch of the github repository

figure 2. logs from using github actions to convert manuscript files with pandoc. after this process, the user is ready to check the markdown files for any conversion errors and make necessary changes using github's web interface for editing and previewing markdown. figure 3. github's web interface for code (i.e. markdown) editing. figure 4. github's web interface for previewing markdown rendered as html. lantern adopts a lightweight continuous integration / continuous deployment approach to oer publishing. lantern is preconfigured to build and deploy the html version of an oer project by default. other output formats, such as pdf and epub, need to be enabled by making a change in a configuration file. each time the user makes a commit to either the metadata.yml file or any of the markdown files in the /text/ subfolder, another github actions workflow will be triggered, executing the following tasks:

provision a hosted virtual machine running ubuntu 20.04+ lts
install pandoc 2.17+ and other lantern dependencies
checkout the main branch of the github repository
run the lantern.sh script, which builds a static html website for the oer by default
deploy the website files to the gh-pages branch

once the user enables the github pages feature on their repository, the website files contained within the gh-pages branch of the repository will be made publicly available at a github.io url. from then on, each new commit in a lantern oer's repository will trigger a rebuild and redeployment of the oer website and output formats. users can disable the public accessibility of their in-development oer project by disabling github pages in their repository and re-enable it whenever they are ready to publish. static web technologies lantern builds a static website for the oer's metadata, full-text, and downloadable assets (e.g. pdf, epub, etc.). static websites are faster, cheaper, simpler, and more secure than dynamically generated websites because they remove the authentication, database, and application layer typically used by content management systems (newson 2017; varner 2017; diaz 2018). static websites are read-only and require minimal maintenance in order for the public to visit and use the website. their reduced complexity makes them an attractive option for oer publishers and digital archivists (rumianek 2013).
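returning to the manuscript-conversion step in the github actions workflow above: the same step can be approximated locally for anyone working without github. this is a hedged sketch, not lantern's own workflow code; the /original/ and /text/ folder names come from the article, while the loop itself is illustrative:

# hedged sketch: convert each uploaded .docx manuscript to a markdown file,
# mirroring the conversion step the github actions workflow performs.
for manuscript in original/*.docx; do
  name=$(basename "$manuscript" .docx)
  pandoc "$manuscript" --to markdown --output "text/$name.md"
done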
lantern transforms the metadata.yml and /text/ subfolder into a multi-page static website using pandoc and bash scripting (lantern.sh). if the user decides to produce pdf, epub, and docx versions of the oer project, each of those documents will be linked from the website and available for download. here is the output directory for a real-world example of an oer website built with lantern hosted on github pages [7]:

css/
js/
.nojekyll
cname
index.html
010-intro.html
020-casual-inference.html
030-theory.html
040-data.html
050-hypothesis-testing.html
060-surveys.html
070-experiments.html
080-large-n.html
090-small-n.html
100-social-networks.html
110-machine-learning.html
120-conclusion.html
900-appendix-math.html
clipperton_emps.docx
clipperton_emps.epub
clipperton_emps.pdf

static websites are compelling options for well-scoped web publications, like oer, monographs, scholarship, digital collections, and exhibits, that libraries hope to maintain in perpetuity. lib-static provides models, concepts, and community around leveraging minimal digital infrastructures and static web technologies for library projects [8]. websites that require content management systems and server-side application software in order to function can become costly and difficult to maintain. oer publications in particular may require years of stability, even if the content is no longer updated. static websites provide that stability. portability with open source software library-based publishers, scholarly communications specialists, and open education advocates developed a keen interest in advancing an open infrastructure for scholarly publishing after the news that bepress, a provider of proprietary institutional repository software, was acquired by elsevier (schonfeld 2017; schonfeld 2019). this news generated new investments in open source software development for libraries and non-profits involved in digital publishing, among many other initiatives (lewis 2017; invest in open infrastructure). this momentum provided the motivation to prioritize the use of portable, cross-platform, open source software as the foundation from which lantern was developed. lantern requires the following software programs:

pandoc: a command-line document converter
pagedjs: a pdf generator for html styled with css
any unix shell interpreter with grep, awk, and sed utilities
any text editor

git (a source code version management system), pandoc-crossref (a filter for handling cross-references to equations, figures, and tables), and latex (a pdf typesetting system) are not required but can be useful for collaborative or mathematically-rich oer projects. all of these programs are open source and compatible with macos, windows, and linux operating systems. open source software was a requirement for lantern's design because it is less likely to produce the problem of vendor lock-in, a phenomenon in which a customer becomes dependent upon a vendor's products (maxwell et al. 2019). lantern teaches the fundamental skills of using markdown, yaml, and command line programs necessary to use other software that performs similar functions if and when lantern's software dependencies become unusable for any reason. markdown and yaml can be parsed and converted by hundreds of other software libraries in dozens of programming languages. github is provided as a convenience, but it is not required for use. lantern users can download the template files from github, run the software on their own files, and upload the output files to any web hosting service.
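as a rough illustration of that github-free path, the following hedged sketch clones the template (the repository url comes from the article's notes), builds locally, and copies the result to a web server; whether lantern.sh takes arguments, the name of the output folder, and the destination host are all assumptions or placeholders:

# hedged sketch: build a lantern project locally and publish it to any
# static web host. the output folder name ("public") and the destination
# (user@example.org) are placeholders; consult the lantern documentation
# for the script's actual usage.
git clone https://github.com/nulib-oer/lantern.git my-oer
cd my-oer
bash lantern.sh
rsync -av public/ user@example.org:/var/www/my-oer/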
conclusions reducing unnecessary overhead for the setup and maintenance of systems will ultimately lower technology and labor costs. an oer support infrastructure within an academic library is composed of people, workflows, and technologies. if given a budget to design and implement an oer program, it would be reasonable to think that the administrative roles and editorial processes for oer creation (i.e. the people and workflows) should be the highest priority, with promotion and discovery for oer adoption (i.e. people and workflows) following closely behind. lantern was designed to enable librarians to have a robust publishing workflow with the fewest technology maintenance expenses in order to devote more resources to the labor of oer creation and adoption. by adopting approaches like lib-static and minimal computing, librarians can focus on developing transferable skills rather than learning specific platforms. it is not the goal for lantern to become a “publishing platform” for oer. the goal is to demonstrate how fundamental digital skills with structured plain text, version control, and open source software can help librarians design and deploy sustainable web products for their communities. notes [1]: overview of oer licensing here: https://en.wikipedia.org/wiki/open_educational_resource#licensing_and_types [2]: git repository of lantern on github: https://github.com/nulib-oer/lantern [3]: pandoc documentation on using defaults files for configuration management: https://pandoc.org/manual.html#defaults-files [4]: example of a pandoc default’s file used for managing pdf output settings: https://github.com/nulib-oer/lantern/blob/main/settings/pdf.yml [5]: github actions is a continuous integration and continuous deployment service: https://github.com/features/actions [6]: github pages is a static website hosting service: https://pages.github.com/ [7]: example of the website files generated from lantern (https://github.com/nulib-oer/emps/tree/gh-pages) and the final website (https://emps.northwestern.pub/). [8]: lib-static community website: https://lib-static.github.io/ bibliography becker d, williamson e, wikle o. 2020. collectionbuilder-contentdm: developing a static web ‘skin’ for contentdm-based digital collections. the code4lib journal [internet]. (49). [accessed 2022 feb 23]. available from: https://journal.code4lib.org/articles/15326. davis, r. 2015. git and github for librarians. behavioral & social sciences librarian 34.3. 159–164. available from: https://academicworks.cuny.edu/jj_pubs/34/. diaz c. 2018. using static site generators for scholarly publications and open educational resources. the code4lib journal [internet].(42). [accessed 2022 feb 23]. available from: https://journal.code4lib.org/articles/13861. giorgio s, et al. 2019. what is git/github? – librarycarpentry/lc-git: library carpentry: introduction to git. [internet]. available from: http://doi.org/10.5281/zenodo.3265772. gil a. 2015. the user, the learner, and the machines we make. minimal computing: a working group of go: dh [internet]. [cited 2022 february 24]. available from: https://go-dh.github.io/mincomp/thoughts/2015/05/21/user-vs-learner/. gruber j. 2004. markdown: syntax. daring fireball [internet]. available from: https://daringfireball.net/projects/markdown/syntax. invest in open infrastructure (page 1). invest in open infrastructure. [accessed 2022 feb 23]. available from: https://investinopen.org/about/. lewis dw, goetsch l, graves d, roy m. 2018. 
funding community controlled open infrastructure for scholarly communication: the 2.5% commitment initiative | lewis | college & research libraries news. doi: https://doi.org/10.5860/crln.79.3.133. [accessed 2022 feb 9]. available from: https://crln.acrl.org/index.php/crlnews/article/view/16902. maxwell jw, hanson e, desai l, tiampo c, o’donnell k, ketheeswaran a, sun m, walter e, michelle e. 2019. setting context. in: mind the gap: a landscape analysis of open source publishing tools and platforms. [accessed 2022 feb 23]. available from: https://mindthegap.pubpub.org/pub/gei072ab/release/2. newson k. 2017. tools and workflows for collaborating on static website projects. the code4lib journal [internet].(38). [accessed 2022 feb 23]. available from: https://journal.code4lib.org/articles/12779. rumianek, m. 2013. archiving and recovering database-driven websites. d-lib magazine [internet]. [cited 2022 february 23]. available from: http://www.dlib.org/dlib/january13/rumianek/01rumianek.html. santiago a, ray l. 2020. navigating support models for oer publishing: case studies from the university of houston and the university of washington. reference services review. 48(3):397–413. doi:10.1108/rsr-03-2020-0019. schonfeld rc. 2017. elsevier acquires institutional repository provider bepress. the scholarly kitchen. [accessed 2022 feb 23]. available from: https://scholarlykitchen.sspnet.org/2017/08/02/elsevier-acquires-bepress/. schonfeld rc. 2019. invest in open infrastructure: an interview with dan whaley. the scholarly kitchen. [accessed 2022 feb 17]. available from: https://scholarlykitchen.sspnet.org/2019/06/12/invest-open-infrastructure/. tenen d & wythoff g. 2014. sustainable authorship in plain text using pandoc and markdown. the programming historian [internet]. available from: https://doi.org/10.46430/phen0041. varner s. 2017. minimal computing in libraries: introduction. minimal computing [internet]. [accessed 2022 feb 23]. available from: https://go-dh.github.io/mincomp/thoughts/2017/01/15/mincomp-libraries-intro/. wiley d. 2006. on the sustainability of open educational resource initiatives in higher education. oecd’s centre for educational research and innovation [internet]. [cited 2022 february 24]. available from: https://www.oecd.org/education/ceri/38645447.pdf. about the author chris diaz (https://chrisdaaz.github.io) is the digital publishing librarian at northwestern university. he is an avid user of static site generators for library-based publishing projects, such as journals, monographs, exhibits, and open educational resources. lantern received financial support from the association of research libraries’ venture fund program in 2020. subscribe to comments: for this article | for all articles leave a reply name (required) mail (will not be published) (required) website δ issn 1940-5758 current issue issue 60, 2025-04-14 previous issues issue 59, 2024-10-07 issue 58, 2023-12-04 issue 57, 2023-08-29 issue 56, 2023-04-21 older issues for authors call for submissions article guidelines log in this work is licensed under a creative commons attribution 3.0 united states license. the code4lib journal – editorial mission editorial committee process and structure code4lib issue 58, 2023-12-04 editorial issue 58 of the code4lib journal is bursting at the seams with examples of how libraries are creating new technologies, leveraging existing technologies, and exploring the use of ai to benefit library work. 
we had an unprecedented number of submissions this quarter and the resulting issue features 16 articles detailing some of the more unique and innovative technology projects libraries are working on today. this issue features several articles on how libraries are using programming tools and related technologies to enhance work in technical services. enhancing serials holdings data: a pymarc-powered clean-up project discusses the use of the alma api and the python library pymarc to conduct a post-migration clean-up project on serials holdings data. the use of python to support technical services work in libraries features research detailing the various ways libraries are using python in technical services, and pipeline or pipe dream: building a scaled automated metadata creation and ingest workflow using web scraping tools describes the use of python and apis to automate the collection of documents and data online, while a practical method for searching scholarly papers in the general index without a high-performance computer discusses how the r programming language can be used to build a bibliography and visualizations from the general index. in addition to the use of coding, other articles describe using various technology tools to enhance library work. using scalable vector graphics and google sheets to build a visual tool location webapp describes the use of google sheets and svgs, while bringing it all together – data from everywhere to build dashboards and real-time reporting using the alma api and google apps script discuss how the authors used tools like powerbi, powerautomate, and google apps to gather disparate data for dashboards and reporting. other technology tools being used include airtable and aviary, discussed in using airtable to download and parse digital humanities data and leveraging aviary for past and future audiovisual collections respectively. standing up vendor-provided web hosting services at florida state university libraries: a case study describes how florida state university libraries is using reclaim hosting's domain of one's own web-hosting service to provide web domains to fsu faculty, staff, and students. also included in this issue are articles on technology being used for archives and digital collections, including islandora for archival access and discovery on how unlv implemented islandora 2, and developing a multi-portal digital library system: exploring the technical decision-making in developing the new university of florida digital collections system, about how uf created their own digital collections system using a combination of python, apis, elasticsearch, reactjs, postgresql and more. jupyter notebooks and institutional repositories: realities, opportunities and exploring a path forward discusses how institutional repositories can be used to preserve scholarship housed in jupyter notebooks. issue 58 also includes several articles that explore the potential use of ai in libraries, including the use of chatgpt in beyond the hype cycle: experiments with chatgpt's advanced data analysis at the palo alto city library and automated speech recognition technologies in comparative analysis of automated speech recognition technologies for enhanced audiovisual accessibility. finally, using event notifications, solid and orchestration for decentralizing and decoupling scholarly communication describes koreografeye, an automated assistant prototype that can enhance scholarly infrastructure, providing value-added services to institutional repositories.
subscribe to comments: for this article | for all articles leave a reply name (required) mail (will not be published) (required) website δ issn 1940-5758 current issue issue 60, 2025-04-14 previous issues issue 59, 2024-10-07 issue 58, 2023-12-04 issue 57, 2023-08-29 issue 56, 2023-04-21 older issues for authors call for submissions article guidelines log in this work is licensed under a creative commons attribution 3.0 united states license. the code4lib journal – applying lessons from 8 things we hate about it to libraries mission editorial committee process and structure code4lib issue 13, 2011-04-11 applying lessons from 8 things we hate about it to libraries book review of 8 things we hate about it with commentary on how susan cramm’s points can be applied to libraries. cramm, susan. 2010. 8 things we hate about i.t. boston (ma): harvard business press. coins by timothy m. mcgeary introduction in 8 things we hate about it susan cramm discusses frustrations business leaders have with it and strategies to remove those frustrations in order to form effective partnerships.  while this book focuses on solving the traditional corporate struggle between business leaders and their it group, library technologists are in a unique position in this struggle: right in the middle. the vast majority of academic libraries fall into two categories: libraries that are separate from central it and libraries that are merged with it in one organization. in both of these examples, librarians are the “business leaders” bringing projects, ideas, and requirements to library technologists, the latter of which fit the classic role of it.  but within the larger organization, library technologists turn around seeking the same support from central it.  often this role is a no-win situation, due to lack of power and difficulties in communication.   even though the examples susan cramm offers are focused on the private sector, the principles are just as important to discuss and implement in libraries.  all librarians and library technologists would be wise to heed the lessons in ms. cramm’s book, including those from libraries that either completely control or outsource their entire it infrastructure and support. cramm lays out the following battles with it: service or control, results or respect, tactics or strategy, expense or investment, quickness or quality, customization or standardization, innovation or bureaucracy, goodness or greatness.  for each battle, i will briefly discuss ms. cramm’s descriptions of the battles and suggestions for overcoming these challenges, while providing examples and commentary on their applications to the library.  before going any further, it is important to define a few key terms that ms. cramm repeats frequently: term definition business or line leader generally non-technical, high-level staff who come up with ideas that promote a business idea or end-user goal; in the library, this can be applied to library directors, librarians, or other library staff not directly building or supporting technology. it leader generally technical staff, but sometimes refers to management of technology teams or units, and must make decisions consistent with the enterprise; in the library, this can be applied to a systems librarian who is the one-stop source of all technical solutions. enterprise the entire it infrastructure and technological foundation of the business.  this includes staffing, servers, software, operations costs, recovery, new development, and support.  
when enterprise is applied to a people group (also referred to generally as it), it can be assumed a cio or cto is the top personnel of the enterprise, and thus the enterprise people group works in the best interest of the cio/cto office. 1) service or control it leaders want to please business experts because they get excited about solving big problems, but it leaders answer to both business leaders and the enterprise, which may make it difficult for them to fully meet expectations. business leaders have a present problem and want a present solution, but the enterprise requires long-term sustainability.

the line leader wants service | the enterprise wants control
i have a great idea for a project. | yes, but it's not good enough.
the systems need to do a lot. | yes, but you can spend only a little.
i know what the systems need to do. | yes, but you need to make sure others agree.
i have a vendor that i would like to work with. | yes, but you need to use approved vendors.
there's a package that does everything we need. | yes, but you need to comply with standards.
i'd like to get started right now. | yes, but you have to wait for resources.
i'd like the project to get done faster. | yes, but you need to follow processes.
the project just needs a little more time (or money). | yes, but your time is up, and you need to put the project out of its misery.
the system's ready to be rolled out. | yes, but you need to comply with the security, regulatory compliance, and business continuity policies.
the system has generated great business. | yes, but you haven't increased your p&l targets.
the system needs to be upgraded (or enhanced). | yes, but you aren't using what you have.
table 1-1 service or control (p. 17)

there are four parts to what cramm describes as a balanced approach for organizations to have successful partnerships between business and it leadership. these are: realizing values (investing in projects with the highest roi and holding those investments accountable); serving the business (balancing innovation as an art not a science, thus projects are most successful when outcomes are clearly defined, the time frame is limited, funding comes in stages, scope is managed appropriately, and the right resources are assigned); running efficiently (simplifying and standardizing technologies, vendors, and operations, and creating efficient automated processes); securing the future (leveraging existing it solutions to support strategic business requirements, thus eliminating the multitude of applications and systems needing support, upgrades, or overhauls in the future). (p. 18-19) the corresponding key to finding a balance in libraries is acting in partnership with both business and it. it's not service or control, but service and control. partners back each other, and look to work together for a successful venture. librarians need to realize solutions must be sustainable, not just fill a need. technologists need to realize they can't solve a problem on their own just because a problem has been stated. both groups need to think strategically together about how a potential project benefits their larger organization. early in my career, as the lone library technologist in my organization, i made four of the eleven business-leader statements listed in table 1-1 when communicating with central it staff about a project. in fact, i was so entrenched in the library business requirements that i lost sight of both the relationship dependency i had with central it and the enterprise.
the result was a very vocal argument with their cubicle farm.  regardless of any shared blame, i was both too naïve and too inexperienced to make such demands on the enterprise.  while the hard knocks experience taught me a great deal, i would have rather been given this advice before that project, rather than spending the greater part of the next year re-building the respect i had lost. 2) results or respect the business leader wants results and generally does not understand the it requirements those results depend on.  most frequently the solution has been to go around it directly to a vendor if the business leader does not believe it will produce their required results.  but often it is still required to intervene, at a minimum, or take over entirely.  (p. 29-30) cramm urges business leaders, and i believe this is just as true for library leaders, to consider focusing on building connections and relationships with it instead of merely focusing on the results. (p. 30) make no mistake – this is hard: personalities clash, experiences vary, and alignments differ.  consider: it people are more aligned and/or affiliated with it and their technology than their library organization at large. it staff have different backgrounds or experiences within libraries and library business that makes communication difficult. business and library leaders focus on getting a project done now; it leaders focus on getting the project done right, to sustain and support it. (p. 31) but it is important to realize that both business and it staff hate the control of bureaucracy.  this common point of pain can often be a jumping point in starting to build a sustainable, and respectful, relationship between library leaders and it. another consideration for library leaders in building respect with library it is this statistic: on average, 55% of it human resources are required to maintain, support, or operate it solutions.  10% are typically dedicated to planning, with 35% left over for new development. (p. 34) mileage will vary, but my opinion is that within the library, it technologists spend closer to 70% on maintenance, operation, and support, leaving 30% for planning and new development.  often planning gets an even shorter stick, or maybe no allocation at all, to the detriment of both future development and especially the inevitable, and more difficult, support, thus straining the next cycle of new development. library business leaders can change the rate of results by changing their focus from results to respect.  add an it leader on your team, elevate it to a true partner status, and develop interpersonal connections with it members who are working on your projects.  these are excellent ways to ensure greater rates of success for your projects.  last, seek their input about what they expect a library business leader (you) to accomplish. (p. 38-40) 3) tactics or strategy strategies are not goals, annual objectives, or key initiatives.  strategy is also not about meeting short-term demands that overwhelm our time and resources.  strategy, as cramm defines it, is the foundation used to make daily, monthly, quarterly, and annual decisions, and the tactics concerning goals, objectives, and initiatives. (p. 43-45) because of short staffing in many, if not most, libraries, strategic planning tends to lose out to demands, objectives, and goals.  library leaders decide that demand for applications, and the innate desire to fulfill that demand, suffices for it strategy.  
what is often not realized is that building strategy is an ongoing process, and not finite events that occur every three to five years. (p. 60) annual goals lists, while effective for planning, should be checked against and prioritized by your it strategy.  if your library doesn’t have one, now is the time to start making one. but who initiates strategy?  typically both business and it wait for the other, as shown in the following table. business says it leaders should… it says that business should… “partner more with the business to identify how technology can be strategically deployed for the business.” “involve it as early as possible when contemplating changes to the business or new initiatives.” “set common goals for long-term business initiatives and the technology supporting them.” “involve it early and strategically.” “get more involved in the early stages of developing strategy.” “bring in it early; initiate frequent strategy discussions.” “be involved in the strategic meetings within all aspects of the business.” “collaborate with it to develop strategic and operating plans.” “offer visioning workshops to broaden the minds of business leaders.” “involve it in strategy discussions.” table 3-1: who should involve whom? (p. 45) the key is to work together.  it will often not know the business problem, especially in the library, which needs to be solved with an it-based strategy.  likewise ordering it solutions, as one-off applications, will reveal your ineffective it-based strategy and, likely, an unwilling it participant. (p. 46) while this chapter deals with complex strategic positioning, librarians and technologists should follow a similar path when developing it-enabled business strategy, as cramm describes in the figure below: understanding the fundamentals industry, competitive environment, key trends business fundamentals (business outlook, economic model, key metrics, long-term targets) key initiatives brainstorm how to influence performance customers (number, channels, features, pricing) capital (facilities, equipment, inventory, etc.) resources (people, raw materials, energy, etc.) cycle time (sales, order, fulfillment, replenishment, etc.) articulate business objectives as they relate to cost focus, value differentiation, flexibility, agility, growth, and human resources articulate it objectives understand it fundamentals (strengths/weaknesses, key metrics, long-term targets, key initiatives) for each business objective, articulate the role of it and the implication to process, information, and it architecture identify it-enabled initiatives articulate key initiatives, timing, and success measures figure 3-1: deriving it-enabled business strategy (p. 55) the key to all this is gaining quality commitment to the strategy beyond the quality of the idea. (p. 60) ideas come and go; yet even good ideas require strong commitment for a project to become successful.  without commitment from both the library leader and it, a good idea could easily become a failed project.  by building a strong strategic foundation based on respectful communication of both it and library/business initiatives, projects will have a much better chance for successful implementation and on-going support. 4) expense or investment of the 8 things we hate about it, this might be the most complicated to translate to library land because it has such a corporate feel to it.  
but in reality, the pieces of demand management, which is the focus of deciding if a project is an expense or an investment, is just as important within library technology projects.  whether demand management is actually performed or not will likely vary based on the size and it-maturity of the library.  whatever level of demand management is implemented plays an important, if not vital, role in successful technology ventures. so what is demand management?  cramm defines it as the “process of allocating limited resources to the overall benefit of the enterprise.” (p. 64) continuing in the definition, demand management is a cycle and a process involving the following steps: strategic planning – as addressed above, this provides the context for prioritization of investing in it-enabled projects portfolio management – provides an on-going guidance on project investment decision and project review decision rights – the principle that business leaders have authority to decide what it is needed and it leaders have authority to decide how it is delivered financial planning determines amount of funding or price it costs for business to find external funding prioritization – occurs parallel to and in conjunction with the four above steps value management – accountability for defined business value and monitoring of actual results (p. 64-65) of all enterprises, libraries can surely understand the definition, and impact, of limited resources; therefore it is even more critical for libraries to implement demand management.  without using pieces like strategic planning and portfolio management, libraries can become easily overwhelmed with too many projects on-going or too many applications to support long-term.  at the same time, it is important that each side, library business leaders and it leaders, be responsible for their areas of expertise.  not only does this maintain balance within project-planning relationships, but it also keeps accountabilities balanced when evaluating outcomes. of all areas, value management is the step most often ignored or implemented poorly in all businesses that use it.  we easily attempt to predict or define the value of a project beforehand, but do we honestly seek to evaluate whether that value was returned?  libraries rarely have a financial return on investment, so usage statistics are often the most reliable return, albeit sometimes misleading.  regardless of the metric used to rate the return on investment, it is prudent for libraries to take the time to evaluate whether existing projects are as valuable in operation as they were predicted during planning. 5) quickness or quality only half of it projects deliver successfully or on time, on budget, and on spec.  12% deliver too little or are canceled, 33% too late (and by an average of 71% longer than scheduled), and more than one-third are over budget (by an average of 41%). (p. 85-86) the biggest reason is that defining specifications and requirements is very hard, and iterations take time to get it right.  this naturally results in a frustration between “done” and “done enough.”  (p. 86) the balance is between getting it done and getting it done right.  otherwise, an organization may find that being out of balance results in standalone solutions that are difficult to integrate or over-engineered solutions that miss the business requirements.  cramm suggests, however, that project success can be increased through these steps (p. 
89):
define a clear objective for the business and people on the front lines
engage the intellectual, emotional, and tangible aspects of the people affected by this new project
integrate and streamline business processes
leverage existing technology to the fullest
use fast-cycle approaches, using the best human resources and delivering every 3-6 months
beyond cramm's suggestion of delivering every 3-6 months, there is increasing investment within higher education in agile (often implemented as scrum) project methodology. some colleges and universities have joined together to create an organization called scrumu (http://www.scrumu.edu) to support implementations of scrum project management. i have implemented agile project management at lehigh with some initial success. [1] agile allows both quick delivery of solutions and the ability to revise requirements and specifications before going too far down the wrong path. as always, results will vary based on the implementation, but it can provide a jumpstart to solving the dilemma cramm is describing in this section. 6) customization or standardization cramm compares moving a software solution into production to moving into an exclusive community – it must measure up to the existing standards. (p. 108) how "gated" the higher education and library production environments are varies widely. much commentary has been made recently about enterprise versus toy projects in the code4lib community (see hellman, 2011 [2], for example), but in reality, in a production environment there are existing standards that must be met. these existing standards usually do (and should) contain requirements for adequate documentation of how the application was built, proof that the application was thoroughly tested, and evidence that it meets the organization's requirements for compliance, security, and service. the benefit of such scrutiny, once accepted, is that your project acquires security, backup, performance monitoring, and detailed support. (p. 108) but cramm adds that you still need to do some more work (p. 109-110):
train users thoroughly to use the full application
resolve technical issues quickly
maintain and enhance the application
secure funding/resources to support all of the above
in the rest of the chapter cramm focuses on the need to reduce the total cost of running technology. it is important, as the business leaders in the relationship between the library and central it, to consider the impact our new project has on the it enterprise. do we really need a separate server? can an existing database server be leveraged? can we piggyback on a storage solution? are we utilizing enough of the servers and storage already provisioned for us? libraries should think carefully about these questions and their respective answers when considering the deployment of our new technology. 7) innovation or bureaucracy cramm states that 37% of it leaders and 50% of business leaders agree: "it is overly bureaucratic and control oriented." (p. 125) technologists would rather have business leaders marvel at a successful project than complain about why it says it can't be done. but it struggles daily with an overwhelming number of project requests, on top of supporting existing enterprise solutions. (p. 126) three key areas bog down it (p. 126-127): dealing with existing complexity of mixed and matched solutions, unstandardized legacy systems, and standalone applications. reducing this complexity would significantly reduce cost.
promoting enterprise interest – once it finds and implements standardized technology, it becomes a strategic initiative to leverage it where possible, often at the frustration of business leaders looking for new innovations. lack of resources – it is often not proactive in supporting the organization or driving business decisions because it lacks resources to be proactive while standardizing and supporting the current enterprise. to create more innovation, library leaders need to take more responsibility to be innovative themselves.  this requires a more hands-on approach in partnership with it.  this removes the “reading of the mind” aspect of the business-it relationship.  requirements documents do not do this – person-to-person activity does. (p. 129) secondly, ask it to create a solution that allows you to do more without their need to intervene. gain their trust by asking them to build the functionality for you to do more on your own, while admitting that any harm you do will have serious consequences in a system clean-up.  take accountability for that privilege and power. (p. 129-130) finally, forge a new partnership, as cramm shows in table 7-1 from page 139: helping it help the business helping the business help it embrace enterprise interests clarify and streamline decision making strengthen relationships with it create a business-savvy and service-oriented it organization develop it-enabled business strategies facilitate the development of it-enabled business strategies and enterprise architecture generate value from it-enabled investment ensure that it-enabled value is realized deliver quickly and with quality deliver quickly and with quality focus customizations on necessary differences drive down year-over-year lights-on expenses invest in it smarts enhance business partner self-sufficiency assume permanent accountability for the it assets that fuel your business fully democratize it table 7-1: leadership principles of the new partnership 8) goodness or greatness cramm hopes that by addressing the first seven issues, business leaders can promote it from being just good to great.  this will happen from encouraging more it-smart business leadership to treating it as a partner, not as a service, as well as following the entire organizational strategic chain of planning, investing, innovating, delivering, and operating. (p. 141-143) confronting the difficult facts of supporting the entire enterprise, cramm asks us, as business leaders to our it, to answer these questions (p. 143-144): what have i learned? what frustrates me about it? how am i contributing to my frustrations with it? what am i going to do about it? but while we may think we know what we need to change, cramm advises that it is important to do a reality check with our it partners.  get feedback and own up to it. (p. 146) cramm gives excellent examples about how good it can be, and realistic examples of how bad it can get without the proper leadership and partnership. conclusion it is the responsibility of library leaders to set the tone of partnership with it, and our role as library technologists to see both sides of the coin.  library technologists have a unique opportunity to serve library organizations as both it providers to librarians and business leaders to the larger supporting it organization.  susan cramm offers excellent advice, strategy, and most importantly examples that can be used to position library technology in effective partnerships within our organizations.  
it is not an exaggeration, in the ever-increasing technological era we live in, to describe library technology as a keystone for our organizations. but in taking on that position, it is even more important to heed the strategic advice of leading effectively between the business of libraries and the enterprise of it.
references:
[1] mcgeary, timothy m. january 2011. implementing agile at lehigh university. teamdynamix community column [internet]; available from: http://community.teamdynamix.com/columns/73/implementing-agile-at-lehigh-university
[2] hellman, eric. february 8, 2011. toys and tools vs. the enterprise at code4lib. go to hellman [internet]; available from: http://go-to-hellman.blogspot.com/2011/02/toys-and-tools-vs-enterprise-at.html
about the author
tim mcgeary serves as the team leader of library technology at lehigh university libraries, and is responsible for the technology infrastructure, specifically focusing on leading the development of efficient and dynamic solutions to connect library collections to users. tim has presented nationally and regionally on library technology development, managing electronic resources, open source solutions, and the kuali open library environment (ole). tim currently serves as kuali ole functional council member for lehigh university, the code4lib journal editorial committee, the lyrasis advisory panel for open source initiatives, and the niso erm data standards & best practices review committee.
2 responses to "applying lessons from 8 things we hate about it to libraries":
jenn riley, 2011-04-12: fantastic writeup tim. your insights applying this model to libraries strongly resonate with me, highlighting issues i and many others are currently struggling with. thanks for this.
timmcgeary, 2011-04-14: you are very welcome, jenn. it is my pleasure. thank you for your kind comments.
the code4lib journal – a video digital library to support physicians' decision-making about autism
issue 23, 2014-01-17
a prototype digital video library was developed as part of a project to assist rural primary care clinics with diagnosis of autism, funded by the national network of libraries of medicine. the digital video library takes play sample videos generated by a rural clinic and makes them available to experts at the autism spectrum disorders (asd) clinic at the university of alabama. the experts are able to annotate segments of the video using an integrated version of the childhood autism ratings scale-second edition standard version (cars2). the digital video library then extracts the annotated segments, and provides a robust search and browse feature. the videos can then be accessed by the subject's primary care physician. this article summarizes the development and features of the digital video library.
by matthew a. griffin, mlis, dan albertson, ph.d., and angela b. barber, ph.d.
introduction a prototype video digital library was developed as part of a partnership between primary care medical clinics and the university of alabama, a project funded by the national network of libraries of medicine, southeastern/atlantic region (nnlm sea). this article summarizes the current functionality of a prototype video digital library aimed at supporting physicians treating potential autism cases and describes the developmental processes and implemented system components. caregivers of children who fail an autism screener at the primary care clinics partnering with this project may be asked if trained staff at the clinic can conduct a video-recorded structured play-sample for further analysis at the university of alabama autism spectrum disorders (asd) clinic. these videos (children playing) can be uploaded to the video digital library, which in turn makes them available to the asd clinic, where teams of experts can select full play-sample videos to examine within a secure area of the video digital library. observations, scores, and other clinical notes can be attached or appended to the video (i.e. annotated) using an integrated version of the childhood autism ratings scale-second edition standard version (cars2). the video digital library then uses evaluation data and other input from autism experts to segment the full videos, which would otherwise be inefficient for a child’s primary care physician to use within the context of patient care or to navigate to the meaningful observations within a video. researchers in this ongoing project anticipate that physicians will be able to use the processed shorter video clips, generated from the full play-sample videos, and corresponding embedded feedback from autism experts, to make decisions about patient care. the digital library ultimately enables physicians and autism experts alike to search and browse usable individual video clips of patients by score, test type, gender, age, and other attributes, using an interface designed around the envisioned users (i.e. health professionals) and basic human computer interaction (hci) metrics. having the processed video clips accessible through the digital library allows physicians to compare and contrast different patients and observations across a larger video collection, which can, ultimately, enhance understanding about autism. the video digital library is a secure, web accessible, video retrieval system that uses an apache web server and ssl certificates for information transmission over hypertext transfer protocol (https). php, mysql, javascript, jquery, python, and ffmpeg are the current developmental tools used for implementation and ongoing maintenance of the video digital library: mysql is the database system of the video digital library. php enables database connectivity and added security. python programs execute the video processing functions of the video digital library. the python functions segment and index the contributed video files by handling the naming and iterative functions. ffmpeg manipulates the video files. details of the video library’s implementation is presented in this article, along with plans for future development, which will further examine certain characteristics of clinical videos for improving search, browse, and application of video as a resource in the context of patient care. 
technical innovations of this project include the design of user interfaces to collect, order, and effectively present different information formats for supporting clinical tasks and decision-making. key innovations of video digital library in this section, two important pages of the digital library with unique interface features will be detailed. the design and functionality of the annotation and search pages contain key innovations of the video digital library. annotation page the annotation page enables experts of the ua asd clinic to score children and their behaviors while watching a video play session in the digital library. this page incorporates a digital adaptation of the childhood autism ratings scale-second edition standard version (cars2) for conducting patient assessments (schopler, et. al., 2013)]. definitions of scores and categories were inserted by the developers into the annotation page so that the autism experts could quickly reference the meanings of each score across all categories. each category (e.g. imitation, object use, relating to people, etc.) of the cars2 has a function that allows the autism experts to indicate up to three time points, along with an accompanying notes field for times indicated. the autism experts, using the video digital library, manually input time points in mm:ss format and input text in the notes field. the form does not allow for saving drafts; however, since the autism expert must watch the full video to conduct assessments and complete the annotation form, this was deemed to not be a necessary function. once the form is submitted, a record for the video is transmitted to the database, and the full video is marked for processing by programs used to extract the video clips. full videos are processed, and individual clips of notable tests are extracted and created. therefore, corresponding search results of individual clips are available on the site within 15 minutes of the annotation form being completed and submitted. figure 1. screen capture of the annotation page. search page the search functions of the video digital library comprise the other innovative page. first, a user can narrow search results dynamically by using the picnet table filter (tapia g … [updated 2013]), described further below. the filter creates a dropdown menu that the user can use to limit the search results by hiding the content of filtered table elements. in addition to the dynamic content, such as age or patient id being filtered, the static definitions of the scores displayed can be refined by keyword search. all limiters of the picnet table filter can be used in combination with each other to limit results shown. in the event of a system error, the clear selection button placed at the top of the results page will refresh and initiate a new session. as important as these two pages are for users of the video digital library, the overall functionality is dependent on its systematic ability to segment video files, i.e. extract video clips. figure 2. screen capture of the search page with picnet filter. development of video digital library the video digital library is a new application that relies on a number of open source tools. open source tools used for implementation, other than those briefly mentioned above, are described in this section. 
open source tools used
with the assortment of processes required to build this interactive video digital library, the primary developer looked for open source tools that would provide the means to develop and implement the needed functionality. the smarty template engine formed the building blocks for the individual webpages (basic syntax … [updated 2013]). smarty "tags" allow variables to be securely shared across multiple webpages, which helped avoid having to store variables in the address bar or in cookies. smarty tags also help compartmentalize the development process, and the smarty template engine permits each page – specifically the annotation and search pages – to have the required flexibility of unique programming and functionality. adminer (formerly phpminadmin), an open source program, is used as a database management tool. adminer minimized installation steps and requirements (given it is a single php file) and provided a clean interface for the database manager. picnet table filter, as presented above, provided an open source solution for the search page, adding the ability to filter results once the main search query is submitted. picnet was developed with jquery, and thus can dynamically change the results displayed through keyword or dropdown menu. having the ability to search and browse by both known and exploratory criteria is crucial for a digital library type of retrieval tool.
security
security of the video digital library was a significant consideration throughout development and implementation; several security measures and methods were implemented. these barriers were designed to make it unlikely that an unauthorized user or program could access information, such as the videos, clinic notes, and scores, from the video digital library. the primary security features included using php templates to hide webpage extensions in an attempt to camouflage the language in which the webpages were written. in addition, secure sessions were implemented, which were tested with several hacking scenarios by the primary developer. these tests helped suggest several security approaches and enhancements. one specific change, implemented by the primary developer, was the addition of a conditional statement that tests the user's ip address against the address saved for the current login session. if the ip address changes during a session, the user is blocked and presented with the login screen so that authorized users can re-enter their credentials. video being the most important aspect of the video digital library, we used caution in designing the mechanism for delivering video. we decided to use flash early in the development process so that older web browsers would be able to play the video. but without a video streaming server, flash would download a copy of the video to the user's computer, causing a serious security issue for the video digital library. we researched several methods to provide video streaming for the video digital library, and the h264 module was found to fit our needs. after installing h264, several changes to the code were necessary. this was an easy task because the video digital library references a single function in order to play video; making the necessary changes there allowed the h264 module to work for the entire video digital library. full directions for h264 are available from http://h264.code-shop.com/trac.
to prevent sql injection during user searches, the system uses the php data objects (pdo) extension (achour m … [updated 2013]). a pdo driver is database specific; for this project we used pdo_mysql, but several others are available at: www.php.net/manual/en/pdo.drivers.php. pdo also cleans sql queries of any trouble caused by characters such as ')', ';', '}', and ']'. the example code below shows how we used pdo to insert records into the database.

$database;
try {
    $database = new PDO('mysql:host=127.0.0.1;dbname=database_name', 'user', 'password');
    // solution to connection error by adding the following lines.
    $database->setAttribute(PDO::ATTR_DEFAULT_FETCH_MODE, PDO::FETCH_ASSOC);
    $database->setAttribute(PDO::ATTR_PERSISTENT, true);
} catch (PDOException $error) {
    $database = new PDO('mysql:host=localhost;dbname=database_name', 'user', 'password');
    $database->setAttribute(PDO::ATTR_DEFAULT_FETCH_MODE, PDO::FETCH_ASSOC);
    $database->setAttribute(PDO::ATTR_PERSISTENT, true);
}

figure 3. connection to the database using pdo

// prep the query to the database with the appropriate info about clips.
// the body of the prepared statement was lost when the article was rendered to text;
// the insert below is a reconstruction – table and column names are illustrative,
// while the named placeholders are the original ones.
$query = $database->prepare(<<<SQL
INSERT INTO clips (sessions_id, patient_id, rating_id, path, key_frame, user_id)
VALUES (:sessionsid, :patientid, :ratingid, :path, :key_frame, :user_id)
SQL
);

figure 4. preparing the query using pdo

$query->execute(array(
    ':sessionsid' => $_SESSION['sessionsid'],
    ':patientid'  => $patientid,
    ':ratingid'   => $ratingid,
    ':path'       => "$dir/$filename.mp4",  // the directory variable name was lost in extraction; $dir is a stand-in
    ':key_frame'  => "$dir/$filename.jpg",
    ':user_id'    => $_SESSION['userid']
));

figure 5. writing to the database using pdo

as seen above, pdo is used to connect to the database, and the :table_field => $php_variable pairs insert each record. this approach prevents a user from conducting a search for "child play; truncate table patient;" and causing everything in the patient's table to be erased.
video processing
the program to automatically segment the video into individual clips required multiple iterations, or versions. one such version did not use python, but instead called ffmpeg directly from php. however, in order to quickly develop a script to control file naming and check for file creation, we used python. the main functions of the python script are to call ffmpeg for clip creation and keyframe extraction from the clips, and to create the filenames and file paths.
clip extraction
ffmpeg enables video clip extraction. the correct syntax to create a usable video file was discovered by trial and error. the best solution was to use the application handbrake, which uses ffmpeg to process video. as a function of the software, handbrake displays the command line parameters used when converting video. the resulting command from handbrake is "ffmpeg, '-y', '-i', filename, '-vcodec', 'libx264', '-s', '320x240' …", which was then adapted for use in the python script as seen below.

import subprocess
from time import sleep

def cut(movie, start, clip_name):
    # calls ffmpeg to slice a 15-second clip out of the full video
    if subprocess.Popen(['../ffmpeg', '-y', '-i', movie, '-ss', str(start), '-t', '15',
                         '-vcodec', 'libx264', '-s', '320x240', '-crf', '20',
                         '-async', '2', '-ab', '96', '-ar', '44100', clip_name]):
        sleep(120)  # solves racing condition when clips are long

figure 6. primary python function

as seen in the comments of the code, a race condition was discovered late in the development process, and the quick fix was to add a wait command. it is bad practice to depend upon a wait command to fix a bug. however, as this script runs on the server, the user will never be aware of this wait time. in addition, this fix appears to be one of the few ways to fix this race condition, known as "time of check to time of use," or tocttou.
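to give a sense of how cut() fits into the larger workflow, here is a minimal, hypothetical driver sketch. it is not the project's actual script: the helper names (mmss_to_seconds, extract_keyframe, process_annotations) and the loop structure are assumptions, and it hard-codes the 15-second clip length from figure 6 rather than parsing the real duration from the ffmpeg log as described under keyframe extraction below. the mm:ss time points, the sequential clip names, and the /data/<patient id>/<session id> layout come from the article.

import subprocess

# assumes cut() from figure 6 is defined in the same script

def mmss_to_seconds(timestamp):
    # annotation time points are entered as mm:ss on the annotation page
    minutes, seconds = timestamp.split(':')
    return int(minutes) * 60 + int(seconds)

def extract_keyframe(clip_name, keyframe_name, clip_length=15):
    # grab roughly the middle frame of the clip as its visual surrogate
    midpoint = clip_length / 2.0
    subprocess.call(['../ffmpeg', '-y', '-ss', str(midpoint), '-i', clip_name,
                     '-vframes', '1', keyframe_name])

def process_annotations(full_video, time_points, session_dir):
    # time_points: flat list of mm:ss strings collected from the cars2 form
    # (up to three per category across twelve categories, so at most 36 clips)
    for count, timestamp in enumerate(time_points, start=1):
        clip_name = '%s/%s.mp4' % (session_dir, count)
        keyframe_name = '%s/%s.jpg' % (session_dir, count)
        cut(full_video, mmss_to_seconds(timestamp), clip_name)
        extract_keyframe(clip_name, keyframe_name)

# example: the second play sample of patient 1, with two hypothetical annotated time points
# process_annotations('/data/1/2/full_video.mp4', ['02:15', '07:40'], '/data/1/2')

placing '-ss' before '-i' asks ffmpeg to seek before decoding, which keeps the keyframe grab fast; the project's script instead derives each clip's duration from the ffmpeg log (see keyframe extraction below) rather than assuming 15 seconds.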
tocttou describes the race condition which occurs in the fraction of a second between the check for a file and the use of that file. in that fraction of a second the file can be deleted, moved, or accessed by another program, making the file inaccessible (mathias et al. 2012). since this is not a problem unique to the python language, the effect can only be mitigated (pilgrim m … [updated 2013]). the race condition is compounded by the twelve ranking categories, each one having the possibility of three video clips being created. so, at most, thirty-six video clips and keyframes are created. having the program wait was the best solution, as there is no reliable method to check that a file is not in use or readable prior to calling ffmpeg.
keyframe extraction
keyframes are individual frames, i.e. still images, used to represent the visual contents of a video to a user through a user interface. for this project, once a video clip is created, a keyframe (i.e., visual surrogate) is needed to represent the clip to users on the search results page. the keyframes extracted and employed for the video digital library are stored as jpegs in order to keep file size down. keyframes are selected by simply choosing and extracting the middle frame from a video clip and designating that as the visual surrogate. for example, if a clip has 100 frames, approximately the 50th frame in the clip would be extracted and designated as the "keyframe." this keyframe selection approach has proven to be just as effective as other, more complex image processing approaches previously evaluated for detecting the "best" keyframe through content-based comparisons. extracting the middle frame from the video was made a bit more complex in this context since there was no control over the length of the video uploaded to the website. the best solution was redirecting the standard output of ffmpeg using the parameter 'stdout=subprocess.PIPE', shown in the following code sample. the output contains detailed information about the video file created by ffmpeg. another function writes the output to a text file in order to extract the duration of the video. extracting the duration is detailed below.

import subprocess

def logfile(clip_name):
    # run ffmpeg with no output file so that it prints the clip's metadata,
    # and capture that text from the redirected standard output
    result = subprocess.Popen(['../ffmpeg', '-i', clip_name],
                              stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return [result.stdout.readlines()]

figure 7. python function for redirecting the standard output of ffmpeg.

b'ffmpeg version n-43574-g6093960 copyright (c) 2000-2012 the ffmpeg developers\n'
b' built on aug 15 2012 05:18:13 with gcc 4.6 (debian 4.6.3-1)\n'
b" configuration: --prefix=/root/ffmpeg-static/64bit --extra-cflags='-i/root/ffmpeg-static/64bit/include -static' --extra-ldflags='-l/root/ffmpeg-static/64bit/lib -static' --extra-libs='-lxml2 -lexpat -lfreetype' --enable-static --disable-shared --disable-ffserver --disable-doc --enable-bzlib --enable-zlib --enable-postproc --enable-runtime-cpudetect --enable-libx264 --enable-gpl --enable-libtheora --enable-libvorbis --enable-libmp3lame --enable-gray --enable-libass --enable-libfreetype --enable-libopenjpeg --enable-libspeex --enable-libvo-aacenc --enable-libvo-amrwbenc --enable-version3 --enable-libvpx\n"
b' libavutil 51. 69.100 / 51. 69.100\n'
b' libavcodec 54. 52.100 / 54. 52.100\n'
b' libavformat 54. 23.100 / 54. 23.100\n'
b' libavdevice 54. 2.100 / 54. 2.100\n'
b' libavfilter 3. 9.100 / 3. 9.100\n'
b' libswscale 2. 1.101 / 2. 1.101\n'
b' libswresample 0. 15.100 / 0. 15.100\n'
b' libpostproc 52. 0.100 / 52. 0.100\n'
b"input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/videos/demo01/1/1/3.mp4':\n"
b' metadata:\n'
b' major_brand : isom\n'
b' minor_version : 512\n'
b' compatible_brands: isomiso2avc1mp41\n'
b' encoder : lavf54.23.100\n'
b' duration: 00:00:15.03, start: 0.000000, bitrate: 522 kb/s\n'
b' stream #0:0(und): video: h264 (high) (avc1 / 0x31637661), yuv420p, 320x240 [sar 1:1 dar 4:3], 385 kb/s, 29.97 fps, 29.97 tbr, 30k tbn, 59.94 tbc\n'
b' metadata:\n'
b' handler_name : videohandler\n'
b' stream #0:1(und): audio: aac (mp4a / 0x6134706d), 44100 hz, stereo, s16, 128 kb/s\n'
b' metadata:\n'
b' handler_name : soundhandler\n'
b'at least one output file must be specified\n'
figure 8. raw output of ffmpeg as written in the text file.
the script iterates over each line of the ffmpeg output, as seen above, checking for the string "duration", and then extracts the duration of the full video by slicing the string. a conditional test for the duration line, combined with string slicing, triggers the retrieval of the full video's duration. the appropriate time at which to extract the keyframe is computed by converting the clip's duration into seconds and dividing it by two. with the middle of the clip calculated, python names each clip iteratively with the format string ('%s' % (count)). the clip number, or file name, matches the resulting keyframe name. for example, python names the full video full_video, and each extracted video clip is numbered sequentially (e.g. 1.mp4, 2.mp4). the filename for the keyframe matches that of the video clip, with a .jpg file extension. the full video, extracted clips, and representative keyframes for each submitted play sample are all saved in a separate folder. the file path for each play sample (i.e. session for a child) is built using the patient id and the session id. for example, the path for the second session, or "play sample," of patient 1 would be '/data/1/2'. using python in conjunction with ffmpeg was challenging, but the end result is a digital library application able to process video and extract shorter clips according to input from the autism experts.
future development
at the conclusion of the primary development stages, the system developers compiled a list of tools that would further enhance the implementation and functionality of the video digital library. the annotation and search pages, discussed above, were notably complex to design and implement. future development of the system may benefit from incorporating ajax (asynchronous javascript and xml) to make the annotation form more dynamic and user friendly, considering that ajax allows webpage content to change and update without reloading the full webpage (ajax tutorial … [updated 2013]). one obstacle remains for any future development of this system. when patient age is used to search for clips, several clips are returned, but one of the clips returned does not match the search criteria. it is not an error with the database entry and must be an error in the php code. further development of the search page is at the top of the developers' priorities due to the importance of being able to search the digital library using different user criteria. search is further complemented by browse functions, e.g. for browsing search results, which is significant for visual collections, as users expect visual feedback and surrogates to peruse a collection and to base relevance judgments on.
conclusion
open source tools and platforms were combined to make a stable, secure, and innovative video digital library.
lessons learned from the development of this system motivate future research and experimentation. this project demonstrates potential for enhancing clinical services in underserved areas. furthermore, the prototype video digital library can be used to streamline patient assessments by autism experts and provide enhanced information back to the patient's primary physician for making clinical decisions. future research with this project can also employ the prototype presented here to examine how video digital libraries enhance understanding of autism among physicians and physician/caregiver communications. in addition, findings from future research may lead to health literacy centered designs of video digital libraries.
references
achour m, betz f. php manual [internet]. [updated 2013 nov 15]. www.php.net; [cited 2013 nov 17]. available from: http://www.php.net/manual/en/
ajax tutorial [internet]. [updated 2013 nov 17]. www.w3schools.com; [cited 2013 nov 17]. available from: http://www.w3schools.com/ajax/
bellard f. ffmpeg documentation [internet]. [updated 2013 nov 17]. www.ffmpeg.org; [cited 2013 nov 17]. available from: http://www.ffmpeg.org/documentation.html
basic syntax [internet]. [updated 2013 nov 13]. www.smarty.net; [cited 2013 nov 13]. available from: http://www.smarty.net/docs/en/language.basic.syntax.tpl
mathias p, gross t. protecting applications against tocttou races by user-space caching of file metadata. 2012. acm sigplan notices; 47(7):215-226. available from: http://libdata.lib.ua.edu/login?url=http://search.ebscohost.com/login.aspx?direct=true&db=edswsc&an=000308657200020&site=eds-live&scope=site
pilgrim m. 2009. dive into python 3 [internet]. apress; [cited 2013 nov 17]. available from: http://www.diveinto.org/python3/
schopler e, van bourgondien m. childhood autism rating scale, second edition (cars2) [internet]. [updated 2013 nov 15]. pearson; [cited 2013 nov 15]. available from: http://www.pearsonclinical.com/psychology/products/100000164/childhood-autism-rating-scale-second-edition-cars2.html
tapia g. picnet table filter [internet]. [updated 2013 nov 17]. www.picnet.com.au; [cited 2013 nov 17]. available from: http://www.picnet.com.au/picnet-table-filter.html
about the authors
matthew a. griffin is a first year doctoral student in the college of communications and information science, university of alabama, where he recently received an mlis. matthew is interested in researching digital libraries as part of his doctoral program. dan albertson is an associate professor in the school of library and information studies, university of alabama. his work can be found in places such as the journal of the american society for information science and technology, journal of documentation, journal of information science, journal of education for library and information science, and others. dan's primary research interest is interactive information retrieval; he holds a ph.d. in information science from indiana university, bloomington. angela barber is an assistant professor in the department of communicative disorders, university of alabama. her research focuses on early identification and intervention with young children who have autism. she holds a ph.d. in communication disorders from florida state university.
the code4lib journal – archidora: integrating archivematica and islandora
issue 39, 2018-02-05
"archidora" is shorthand for the publicly available integration between the open source software packages archivematica and islandora. sponsored by the university of saskatchewan library, this integration enables the automated ingest into archivematica of objects created in islandora. this will allow institutions that use islandora as a digital asset management system, particularly for digitized material, to take advantage of archivematica's standards-based digital preservation functionality, without requiring staff doing digitization to interact with archivematica. this paper outlines the basic functionality and workflow of archidora; provides an overview of the development process including challenges and lessons learned; and discusses related initiatives and possible future directions for development.
by tim hutchinson
project background and institutional context
the university of saskatchewan library has a long and productive history of digitization projects, with the first "virtual exhibit" launched in 1995, and over 50 thematic projects, both institutional and consortial, completed since then. until 2013, the university of saskatchewan archives and the special collections unit of the university of saskatchewan library were organizationally separate, but collaborated on projects and both received systems/it support from the library. starting in 2005, the university archives moved to a home-grown database for digital projects, allowing us to more systematically create metadata records and store high resolution images. in the meantime, special collections had started using contentdm for various collections. the initiation of the saskatchewan history online (sho) project in 2011 allowed the university library to move towards a more programmatic approach to digital initiatives. sho, originally dubbed the saskatchewan multitype digitization initiative, was a three-year provincial project funded by saskatchewan's ministry of education – the government home of both the provincial library and the multitype library board. the university of saskatchewan library was contracted to coordinate the project, with a committee of the multitype library board, the saskatchewan digital alliance (sda), serving as an advisory body to the project. sda planning as well as discussions within the university library helped identify digital preservation as an important component of the sho project. for too long, our projects had produced extensive and unique digital content, but without any preservation plan or infrastructure beyond basic backups. a consultation with artefactual systems (lead developers for archivematica) in january 2012 set the wheels in motion towards our selection of archivematica as a preservation system to ultimately manage digital content from various sources and systems.
in addition to islandora, which was selected as the digital asset management system for the sho project (and as a replacement for contentdm), we were also using dspace for an institutional repository of electronic theses and dissertations. of course, we also had material from many different digital projects (managed in legacy systems or as flat files) – not to mention born digital records. we thought that addressing the preservation of digitized material would be a good way to establish an infrastructure for digital preservation that could later be used for other systems like dspace, as well as the more complex area of born digital records.
development process
the main development work on the archivematica/islandora integration was completed between 2013 and 2014, with additional enhancements and bug fixes since then. overall, the project has moved much more slowly than anyone had anticipated – during the design, development, and deployment stages. discussions with artefactual systems started later in 2012, with design requirements partly informed by an islandora camp session that summer. the first quote was received in october 2012. however, this focused on an archivematica-to-islandora workflow. further discussion about requirements – especially the requirement to have this process be automated, with staff doing digitization not needing to interact with archivematica – led to a more detailed project outline, and the clarification that development of islandora would also be required. artefactual took the lead on the project, coordinating with discovery garden, which would ultimately be the subcontractor for the development on the islandora side. the complexities of the technical interactions between the two systems, clarifications relating to university of saskatchewan workflows, and time constraints for all parties meant that a substantially revised quote did not arrive until may 2013. a contract was approved for work to take place between october 2013 and january 2014. ultimately this main phase of work was completed in september 2014. testing during the first phase of development revealed some limitations in the archidora module, i.e. desired functionality which had not been considered. in particular, we wanted to add a configuration option to suppress archivematica ingests for a given collection; to add better handling for compound objects and books; and to add the ability to delete obj datastreams (i.e. the master objects) in bulk. we undertook a separate contract with discovery garden, during fall 2014, to address this. since then, progress on moving into production has been slow, and we are now targeting early 2018 for this milestone. however, i am risking becoming the boy who cried wolf: i have declared that deployment was close at least a few times over the last couple of years. there have been a number of factors contributing to this. both archivematica and islandora are under active development, and have many moving parts. on a few occasions, a change to another part of the code, or even configuration changes, have led to regressions or unexpected behaviour. in some cases it has been difficult to quickly assess whether the source of the problem is in islandora or archivematica. and resolving one issue has sometimes surfaced others. our lead in-house islandora developer made some changes to the archidora module in order to allow recursive use of the drush script. there was an understandable learning curve involved, and the usual competing priorities for time.
in tying up loose ends on the archivematica side, we did not initially formalize or schedule the work required. this work got done through a combination of goodwill on the part of artefactual (especially for a few items out of scope of the original contract), and the university library’s support contract with artefactual. as a result, timelines were a lot longer than would have been the case with a paid contract. design and functionality in general terms, objects are first ingested into islandora. these are added to a deposit in the archivematica storage service. after each deposit is finalized (reaching a certain size or time limit), they are sent to the archivematica dashboard to be processed, with an archival information package (aip) as the final output. this schematic from artefactual outlines the detailed workflow as initially developed. it remains largely accurate. figure 1. islandora/archivematica integration workflow, as originally designed [1]. the islandora/archivematica integration was achieved through the development of an islandora module called archidora; and through code changes to both the archivematica dashboard and storage service. on the archivematica side, a key component, which will facilitate integrations with other systems, was the development of a sword api, part of the storage service’s rest api. the sword api “allows 3rd party applications to automate the process of creating transfers” (sword api [updated 2017]). the following walks through the basic configuration, and the workflow for an object starting with its ingest into islandora. configuration – islandora configure the archidora module in islandora, with basic settings like storage service url, username and api key. you can also configure the maximum size of the transfer and the length of time (maximum age) before the transfer will be finalized (the settings are labelled aip but it’s really the deposit/transfer). after a transfer is finalized, no more objects will be added to it, and the processes on the archivematica side can begin. settings for the size and maximum age will largely depend on institutional workflows as well as server capacity (e.g. size of transfers that archivematica can handle). maximum size is likely more important, but the maximum age ensures that a transfer will not get abandoned if enough objects are not added to it. the cron time setting helps account for the ingest of compound objects, so that the xml file documenting the relationships between objects is ingested along with the objects. figure 2. archivematica integration configuration in islandora configuration – archivematica in the storage service, add a location … figure 3. configuration of location for fedora deposits in archivematica storage service … and a corresponding space for fedora deposits figure 4. configuration of space for fedora deposits in archivematica storage service the pipeline also needs to be configured as for any archivematica deployment, but to ensure that transfers are approved automatically, the api username and key need to be populated. figure 5. configuration of pipeline in archivematica storage service you can also configure a post-store callback, so that archivematica will update islandora with information about objects that have been ingested and therefore can be deleted from islandora. this generates a list in islandora’s “manage” tab for the relevant collection for manual processing, with an interface allowing either individual or bulk deleting (see figure 18). 
institutions may want to use this functionality to avoid redundant storage of the master digital objects, if the master no longer needs to be accessible in islandora. figure 6. configuration of post-store callback in archivematica storage service. in archivematica, it is important to make sure that the processing configuration is set up not to require any user intervention. figure 7. archivematica processing configuration if your digitization workflow ensures that you’re already reliably creating preservation quality files, you may also want to disable certain default normalization rules. for example, if a tiff file is identified in the format policy registry as a preservation format, archivematica will still normalize it if a normalization rule is enabled for that format (which is the default for tiff files). this will result in both the original file and the normalized file being saved in the aip. the ability to disable a rule was added as part of archidora development, but an arguably more intuitive enhancement would be to automatically skip the normalization task if an object is already in a preservation format. figure 8. archivematica preservation planning entry islandora workflow now we’re ready to bring some objects into islandora. if you use the zip importer, then it’s best to configure it to use the filename as the datastream label (e.g. filename.tif rather than obj). if you use a generic datastream label, archivematica will still be able to ingest the objects, but there may be microservices that return errors due to the lack of a file extension. in particular, ffmpeg has known issues. figure 9. configuration of islandora’s zip importer you can use any method to ingest objects into islandora. archivematica interacts with the fedora repository, not the drupal frontend. there is also a drush script available to do batch transfers of objects already in islandora. for this example, we have a small collection of three tiff files. figure 10. objects ingested into islandora. archivematica workflow on the archivematica side, the deposit initiated in islandora is first sent to the storage service. initially the islandora mets files are saved, and archivematica fetches the corresponding objects and mods files. figure 11. downloaded fedora deposit, showing submissiondocumentation folder. the objects are saved in separate subdirectories. this covers situations where there may be duplicate filenames in the same transfer – especially the “obj” naming mentioned earlier. figure 12. downloaded fedora deposit, showing top level. figure 13. downloaded fedora deposit, showing one object subfolder. once the deposit is finalized (following one or more cron calls in islandora), a transfer is initiated on the archivematica dashboard. there is an automation tool available to automatically remove successfully completed transfers and ingests from the dashboard.  the transfer name is automatically generated from the mods title (using the first record, if there are multiple objects in the deposit); where needed it will be sanitized to deal with spaces, diacritics, etc. figure 14. archivematica dashboard, showing transfers in progress. completed aips can be browsed and searched on the archival storage tab. in addition to the usual search functionality, you can search for an islandora pid (e.g. islandora:1234) via the identifiers field. the full mods is not indexed, since it’s assumed users will do any detailed searching in islandora. figure 15. 
archivematica dashboard, archival storage tab the aip structure mirrors that of the deposit. figure 16. downloaded aip, showing objects folder. islandora troubleshooting and follow-up back in islandora, the manage | archivematica tab provides information about the status of individual objects. you can also use the “send to archivematica” button to initiate a new deposit. this is useful for testing, but would also be required in the case of a failed transfer or if an object in an existing record is replaced (archivematica will only fetch new objects). figure 17. islandora administrator interface, showing manage | archivematica tab log reports are also available under reports | recent log reports. these are primarily useful in the case of a failed deposit. at the collection level (or compound object/book level), the manage | archivematica tab provides an interface for deleting objects that have been ingested by archivematica, if the callback is configured. figure 18. islandora administrator interface, showing manage | archivematica tab at the collection level challenges and lessons learned planning while code regressions and limitations on people’s time (as outlined above) are hard to avoid, a key lesson learned is that the design requirements should have been much clearer from the outset. indeed, the initial requirements were developed by artefactual and discovery garden following a series of e-mails and meetings; the university library did not submit any formal documentation. this lack of concrete direction contributed to the false start we experienced, with a quote for an integration assuming workflow would start in archivematica. further, at the outset of the project, we had only recently adopted islandora; and as project lead i had very little hands-on experience with islandora before we started to do quality assurance, let alone during the development of requirements. for example, we missed some simple things, such as object naming conventions for the default batch import routine (resulting in duplicate filenames), or detailing how different object types (e.g. compound objects) should be handled. since this development work was part of the sho project, it was not actually managed within the library as a separate project, which might have helped mitigate some of these planning shortfalls. sustainability an ongoing challenge relates to ongoing maintenance and sustainability of the code. this initially resulted from a lack of understanding about the different development models for archivematica (maintained by artefactual systems) and islandora. artefactual systems is the “lead developer” of archivematica. as such, any development work done for clients is pursued with open source release, and more general utility, in mind. as artefactual describes on its website: “… although we may develop new features for you, we will not be creating a custom application that will need to be maintained by your organization. instead, we will incorporate the new customizations into later releases of the software and support them independently of your organization so that others can benefit from them” (artefactual systems – services – development). the development relating to the archivematica/islandora integration has been incorporated into the most recent public releases of archivematica, and is stable as of archivematica 1.6.1 and storage service 0.10.0. my experience had been with atom, the other software package maintained by artefactual systems. 
so i had mistakenly assumed that the development model for islandora, and the approach of discovery garden, would be similar. indeed, we incorporated language in the development contract with artefactual to allow both companies to publish the code under their respective open source licenses. (the university's contract was with artefactual, which subcontracted with discovery garden for the work on the islandora module.) in contrast to archivematica, islandora's code is owned by the islandora foundation. during the process of arranging for the necessary approvals to transfer the archidora module to the foundation, we discovered one stumbling block: the lack of a volunteer component manager to shepherd its incorporation into the islandora code base [2]. since the development had been contracted out, we were not in a position to identify an in-house component manager. islandora guru mark jordan (simon fraser university) kindly volunteered for that role. however, a more important barrier was that the development of the module had not initially been done with general release in mind. the islandora foundation has accepted the module, but it is currently being hosted at islandora labs, and will require further development and testing to be considered for the core release (archidora module [updated 2015]). until this module is deployed by a broader sector of the islandora community and is ready for incorporation into the general release, it will essentially be a customization that the university of saskatchewan needs to maintain. future development opportunities there are a number of possible improvements to the archivematica/islandora integration. i will discuss just a few here. artefactual systems' lead archivematica developer mentioned several other possibilities in his open repositories 2015 talk (simpson, 2015). a two-way street while this is not a use case currently important to the university of saskatchewan, there has been a lot of interest expressed in integrating archivematica and islandora in the opposite direction: that is, ingesting objects first into archivematica, which would generate the packages required for either automated or manual upload to islandora – similar to the integrations for atom and contentdm, described above. this is of particular interest for institutions that use islandora as the access system for both digitized and born-digital material. mark jordan's presentation at islandora camp in 2012 described both use cases, and sketched out preliminary development strategies (jordan, 2012). a few institutions have reported on local customizations and experimentation. for example, michigan state university (msu) libraries transformed the mets file from archivematica into a version for ingest into fedora (collie and mak, 2013). a poster presentation, also by msu (collie, higgins, mak and nicholson, 2014), further makes the case for islandora/archivematica integration by highlighting the elements of the ndsa levels of preservation (national digital stewardship alliance, 2013) that each system addresses. more recently, a couple of threads in the islandora google group have captured interest in working on the archivematica-to-islandora integration and reported on possible development approaches. as outlined by mark jordan, "islandora is ready for this, via the islandora rest module. the work required to have archivematica produce dips for public access in islandora needs to be done on the archivematica side" [3].
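to make the shape of that handoff a little more concrete, the sketch below shows roughly what a push from archivematica into islandora over the rest module could look like, written in r with the httr package purely for illustration; the endpoint paths, request fields, host, credentials, and file name are assumptions based on the islandora rest module's general pattern, not code from either project.

library(httr)
base <- "https://islandora.example.edu/islandora/rest/v1"   # placeholder host
auth <- authenticate("archivematica", "secret")             # placeholder credentials
# create a new object to hold the dip contents (endpoint path is an assumption)
obj <- POST(paste0(base, "/object"), auth,
            body = list(namespace = "islandora", label = "sample dip object"),
            encode = "form")
pid <- content(obj)$pid   # assumes the module returns the new pid as json
# attach an access derivative as the obj datastream (again, an assumption)
POST(paste0(base, "/object/", pid, "/datastream"), auth,
     body = list(dsid = "OBJ", file = upload_file("dip/objects/sample.jp2")),
     encode = "multipart")

as jordan notes, in practice this logic would need to live on the archivematica side, for example as a step that runs after dip generation.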
the zuse institute berlin may have taken the archivematica to islandora integration the furthest, reporting at ipres 2015 on an implementation involving archivematica, fedora/islandora, and irods (klindt and amrhein, 2015). this code does not appear to be publicly available at this time. there is also potential for integration between archivematica and fedora, rather than islandora per se. indeed, the current integration is primarily between the archivematica storage service and the fedora repository, even though this is achieved through an islandora module. storage of aips in a fedora repository is one area of interest; this has been achieved, for example, at the universities of york and hull as part of an ongoing research data project (mitcham et al, 2016). integration of islandora metadata the mods files are saved in the aip as part of submissiondocumentation; and at least for images, mods and dc metadata become part of the archivematica mets file via exif tool output. but this descriptive metadata is not fully searchable, and the dmdsec section of the archivematica mets file is not populated, as it would be for imported metadata (archivematica documentation: import metadata). this kind of integration was not a priority for our initial development; we assume that most searching will take place in islandora, with the islandora pid sufficient to pull the master object from archivematica. however, adding more integrated metadata would introduce possibilities for richer integration with other systems, for example for dissemination information packages (dips) generated for atom or other access systems. there is also potential to take greater advantage of islandora’s available digital preservation functionality. the islandora premis module provides the capability to generate xml and html representations of premis data in islandora, currently including fixity checks, agent information, and rights metadata from the descriptive record [4]. this module was not on our radar during the initial development of archidora, and we have not currently implemented it. clearly, however, there would be advantages to including islandora’s premis data in the packages generated by archivematica. at least part of the premis output from islandora – fixity information – is currently included in the fedora mets file. development would likely be needed to configure how this data is included in the archivematica mets file, or to pull in the full premis xml file from islandora. re-ingesting objects/aips currently, the archivematica process is only triggered for new – not updated – objects in islandora. arguably a new aip should be created if digital objects are replaced; decisions about what extent of changes to metadata warrant a new aip are more challenging [5]. a manual process is available, through the “add to archivematica” button, but this is obviously subject to user error. another potential downside to this manual process is that this creates an aip with just that object, rather than the larger aips generated as part of normal workflow; and this option is not available at the book or compound object level. for larger sets of objects needing re-ingest, the drush script would be another option, but this also currently needs to be run manually. 
conclusion the archivematica/islandora integration adds to a growing set of integrations between archivematica and other digital preservation, access and repository software packages, including atom, contentdm, dspace, archivesspace (eckard, pillen and shallcross, 2017), and lockss. responding to a suggestion that archivematica and islandora might be competing for the same users, then-artefactual president peter van garderen tweeted, "we don't compete, we integrate" (van garderen, 2012). an artefactual analyst elaborated on this philosophy in her talk at open repositories 2015, focusing on endpoints and handoffs from source systems (mumma, 2015). while islandora has some digital preservation functionality, our preference is to use a system with digital preservation as a specialization, and take advantage of islandora's strengths as an access and digital asset management system. since islandora is in active use by multiple users, archivematica also provides a better option for reliable preservation of the master objects. ultimately, as described in the background section, other systems and sources will feed into archivematica, so that we are not managing preservation in multiple systems. we are always glad to hear about interest in adopting and developing archidora [6]. over time, we are hopeful that wider adoption will result in archidora (in both archivematica and islandora) achieving status as community-owned and maintained software, integrating archivematica and islandora with workflows in both directions. references archidora module, islandora labs [internet]. [updated (last commit) 2015 march 2]. github [cited 2017 october 23]. available from: https://github.com/islandora-labs/archidora archivematica documentation: import metadata [internet]. artefactual systems: archivematica documentation, version 1.6 [cited 2017 october 19]. available from: https://www.archivematica.org/en/docs/archivematica-1.6/user-manual/transfer/import-metadata/ artefactual systems – services – development [internet]. artefactual systems website [cited 2017 october 23]. available from: https://www.artefactual.com/services/development/ collie a, higgins d, mak l, nicholson s. 2014. furthering the community: integrating archivematica and islandora [internet]. library information and technology association (poster presentation), january 2014. [cited 2017 october 18]. available from: https://figshare.com/articles/furthering_the_community_integrating_archivematica_and_islandora/899883 collie a, mak l. 2013. incompatible or interoperable? a mets bridge for a small gap between two digital preservation software packages [internet]. alcts metadata interest group, ala midwinter meeting; seattle, washington: 2013 january 27. [cited 2017 october 18]. available from: http://connect.ala.org/node/199172 eckard m, pillen d, shallcross m. 2017. bridging technologies to efficiently arrange and describe digital archives: the bentley historical library's archivesspace-archivematica-dspace workflow integration project. code4lib journal [internet]; issue 35 (2017 january 30). [cited 2017 october 25]. available from: http://journal.code4lib.org/articles/12105 jordan m. 2012. integrating islandora and archivematica [internet]. charlottetown, canada: islandora camp, 2012 august 2. [cited 2017 october 18]. available from: http://summit.sfu.ca/item/10873 klindt m, amrhein k. 2015. one core preservation system for all your data – no exceptions!
proceedings of the 12th international conference on digital preservation, chapel hill, north carolina, 2015 november 2-6 [internet]. [cited 2017 october 25]. available from: https://phaidra.univie.ac.at/detail_object/o:429551 mitcham j, awre c, allinson j, green r, wilson s. 2016. filling the digital preservation gap: a jisc research data spring project; phase three report [internet], 2016 october. [cited 2017 october 25]. available from: https://dx.doi.org/10.6084/m9.figshare.4040787 mumma c. 2015. archivematica: handshaking towards comprehensive digital preservation workflows. or2015: 10th international conference on open repositories, indianapolis, indiana, 2015 june 9 [internet]; [cited 2017 october 25]. available from: http://program.or2015.net/mumma-archivematica_integration-174_a.pdf national digital stewardship alliance, ndsa levels of preservation, version 1 (2013?) [internet]. [cited 2017 october 18]. available from: http://www.digitalpreservation.gov:8081/ndsa/activities/levels.html simpson j. 2015. archidora: leveraging archivematica preservation services with an islandora front-end [internet]. or2015: 10th international conference on open repositories, indianapolis, indiana, 2015 june 9. [cited 2017 october 23]. available from: http://program.or2015.net/simpson-archidora-229.pdf sword api [internet]. [updated 2017 march 23]. artefactual systems: archivematica wiki [cited 2017 october 25]. available from: https://wiki.archivematica.org/sword_api van garderen p. 2012 april 25 [internet]. twitter [cited 2017 october 25]. available from: https://twitter.com/pjvangarderen/status/195206083806113792 notes [1] from improvements/islandora [internet]. [updated 2016 march 17]. artefactual systems: archivematica wiki [cited 2017 october 20]. available from: https://wiki.archivematica.org/improvements/islandora. further technical documentation is also available on this page. a key difference between the schematic and the functionality as it now exists is that the post-store call back prompts archivematica to list the object(s) in the aip as ready to be deleted, but this is not done automatically; rather, the objects can be selected for individual or bulk deletion in the manage tab of the relevant collection/book/compound object. [2] the islandora foundation licensed software acceptance procedure [cited 2017 october 23; available from: http://islandora.ca/developers/lsap) appears to have been fleshed out since our contribution of archidora. an archived version of the page dated march 2015 (https://web.archive.org/web/20150316194238/http://islandora.ca/developers/lsap) does not include a reference to the component manager requirement. that element had been added by september 2015, with a more thorough revision as recently as march 2017. [3] islandora and archivematica [internet]. islandora users group, 2017 august 16 [cited 2017 october 23]. available from: https://groups.google.com/d/msg/islandora/5qxsz3vwbvw/ognos7lkagaj. see also aip, dip, sip generation [internet], islandora users group, 2017 august 30 [cited 2017 october 23]. available from: https://groups.google.com/d/topic/islandora/w7uee1wb0v4/discussion [4] islandora premis [internet]. github [cited 2017 october 19]. available from: https://github.com/islandora/islandora_premis. for more details and discussion of future directions, see jordan m., mclellan e., premis in open-source software: islandora and archivematica. in: dappert a., guenther r., peyrard s. (eds) digital preservation metadata for practitioners [internet]. 
springer, cham, 2016 [cited: 2017 october 18]. available from: https://doi.org/10.1007/978-3-319-43763-7_16 [5] a presentation at access 2017 outlined some of these issues, notably the fact that objects can be saved multiple times as part of a single ingest. see weiwei shi, shane murnaghan, and matt barnett, the way leads to pushmi-pullyu, a lightweight approach to managing content flow for repository preservation at uofa libraries [internet], saskatoon, canada: access 2017, 2017 september 28, [cited 2017 october 19]. available from: https://drive.google.com/file/d/0b8b5fxybn_3_tws0um1nmln1vu1iuzrvcuduqtqxeupyvhbr/view [6] beyond some informal inquiries, we are aware of archidora's use in testing workflows for research data. see research data canada, rdc federated pilot for data ingest and preservation, 2015 january 9 [internet]. [cited 2017 october 25]. available from: https://www.rdc-drc.ca/the-rdc-federated-pilot-for-data-ingest-and-preservation/ about the author tim hutchinson has been an archivist at the university of saskatchewan since 1997. he was appointed university archivist in 2004 and head of university archives & special collections in 2013 (he is currently on sabbatical), and has been active in a range of activities and developments in the areas of digital archives and preservation, digitization, archival descriptive standards, and technology for archives more generally.

the code4lib journal – a practical method for searching scholarly papers in the general index without a high-performance computer code4lib issue 58, 2023-12-04 a practical method for searching scholarly papers in the general index without a high-performance computer the general index is a free database that offers unprecedented access to keywords and ngrams derived from the full text of over 107 million scholarly articles. its simplest use is looking up articles that contain a term of interest, but the data set is large enough for text mining and corpus linguistics. despite being positioned as a public utility, there is no user interface; one must download, query, and extract results from raw data tables. not only is computing skill a barrier to use, but the file sizes are too large for most desktop computers to handle. this article will show a practical way to use the gi for researchers with moderate skills and resources. it will walk through building a bibliography of articles and visualizing the yearly prevalence of a topic in the general index, using simple r programming commands and a modestly equipped desktop computer (code is available at https://osf.io/s39n7/). it will briefly discuss what else can be done (and how) with more powerful computational resources. by emily cukier introduction the general index (gi) is a massive corpus of text data derived from published articles; its stated purpose is to open scholarly literature to public inquiry.
the gi contains tables of keywords [1], ngrams, and associated metadata extracted from a corpus of over 107 million journal articles drawn from general and specialized journals [2]. this enables full-text searching of a wide swath of academic literature in a single shot, without limitation by vendor platform or paywalls. the gi is the output of a mass journal article accumulation project spearheaded by open information advocate carl malamud and conducted at jawaharlal nehru university (jnu) in new delhi [3]. public resource, a nonprofit founded by malamud, released the first public version of the general index in october 2021 [4]. as of this article's publication, it contained articles published as recently as 2020. the gi could be a powerful aid to librarians and digital humanists for tracking concepts through scholarly literature. one clear application is in literature review and systematic searching: one can query search terms simultaneously across more articles and disciplines than are contained in even the largest vendor discovery platforms, and without needing permission to access the full text [5]. with minimal manipulation one can link term instances to publication years, enabling diachronic analysis of word usage over time. this allows for visualization of trends in journal articles, much like google ngram viewer does for books. but unlike google, the gi links keyword and ngram data to the document of origin, which enables identification and retrieval of the works. a substantial obstacle to using the gi is the lack of an exploratory interface. the user must download the raw data tables, then devise methods for querying the records. even though the tables are split into slices, reading them requires more disk space and memory than is found in typical desktop computers. with some deliberate programming choices and tricks, it is still possible to use portions of the gi in a limited computing environment. this paper will present one workflow for querying gi keywords in a desktop computing environment using the r programming language. it will show how to create a bibliography of literature on a topic of interest and plot its yearly prevalence in gi articles. it will also discuss adapting the method to a high-performance computing environment, which enables analysis of gi ngrams. contents the general index comprises three data tables of keywords, ngrams, and metadata, parceled into sixteen slices apiece and formatted as .sql files. slices are compressed to facilitate downloads. the compressed files take up a total of 4.7 terabytes of space and expand to a total of 37.9 terabytes.

table 1. general index compressed and expanded file sizes

                                                  compressed slices   compressed total   unzipped slices   unzipped total
metadata – single file enhanced (version 10.21)   n/a                 24.2 gb            4.2-4.4 gb        70 gb
keywords (version 10.18)                          21-23 gb            355 gb             95-102 gb         1.6 tb
ngrams (version 10.18)                            262-288 gb          4.5 tb             2.1-2.3 tb        36 tb

records are allocated to slices according to the first digit of a field labeled "dkey". the dkey is a unique identifier for each article consisting of 40 hexadecimal digits, determined by applying a hash algorithm to the file that contains it. this common field enables linking keyword and ngram records to their corresponding metadata (table 2). file names are standardized to include the data type ("keywords", "ngrams", "info" or "meta") and the dkey's first digit – e.g., the ngram slice for dkeys beginning with "b" is named "doc_ngrams_b.sql".
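as a small illustration of this naming scheme – the helper function and sample dkey below are ours, not part of the gi distribution – the slice that should hold an article's records can be derived from the first hexadecimal digit of its dkey:

slice_for_dkey <- function(dkey, type = "keywords") {
  # slices are keyed 0-9 and a-f; "type" is "keywords", "ngrams", "info" or "meta"
  first_digit <- tolower(substr(dkey, 1, 1))
  paste0("doc_", type, "_", first_digit, ".sql")
}

slice_for_dkey("da39a3ee5e6b4b0d3255bfef95601890afd80709", type = "ngrams")
#> [1] "doc_ngrams_d.sql"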
ngrams (1-5 grams) were determined using spacy, a cython-based natural language processing tool. keywords were determined using the python package yake (yet another keyword extractor) [6]. yake ranks the importance of words within a document according to text features and statistics like word frequency, position in the document, relation to context, and capitalization. the keyword and ngram tables each contain fields for the keyword or ngram found, its lower-case equivalent, and the number of tokens (words) it contains. each also has a field that describes the term's representation within the article: the yake score for keywords (smaller values correspond to greater importance), and term frequency (number of occurrences in the document) for ngrams. the metadata table contains metadata both derived from outside the file (document metadata) and extracted from the full text. of particular interest for citation tracking and analysis are fields for doi, isbn, journal title, article title, publication date, and authors. a partial list of fields and their descriptions can be found in the gi readme file.

table 2. selected general index metadata fields

keywords         ngrams          metadata
dkey             dkey            dkey
keywords         ngram           doi
keywords_lc      ngram_lc        isbn
keyword_tokens   ngram_tokens    journal
keyword_score    ngram_count     title
                 term_freq       pub_date
                                 author

the corpus is heavily weighted towards recent publications, with the bulk of the articles with determinable publication years (39%; 45,844,188 of the total) published in 2017-2019. five percent of articles did not have readily extractable publication years. 1,544 (0.001%) had publication years falling outside the feasible range of 1600-2020. figure 1. number of articles per year. figure 2. number of articles per year, logarithmic scale. the log scale makes several features of the data visually apparent: the absolute numbers during lower-volume publication years, the onset and duration of an exponential growth phase, and the publication burst in the most recent few years. the earliest publications in the gi (with realistic publication years) were produced in the 1660s. from then forward, the gi contains on the order of 100 articles per year until 1880, after which article volume shows steady exponential growth through 2016. article volume jumps by orders of magnitude in 2017. low article volume in 2020 suggests that article accumulation ended in that year. local workflow: literature indexing the following diagram demonstrates the steps for compiling bibliographic information and determining the yearly prevalence of a term in the gi. figure 3. gi workflow diagram. functions and packages are shown in italics. i performed these steps on an hp computer running windows 10 with a 3.4 ghz intel i3 cpu, 8 gb ram, a 256 gb internal hard drive, and an external 1 tb hard drive. scripts were written in r and run in rstudio. part i: file preparation download & unzip general index files are available for download at https://archive.org/details/generalindex. downloading each keyword file and the single enhanced metadata file took roughly 6 hours apiece at a download speed of approximately 1 mb/sec. the zipped gi files require decompression before manipulation. to unzip the combined metadata file, i used the free utility 7-zip (available at https://www.7-zip.org) twice: once to unpack the file from .gz to .tar, and again to yield a set of 16 .sql files. to prepare the keyword files, i created an rstudio script that would individually decompress each file, search for terms of interest, retain the matches, and remove the expanded file before decompressing the next. this limited the amount of disk space needed to store keyword files to 457 gb at any given time (355 gb for all zipped slices + 102 gb for the largest uncompressed file). if disk space is truly at a premium, it would be possible to download a single keyword file at a time, unzip it, extract records of interest, and delete the file before moving on to the next.
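if you would rather stay inside r for the decompression than use a separate utility, base r's untar() can unpack the gzip and tar layers of the combined metadata file in a single call; a minimal sketch, with the archive name and destination directory assumed for illustration:

# extract the 16 .sql metadata slices from the combined archive
# (the file and directory names below are assumptions)
untar("doc_info.2022.12.update.tar.gz", exdir = "d:/gi_files/metadata")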
part ii: compiling a bibliography term search drosera is a genus of carnivorous plants (commonly known as sundews) found on six continents [7]. to demonstrate literature indexing, i chose to search the gi keyword files for mentions of sundew or drosera. the smallest decompressed keyword file was about 95 gb, which was too large to read into rstudio. however, the r function fread() (part of the data.table package) can presearch a structured file and read in only the lines that match a particular pattern. this is done using the "cmd" argument to execute an operating system command on the file before importing the result into the rstudio environment. in this case, i used the pattern-searching command grep [8] to select lines that contained either 'drosera' or 'sundew'; r could readily handle this data subset. to reduce the exclusion of valid hits, i used a case-insensitive search and placed no restrictions on characters before or after each search string.

# -------creating a bibliography for search terms-----------#
## ---------unzip each file and read in keyword hits -------##
library(data.table)   # fread(), rbindlist()
library(dplyr)        # %>%, filter(), distinct()
keyword_path = "d:/gi_files/keywords"
keyword_files = dir(keyword_path, pattern = "*.sql.zip", full.names = TRUE)
keyword_headers = c("dkey", "keywords", "keywords_lc", "keyword_tokens",
                    "keyword_score", "doc_count", "insert_date")
# reconstructed loop: unzip each slice, grep for the search terms, delete the
# expanded file, and keep only the matching lines (as described in the text)
keyword_tbl = rbindlist(lapply(keyword_files, function(zipfile) {
  sqlfile = unzip(zipfile, exdir = keyword_path)
  hits = fread(cmd = paste("grep -i -E 'drosera|sundew'", shQuote(sqlfile)),
               col.names = keyword_headers)
  file.remove(sqlfile)
  hits
})) %>%
  filter(!keywords %in% "sundewall") %>%
  distinct(dkey)

the algorithm took roughly 50 minutes to process each slice (30 for unzipping and 20 for term matching), or less than 14 hours in total. it retrieved 9,675 matching keyword records across the slices, which required less than 1 mb of disk storage to save as a .tsv file. at this stage it is prudent to manually inspect a sample of records to ensure the query behaves as intended. after looking at the results in openrefine [9], i added lines of code to filter out the surname "sundewall", a spurious hit caught by my original query. i also included a line to remove duplicate instances in the dkey field. duplicate article matches can occur even when searching for a single word, as gi keyword records may include the same term on its own and as part of one or more phrases. after these filtering steps, 4,565 unique articles remained. match metadata the next step in turning this occurrence list into a useful index is to load in the gi metadata corresponding to the identified articles, using the common "dkey" field. i was unable to load the >4 gb metadata slice files into rstudio due to memory limitations. instead, i made similar use of fread() and grep as with the keyword files, here reading in only the metadata records with dkey identifiers that matched the keyword hits. since the search for each individual dkey occurred serially, this took considerable time – about 36 hours for the 4,565 unique sundew dkeys. metadata were found for 4,336 (95.0%) of them.
## ------read in matching article metadata ---------------##
metadata_path = "d:/gi_files/metadata/doc_info.2022.12.update/doc_meta/doc_meta/"
meta_headers = c("dkey", "raw_id", "meta_key", "doc_doi", "meta_doi", "doi",
                 "doi_flag", "doi_status_detail", "doi_status", "isbn",
                 "journal", "doc_title", "meta_title", "title", "doc_pub_date",
                 "meta_pub_date", "pub_date", "doc_author", "meta_author",
                 "author", "doc_size", "insert_date", "multirow_flag")
term_metadata