The Code4Lib Journal – Editorial: Journal Updates and a Call for Editors

Issue 55, 2023-01-20

Journal updates, recent policies, and a call for editors.

This is my second and last time as coordinating editor for the Code4Lib Journal. After serving on the Editorial Committee for seven years, I am rotating off the committee to focus on other research projects. Code4Lib has played a big part in my career. In 2012, I published my first article in the journal. After attending my first Code4Lib conference at North Carolina State University in 2014, funded by a Code4Lib diversity scholarship, I wanted to get more involved with this wonderful and supportive community. Since then, I have served as co-convener of the local New York City chapter of Code4Lib, presented at two national pre-conferences, and sat on several Code4Lib national conference committees. Of all these Code4Lib-related activities, working with the Editorial Committee (EC) has been the most rewarding. I have learned a great deal from my fellow committee members, on everything from copy editing, the article review process, and communicating and collaborating with authors to, most especially, managing a journal, and for that I am immensely grateful.

I would like to share two recent developments with the journal: a guest editor policy and a retraction policy. The EC has implemented a guest editor policy because, although editorial members have a wide skill set reflective of library coders and technologists, some of the articles we review are beyond our expertise. In those situations, we feel it necessary to consult experts outside the EC, and the policy makes clear to the author, the guest editor, and readers what the guest editor's role in the review process is. A retraction policy has also been implemented.
The retraction policy was developed so the EC can withdraw articles that violate ethical standards or may be unreliable. Retractions are not to be taken lightly, and as such, the journal will inform readers why an article was retracted. This becomes another part of an article's post-publication lifecycle.

Since there is now an opening on the Editorial Committee of the Code4Lib Journal, please respond to this call for editors. If you are interested in reading and learning about library information technology, as well as being part of a great team of editors, this is an excellent opportunity. Applicants from diverse communities are highly encouraged to apply.

I believe that every issue of the Code4Lib Journal has practical applications for almost any library, archive, museum, or related space, and this issue is no exception. This issue includes:

A Fast and Full-Text Search Engine for Educational Lecture Archives outlines the development of a search engine for educational videos using Python in India.

Click Tracking with Google Tag Manager for the Primo Discovery Service explores how to track open access content through Unpaywall links.

Creating a Custom Queueing System for a Makerspace Using Web Technologies is a case study on streamlining the queue process of a makerspace.

Data Preparation for Fairseq and Machine-Learning Using a Neural Network details sequence-to-sequence models and how, with appropriately formatted datasets, they can be applied to a variety of tasks.

Designing Digital Discovery and Access Systems for Archival Description compares archival and bibliographic description and the challenges of using discovery systems for born-digital materials.

DRYing Our Library's LibGuides-Based Webpage by Introducing Vue.js investigates how to streamline redundant HTML in the popular LibGuides web content management system.
Revamping Metadata Maker for "Linked Data Editor": Thinking Out Loud looks at using and evaluating the catalog record creation tool with linked data sources.

Using Python Scripts to Compare Records from Vendors with Those from ILS examines the use of Python to identify and synchronize out-of-sync vendor and ILS catalog records.

ISSN 1940-5758

This work is licensed under a Creative Commons Attribution 3.0 United States License.

The Code4Lib Journal – Using Python Scripts to Compare Records from Vendors with Those from ILS

An increasing challenge libraries face is how to maintain and synchronize the electronic resource records from vendors with those in the integrated library system (ILS). Ideally, vendors send record updates to the library frequently. However, this is not a perfect solution, and over time record discrepancies can become severe, with thousands of records out of sync. This is what happened when our acquisitions librarian and our cataloging librarian noticed a large record discrepancy. To effectively identify the problematic records among tens of thousands of records on both sides, the author developed solutions to analyze the data using Python functions and scripts. This data analysis helped to quickly scale down the issue and reduce the cataloging effort.
By Dan Lou

The issue of record discrepancies

At a certain point our acquisitions librarian noticed there were large discrepancies between the records from vendors (Axis 360 and OverDrive) and those in our Sierra system. The vendors usually send us regular record updates, but over time it still became a challenge to keep the information perfectly synchronized. To resolve these discrepancies efficiently and quickly, I was asked to write a Python script to compare record details and identify the exact records we needed to modify in our catalog.

Comparing records with ISBN

As the first step, I tried to identify the discrepancies by comparing ISBN numbers. I chose the Python library Pandas (Pandas 2018 [1]) as the data analysis tool for this task. Pandas is an open source Python package widely used for data science and machine learning tasks. It is built on top of another Python package, NumPy, which provides support for multi-dimensional arrays, and it works well with other Python packages such as Matplotlib, which is widely used for data visualization. Pandas simplifies many of the time-consuming, repetitive tasks associated with working with data, including loading and converting data, data scrubbing, normalization, filling, merging and splitting, visualization, and inspection. You can install Pandas and all the relevant packages on your local machine, but another convenient option is to start coding in Google Colab (Google 2019 [2]). Google Colab is a cloud-based Python Jupyter notebook service with a free tier, and Pandas and Matplotlib come pre-installed. To enable Pandas in a Jupyter notebook in Google Colab, we simply import the package and give it a short alias:

    import pandas as pd

We exported the relevant MARC records from Sierra to a comma-separated values (CSV) file and loaded it into the Jupyter notebook as a Pandas DataFrame object.
For example, this is how we read a CSV file of the Axis 360 audiobook records exported from Sierra:

    sierra_axis_audio = pd.read_csv(
        "/content/drive/My Drive/library/axis_overdrive/sierra_axis360_audio.csv")

This file typically contains columns such as title, author, and ISBN entries. A record can sometimes contain more than two ISBN numbers, and those are stored in the "Unnamed" columns towards the end for now:

    sierra_axis_audio.columns
    Index(['title', 'author', 'isbn1', 'isbn2', 'Unnamed: 4', 'Unnamed: 5',
           'Unnamed: 6'], dtype='object')

We also needed to clean up the ISBN numbers in order to compare them with those from the vendor. A typical MARC 020 field can contain many subfields besides subfield a for the ISBN number, so in our exported file the ISBN field usually contains more information than we need for the comparison, for example "9780804148757 (electronic audio bk.)" or "9780062460110 : $62.99". I wrote a Python function to extract the bare ISBN number from each of the columns after the "author" column and compile all the ISBN numbers into a new DataFrame object (one Sierra record can contain multiple ISBN numbers). In this way, I was able to generate a complete list of all ISBN numbers from the MARC records exported from Sierra:

    def extract_isbns(df):
        # Gather the ISBNs from every column after 'author' into one DataFrame.
        isbns = pd.DataFrame()
        for i in range(2, len(df.columns)):
            new_isbns = df.iloc[:, i].astype(str)
            # Keep only the leading token, dropping trailing notes and prices.
            new_isbns = new_isbns.str.split(' ', n=1).str[0].str.strip()
            isbns = pd.concat([isbns, new_isbns])
        # Remember each ISBN's original row so records can be matched up later.
        isbns = isbns.reset_index()
        isbns = isbns[isbns[0] != 'nan']
        return isbns

In comparison, the vendor's CSV file is simple and straightforward. It contains only one ISBN field, holding a clean ISBN number, along with some other fields such as title, author, publication date, date added, format, total quantity, and total checkouts. All that is left to do is to import it as a Pandas DataFrame object.
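The cleanup rule above amounts to keeping the first whitespace-delimited token of each exported 020 value. A minimal pure-Python sketch of that rule on the example values from the article (the function name is mine, not from the script):

```python
def clean_isbn(field):
    # Keep only the leading token of an exported MARC 020 value,
    # dropping the qualifiers and prices that follow the ISBN itself.
    return str(field).split(' ', 1)[0].strip()

print(clean_isbn("9780804148757 (electronic audio bk.)"))  # 9780804148757
print(clean_isbn("9780062460110 : $62.99"))                # 9780062460110
```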
Next, I wrote a Python function to compare the ISBN numbers in the cleaned-up data from both sources. I created a new column, "exist," in both DataFrame objects. If an ISBN number from a vendor record is found among the Sierra ISBNs, the "exist" column on the vendor record is set to 1; if not, it is 0. The same applies in the other direction: if an ISBN from a Sierra record matches a vendor record, the "exist" column on the Sierra record is set to 1, otherwise 0. The function also returns two additional DataFrame objects that contain only the discrepant records.

    def find_mismatch(vendor_df, sierra_df, sierra_isbns_df):
        # Flag vendor records whose ISBN appears among the Sierra ISBNs.
        vendor_df["exist"] = vendor_df["isbn"].astype(str).isin(sierra_isbns_df[0])
        not_in_sierra = vendor_df[(vendor_df["exist"] == False) |
                                  vendor_df["exist"].isna()]
        # Flag Sierra ISBNs that appear among the vendor ISBNs.
        sierra_isbns_df["exist"] = sierra_isbns_df[0].astype(str).isin(
            vendor_df["isbn"].astype(str))
        # A Sierra record counts as matched if any one of its ISBNs matched.
        exists = sierra_isbns_df[sierra_isbns_df['exist'] == 1].drop_duplicates("index")
        not_exists = sierra_isbns_df[sierra_isbns_df['exist'] == 0].drop_duplicates("index")
        not_exists = not_exists[~not_exists["index"].isin(exists['index'])]
        isbns = pd.concat([exists, not_exists])
        sierra_df = pd.merge(sierra_df, isbns, how="left",
                             left_index=True, right_on="index")
        not_in_vendor = sierra_df[(sierra_df["exist"] == False) |
                                  sierra_df["exist"].isna()]
        return vendor_df, sierra_df, not_in_vendor, not_in_sierra

After running this script against our records, I did some data analysis to see how many conflicting records were resolved by comparing ISBN numbers, and generated one list of MARC records that needed to be added to the ILS and another list of records that needed to be removed. As shown in the following chart, the majority of our Axis 360 records were in good shape, but on the OverDrive side about a third of the records were problematic.
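Stripped of the pandas bookkeeping, the two-way "exist" flagging reduces to set membership checked in each direction. A toy illustration with invented ISBNs, not the library's real data:

```python
vendor_isbns = {"9780804148757", "9780062460110", "9781101885680"}
sierra_isbns = {"9780804148757", "9780316769488"}

# Vendor records with no Sierra match: candidates to add to the ILS.
not_in_sierra = sorted(vendor_isbns - sierra_isbns)
# Sierra records with no vendor match: candidates to remove or review.
not_in_vendor = sorted(sierra_isbns - vendor_isbns)

print(not_in_sierra)  # ['9780062460110', '9781101885680']
print(not_in_vendor)  # ['9780316769488']
```

The DataFrame version does the same membership test with `Series.isin`, while also carrying each ISBN's original row index so multi-ISBN Sierra records collapse back to one record.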
In total, we still had over 4,600 records in Sierra that were troublesome.

Figure I. Vendor vs. Sierra: total mismatches by ISBN

Comparing records with title and author fields

To further scale down the issue, I moved on to align the remaining mismatched records by comparing the similarity of their author and title fields. (The list of mismatched records is one of the outputs of the previous ISBN-matching step.) First, we needed to clean up and extract the title and author fields, so I wrote a function to concatenate the title and author fields for each record from both sources. The following function transforms all text to lower case, removes extra punctuation, extracts the relevant information, and concatenates the data into a new column, "titleauthor," for all the records:

    def extract_title_author(vendor_df, sierra_df, vendor_title_field,
                             vendor_author_field, sierra_title_field="title",
                             sierra_author_field="author"):
        # remove_punctuations is a small helper (defined elsewhere) that
        # strips punctuation from a string.
        vendor_df['titleauthor'] = (
            vendor_df[vendor_title_field].astype(str).str.lower()
            .apply(remove_punctuations).astype(str) + ' ' +
            vendor_df[vendor_author_field].astype(str).str.lower()
            .apply(remove_punctuations).astype(str))
        sierra_df['titleauthor'] = (
            sierra_df[sierra_title_field].str.lower()
            .str.extract(r"([^[]*\w*)").astype(str)
            .apply(remove_punctuations).astype(str) + ' ' +
            sierra_df[sierra_author_field].str.lower().astype(str)
            .apply(remove_punctuations).astype(str).str.extract(r"([ a-z]*)").astype(str))

Then I used the Python module difflib to compare the similarity of the "titleauthor" columns from the two sources. The difflib module provides classes and functions for comparing sequences.
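To make the 0-to-1 similarity scale concrete, here is `SequenceMatcher` applied to a few invented "titleauthor" strings: identical strings score 1.0, the same title and author words in a different order still score above difflib's default 0.6 cutoff, and unrelated records score well below it.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio on a 0-to-1 scale, as used to rank candidate record pairs.
    return SequenceMatcher(None, a, b).ratio()

same = similarity("becoming michelle obama", "becoming michelle obama")
close = similarity("becoming michelle obama", "becoming obama michelle")
far = similarity("becoming michelle obama", "war and peace tolstoy")
```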
Here is the function to compare the two sources and get the similarity level:

    from difflib import SequenceMatcher

    def get_close_matches_indexes(word, possibilities, n=3, cutoff=0.6):
        # A variant of difflib.get_close_matches that returns the indexes
        # of the close matches instead of the matches themselves.
        if not n > 0:
            raise ValueError("n must be > 0: %r" % (n,))
        if not 0.0 <= cutoff <= 1.0:
            raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
        result = []
        s = SequenceMatcher()
        s.set_seq2(word)
        for idx, x in enumerate(possibilities):
            s.set_seq1(x)
            if s.real_quick_ratio() >= cutoff and \
               s.quick_ratio() >= cutoff and \
               s.ratio() >= cutoff:
                result.append((s.ratio(), idx))
        # Keep the n best matches, most similar first.
        result.sort(reverse=True)
        return [idx for score, idx in result[:n]]

After comparing the similarity of the title and author fields, I was able to generate new spreadsheet files for all of the mismatched records. Each row in the spreadsheet contains a Sierra record together with the most similar vendor record by title and author. All rows are sorted by similarity level on a scale of 0 to 1: a similarity level of 0 means the two records in the row are very different, and 1 means they are nearly identical. Again, I made some charts to see how much this would help to scale down the issue further.

Figure II. Similarity of title and author for mismatched Axis 360 audiobook records

Figure III. Similarity of title and author for mismatched Axis 360 ebook records

Figure IV. Similarity of title and author for mismatched OverDrive audiobook records

Figure V. Similarity of title and author for mismatched OverDrive ebook records

As the charts show, if we take a similarity level of 0.7 as the benchmark, roughly two-thirds of the mismatched records required more cataloging effort. With the Python scripts we successfully narrowed a problem of 30,843 records down to 3,394 within a few days. Only those 3,394 records required further examination by our cataloger, which greatly reduced the cataloging effort needed to fix the record discrepancies.

Figure VI.
Comparing 30,843 records from Sierra and vendors (OverDrive and Axis 360)

Conclusion

It has become an increasing challenge to keep the MARC records of online resources in sync between the library's ILS and the vendors' provided records, due to the frequent changes taking place. The solution described in this article is a good remedy for identifying and fixing record discrepancies accumulated over time. I highly recommend libraries adopt this method and make it a routine task, to prevent the issue from snowballing. It would be better still if we could insert into the vendor-provided MARC records a customized field identifying the valid date range; we could then develop a script that automatically renews or modifies records close to or past their due date by retrieving them from the ILS via record export or via an API. Many vendors tend not to return a 404 error page when a library loses access to an online resource such as an ebook, so such a script would need to be versatile enough to detect links that have stopped working properly.

On the other hand, a big change has been taking place in recent years as the record system for online resources slowly moves out of the library's ILS. For example, the library online service provider BiblioCommons has implemented BiblioCloudRecords, which allows libraries to add certain online resource records to the library's catalog website via API without adding the MARC records to the ILS. While this improves resource access for library customers, it means patron data and usage statistics have inevitably shifted from libraries to the vendors. Online resources have had a stronger presence in library collections since the pandemic and will become more and more influential in the foreseeable future. It is a good question to ask now how libraries can better position ourselves in this new online resources landscape.

References

[1] Pandas. 2018. Python Data Analysis Library — pandas.
pydata.org. https://pandas.pydata.org/.

[2] Google Colaboratory. 2019. google.com. https://colab.research.google.com/.

About the author

Dan Lou is a senior librarian at Palo Alto City Library, where she works on web content management and develops pioneering projects on robotics, coding, AR, and the distributed web. Her particular interests include Python, machine learning, and data analytics. Previously, she worked as a technical project manager at Wolfram Alpha, a knowledge base queried by Apple Siri and Amazon Alexa. Before that, she was a systems librarian at Innovative Interfaces.

Author email: loudan980@gmail.com

The Code4Lib Journal – A Fast and Full-Text Search Engine for Educational Lecture Archives

E-lecturing and online learning have become more common and convenient than offline teaching and classroom learning in the academic community since the COVID-19 pandemic. Universities and research institutions record the lecture videos delivered by faculty members and archive them internally, while most lecture videos are hosted in private channels on popular video-sharing platforms. Students access published lecture videos independent of time and location, but searching large video repositories is difficult for students because search is restricted to metadata.
We present the design and development of an open-source application to build an educational lecture archive with fast, full-text search within the video content.

By Arun F. Adrakatti and K.R. Mulla

Introduction

E-lecturing has become increasingly popular over the past decade, and the amount of lecture video data posted on the internet has grown exponentially. Many universities and research institutions record their lectures and publish them online for students to access independent of time and location. In addition, numerous massive open online courses (MOOCs) are popular across the globe for their ability to provide online lectures in a wide variety of fields; the availability of online courses and their ease of access have made them a popular learning tool. Lecture videos and MOOCs are hosted on cloud platforms to make them available to registered users, and most of these resources are for the benefit of the public. The majority of lecture videos are available on popular video-sharing platforms such as YouTube, Vimeo, and Dailymotion, since a shortage of storage servers and internet bandwidth makes it difficult for academic and research institutions to maintain their own video repositories.

A great number of lecture videos uploaded to online platforms are annotated with only a few keywords, which results in search engines returning incomplete results. Only a limited number of keywords are available for accessing lectures, and searching relies primarily on their occurrences or tags. Videos are retrieved solely on the basis of metadata, such as title, author information, and annotations, or by user navigation from generic to specific topics, so users can fetch materials based on only limited options. Irrelevant and random annotations, navigation, and manual annotations made without considering the video contents are the major stumbling blocks to retrieving lecture videos.
Existing approaches and proposed models

The fast growth of video data has made efficient video indexing and retrieval technologies one of the most essential concerns in multimedia management.[1] Bolettieri et al. (2007) proposed a system based on MILOS, a general-purpose multimedia content management system developed to aid in the design and implementation of digital library applications.[2] The goal was to show how digital objects, such as video documents or PowerPoint presentations, may be reused by utilizing existing technologies for automatic metadata extraction, including OCR, speech recognition, cut detection, and MPEG-7 content-based video retrieval (CBVR). Yang et al. (2011) provided a method for automated lecture video indexing based on video OCR technology: they built a new video segmenter for automated slide video structure analysis and implemented a new algorithm for slide structure analysis and extraction based on the geometrical information of identified text lines.[3] An approach for automated video indexing and video search in large lecture video archives has also been developed.[4] By using video optical character recognition (OCR) technology on key-frames and automatic speech recognition (ASR) on lecture audio files, automatic video segmentation and key-frame identification can provide a visual guideline for video content navigation as well as textual metadata. For keyword extraction, both video- and segment-level keywords are recovered for content-based video browsing and search, using OCR and ASR transcripts as well as detected slide text. The authors developed and proposed speech-to-text extraction from the videos together with OCR text extraction from the slides used while delivering the lectures. Most of these proposed ideas and developed concepts exist in commercial platforms that academic and research institutions cannot afford.
Motivation for the project

In the Indian scenario, the vast majority of lecture video repositories and MOOCs are hosted on the YouTube platform, with the web links embedded in content management systems (CMS) and e-learning applications. Typically, these applications search only the metadata and human annotations associated with specific videos. Some applications lack search engines altogether; the SWAYAM (Study Webs of Active-Learning for Young Aspiring Minds) platform is the best example of this, as the user needs to browse the videos on that platform by subject. Commercial video lecture applications focus only on recording and organizing video lectures by subject and topic; retrieval of videos from the repositories is the least of their concerns, and the applications neglect user needs. Users do not find the desired information in lecture video repositories and invest a lot of time browsing and listening to videos to find it. The popular open source institutional repository systems are limited to maintaining document and image files, with retrieval restricted to metadata and human annotations. These limitations and restrictions led to the design and development of an educational lecture archive focused on retrieving videos based on their content.

Development of a content-based video retrieval system

The capture of e-lectures has become more popular, and the amount of lecture video data on the web is growing rapidly. Universities and research institutions record their lectures and publish them online for students to access independent of time and location. On the other side, users find it difficult to search for the desired information in these video repositories. To overcome this issue, the application was designed and developed to organize educational lecture videos with a practical approach: searching videos based on the entire speech contained in the video.
The application is named CBMIR (Content-Based Multimedia Information Retrieval). The scope of CBMIR covers only educational lecture video repositories maintained and hosted by universities, R&D, and nonprofit organizations of India, and is limited to speech extraction and automatic text indexing of lecture videos available in the English language.

Figure 1. Overview of the CBMIR application.

Technical requirements

Linux-based OS: Ubuntu
Django: a Python-based web framework, used as the content management system
Python: programming language
Anaconda: Python distribution
Whoosh: a fast, full-text Python search engine library

The CBMIR application has three major modules: the administrator, information processing & retrieval, and the user platform. Figure 2 shows the detailed workflow, from an administrator uploading video into the repositories to a user retrieving the desired videos.

Figure 2. Flow chart of content-based multimedia information retrieval.

Modules of the CBMIR application

Administrator module

The administrator module allows the admin to upload external video using hosted weblinks, or to upload a video file from a personal device or external storage device. Figure 3 shows the dashboard of the administrator module. The authorized admin has access to the application to manage and upload data. The admin can view, edit, and delete the uploaded videos using the manage content library option. The admin can also add new users and assign limited admin roles to distribute the workload of uploading content when building large educational repositories, and has the privilege to add, edit, and delete the text of transcripts uploaded into the database.

Figure 3. Dashboard of the administrator module.

Information retrieval and processing module

The following are the step-by-step processes for the information processing and storage of videos within the CBMIR application.
Figure 4 shows the status bar of the work process after uploading an external video link into the CBMIR application.

Uploading videos:
Download the video from an external source: the video is downloaded to the application server from an external server, using the YouTube API to access YouTube data and the Google API for login credentials.
Download the video from a device: the video is uploaded from a personal device or external storage device using a Python library.
Video to audio: the video file is converted to audio using the FFmpeg Python library.
Audio to text: the audio file is converted to a text file by speech recognition, using the PocketSphinx Python library with an acoustic model.
Text segmentation: the whole text is broken into multiple segments based on the timings.
Automatic indexing and searching: the text file is auto-indexed using the Whoosh library.

Figure 4. The status bar of information processing and storage.

User module

The user module acts as a search engine. The page contains a search box, with a dashboard below it displaying the total number of videos and the subject- or topic-wise collections. The user searches for desired information using keyword terms, and search results are displayed based on word occurrence. The transcripts converted from the audio files are auto-indexed by the Whoosh library into a fast, searchable format; the keyword term is searched in the database and a list of videos is displayed. A video starts playing with a single click.

Figure 5. Display of search results for the keyword "speech."

The whole text is broken into multiple segments based on the speech timing and auto-indexed into the database. With the advantage of segmentation, the video starts playing at the specific keyword or phrase searched by the user.

Figure 6. Display of search results based on time segmentation.
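The timing-based segmentation step can be sketched in a few lines. This is a simplified illustration of the idea, not the CBMIR code: recognized words arrive with timestamps, and grouping them into fixed-width time windows lets playback jump to the segment containing a matched term. The function name, window size, and sample data are all invented here.

```python
def segment_transcript(timed_words, window=30):
    # Group (seconds, word) pairs into fixed-width time windows so a
    # keyword hit can map back to a playback start time.
    segments = {}
    for t, word in timed_words:
        start = int(t // window) * window
        segments.setdefault(start, []).append(word)
    return [(start, " ".join(words)) for start, words in sorted(segments.items())]

recognized = [(2.1, "welcome"), (3.0, "to"), (4.0, "the"), (4.8, "lecture"),
              (31.5, "today"), (33.0, "we"), (34.2, "cover"), (35.0, "indexing")]
print(segment_transcript(recognized))
# [(0, 'welcome to the lecture'), (30, 'today we cover indexing')]
```

A search hit on "indexing" would then start playback at the 30-second segment rather than the beginning of the video.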
Unique features of the application

Open source application: CBMIR is designed and developed on open source web applications and databases. The source code will be shared on a popular source code distribution platform for further development by the community.

Fast and full-text search engine: searching through a full-text database or text document is referred to as full-text search. The search engine embedded within CBMIR analyzes all the words contained within a document and compares them to the search criteria specified by the user. Full-text search over the converted transcripts ensures fast retrieval of results even for large numbers of documents, with accurate and precise results across all fields.

Domain-based lecture video repository: CBMIR allows subject-based lecture video archiving for academic institutions. Such a repository provides consolidated subject-specific lectures and helps users spend less time searching popular video sharing platforms for lectures.

No video advertisements: videos in the CBMIR application play without advertisements, whereas popular video sharing platforms typically include advertisements at the beginning and middle of their videos.

Lightweight, less storage space, and unlimited video upload: the CBMIR application is very lightweight; it extracts speech, converts it to text, and stores the text files in the database, which usually requires little storage space. There is no limit on uploading videos into the application.

Text segmentation for retrieval of video content: the application divides the converted transcript into sets of words along with video timestamps, and the video plays from the specific timestamp matching the user's desired search term.
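The full-text search feature rests on an inverted index: a mapping from each word to the set of documents (here, transcripts) containing it. Whoosh maintains such an index on disk with analyzers and relevance scoring; a toy in-memory version with invented transcript data shows the core idea:

```python
def build_index(docs):
    # Tiny inverted index: word -> set of doc ids containing that word.
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, query):
    # Return ids of documents that contain every word in the query.
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

transcripts = {
    "video1": "introduction to speech recognition",
    "video2": "indexing and search with speech transcripts",
}
index = build_index(transcripts)
print(search(index, "speech"))           # matches both videos
print(search(index, "speech indexing"))  # matches only video2
```

Because lookups touch only the query words rather than scanning every transcript, queries stay fast as the repository grows, which is the property the application relies on.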
Conclusion

Due to COVID-19, e-lecturing became a part of the academic community at all levels of education. Academic institutions find it difficult to archive lecture videos on internal servers and instead host them on popular video sharing platforms for future use, while users find it difficult to search for the desired video in large repositories, where the search function is restricted to the metadata, human annotations, and tags of a particular video. To overcome this problem, the CBMIR application has been developed using open source technology to build subject-based educational video repositories focused on fast, full-text search of the video content. The current CBMIR application is limited to converting speech to text in the English language. Further development of the application includes converting speech to text for Indian regional languages, translating converted transcripts into Indian regional languages, adding voice search options, text extraction using OCR technology, finding objects in the videos, creating a dashboard for the user module, and more. The source code of the CBMIR application is being submitted to the authors' university in order to fulfill requirements for a degree and will be distributed on GitHub under a Creative Commons CC BY-NC license after completion of the degree.

References

[1] Saoudi, E. M., & Jai-Andaloussi, S. (2021). A distributed content-based video retrieval system for large datasets. Journal of Big Data, 8(1). https://doi.org/10.1186/s40537-021-00479-x

[2] Bolettieri, P., Falchi, F., Gennaro, C., & Rabitti, F. (2007). Automatic metadata extraction and indexing for reusing e-learning multimedia objects. Proceedings of the ACM International Multimedia Conference and Exhibition. https://doi.org/10.1145/1290067.1290072

[3] Yang, H., Siebert, M., Lühne, P., Sack, H., & Meinel, C. (2011). Lecture video indexing and analysis using video OCR technology.
about the authors

arun f. adrakatti (arun@iiserbpr.ac.in) is assistant librarian at the indian institute of science education and research (iiser) in berhampur, odisha, india. k.r. mulla (krmulla@vtu.ac.in) is librarian at visvesvaraya technological university in belagavi, karnataka, india.

issn 1940-5758. this work is licensed under a creative commons attribution 3.0 united states license.

the code4lib journal – click tracking with google tag manager for the primo discovery service (issue 55, 2023-1-20)

this article introduces practices at the oregon state university library aiming to track the usage of unpaywall links with google tag manager in the primo discovery interface. unpaywall is an open database of links to full-text scholarly articles from open access sources[1]. the university library adds unpaywall links to primo to give patrons free and legal full-text access to journal articles and to promote more use of open access content. however, usage of the unpaywall links is unknown, because primo does not track the customized unpaywall links.
this article details how to set up google tag manager to track the usage of unpaywall links and how to create reports in google analytics. it provides step-by-step instructions, screenshots, and code snippets so that readers can adapt the solution for their own integrated library systems.

by hui zhang

introduction

in 2020, staff at oregon state university library started a project to provide single-click links to open access content in 1search[2], the university's library discovery interface built on the primo service platform[3]. the goal of the project is to provide free and legitimate access to full-text scholarly resources for patrons. although primo already includes open access content in its search results, studies [1] show that primo's solution has significant flaws in indexing and providing open access resources. ultimately, we decided to use unpaywall, an open database that harvests and indexes tens of millions of open access scholarly articles, as a source of open access content in addition to primo. by customizing the user interface (ui) of primo, we added unpaywall links for open access content to the search results and the individual item view. the problem, however, is that we could not get link usage statistics from primo's analytics tool, because primo does not track the unpaywall links. this article details how to track customized links in primo using google tag manager [4], including testing, troubleshooting, and creating usage reports with google analytics. although the case study is specific to tracking unpaywall links, the workflow and configuration of google tag manager are general: readers may adapt the included snippets and tags to track activity on websites beyond library systems.

adding open access links with primo customization

finding open access resources in primo

primo users can find open access content in two ways.
the first way is to filter the search results by selecting "open access" in the availability facet.

figure 1. open access facet in primo.

the second way is to look for the open access icon that appears, both in the search results and in the full item view, for items identified as open access.

figure 2. open access indicator in primo.

primo provides open access content across many resource types, such as journal articles, books, and theses.

why add unpaywall links to primo

the significant flaw in primo's approach to open access is its preference for subscribed content. one study [1] found that primo's search results link to subscribed journals even when the articles are open access, undermining the visibility of open access content to readers. with growing demand to make more open access content available to patrons, primo's developers added a feature that integrates the unpaywall api so users can find open access articles that might not otherwise appear or be available to them [2]. however, librarians at oregon state university (osu) ultimately decided to add unpaywall links to primo by adopting a solution called oadoi link [3], developed by the primo customization standing group of the orbis cascade alliance, of which oregon state university is a member. one advantage of oadoi link is that, because it is developed locally, the osu library has better access to technical support. more importantly, a recent study [4] found that oadoi links can surface an estimated 30% more open access articles than primo's unpaywall feature. our solution [5] extends the oadoi links to show unpaywall links for open access items in the brief display view, next to the availability status.

figure 3. customized unpaywall link shown in the brief display of an open access item.
offering unpaywall as single-click links is a significant usability improvement, as library patrons can access full-text content without authenticating with their university credentials.

the challenge of tracking unpaywall link usage in primo

because the unpaywall links are added to primo by a ui customization, primo does not track them, and we cannot get usage statistics for them. this is a major problem, as usage data is crucial evidence for assessing the impact and success of the unpaywall project. to overcome the problem, we investigated using google tag manager to track the customized unpaywall links. because primo continues to gain new features, including link tracking, it is worth describing the current situation so that readers understand the motivation and contribution of our approach. ex libris, the company that develops primo, added the capacity to track and report the usage of unpaywall links in august 2021 [6]. however, that feature is only available in primo ve [7], a newer cloud computing platform distinct from primo. the ex libris solution was unavailable when we investigated google tag manager in 2020, and at that time we used primo, not primo ve, as our discovery interface. for full disclosure, oregon state university library migrated its discovery interface to primo ve in the summer of 2022. however, many libraries worldwide still use primo, and our work on google tag manager will help them track customized links like unpaywall in their discovery interfaces.

tracking unpaywall links with google tag manager

how google tag manager works

many readers, like us, may at first be confused about what google tag manager actually is. we will explain how it works by answering two questions: what is a tag, and what is the difference between google tag manager and google analytics?
according to google, a tag is a code snippet deployed to measure website user activity [8]. before tools like google tag manager were available, these tags, or tracking codes, were usually created and deployed by developers. with google tag manager, however, people can create, test, and deploy a tag without programming skills. setting up a tag requires creating triggers that tell the manager when, where, and how to fire the tag. google tag manager provides two types of click triggers: all elements and just links. the all elements trigger can track clicks on any element on a page, e.g. links, images, and buttons. the just links trigger tracks clicks only on html links that use the <a> element. google tag manager and google analytics are two different tools, but they are meant to be used together: google tag manager can add a google analytics tracking code (i.e., a tag) to the website, but it cannot create reports. instead, it sends website activity data to google analytics for analysis and reporting.

adding google tag manager to primo

your first step is to create a google tag manager account. go to the google tag manager website and create an account, or log in with your google account. then create and name a container with "web" as the target platform; once the container has been created, you will be given an id in the format "GTM-XXXXXXX". take note of the container id, because you will need it in the next step. we suggest creating a container for every website, then defining tags in that container for the activities you want to track. next, you will add a snippet to primo that allows google tag manager to track web activity. the technical details of managing and customizing the primo ui package are beyond the scope of this article, but primo administrators should have the knowledge and privileges to add the sample javascript snippet below to primo.
make sure you use the correct container id in the snippet.

/* google tag manager */
const gtmId = 'GTM-XXXXXXX'

function addGtm(doc) {
  const newScript = doc.createElement('script')
  const scriptText = `(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
  new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
  j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
  '//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
  })(window,document,'script','dataLayer','${gtmId}');`
  newScript.innerHTML = scriptText
  doc.head.append(newScript)

  const noScript = doc.createElement('noscript')
  const noScriptText = `<iframe src="//www.googletagmanager.com/ns.html?id=${gtmId}"
  height="0" width="0" style="display:none;visibility:hidden"></iframe>`
  noScript.innerHTML = noScriptText
  doc.body.insertBefore(noScript, doc.body.firstChild)
}

addGtm(document)

then save and deploy the change so it takes effect in primo. congratulations! you have done all the required configuration in primo; everything else happens in google tag manager.

creating a tag and trigger for unpaywall link clicks

you will create a google tag manager tag and trigger in the newly created container. for example, you can create a tag named "unpaywall" with the type google analytics and associate it with an existing google analytics account. in our case, we associated the tag with primo's google analytics account.

figure 4. google tag manager tag with the type of google analytics.

inside the new tag, you need to create a trigger that fires the tag when the unpaywall link is clicked.

figure 5. google tag manager trigger for tracking unpaywall link clicks.

because we are tracking link click events, make sure you select "all elements" as the trigger type. the next step is to attach a condition, or rule, to the trigger.
in our example, we create a condition like this: the tag is activated when links with the text "open access(via unpaywall)" are clicked. the link text used in the trigger is the label of the unpaywall link that our customization code adds to the primo ui. if you want to customize the method, you can define a trigger with different conditions appropriate to your needs.

trigger testing and troubleshooting

you can test newly created tags and triggers using the preview feature in google tag manager.

figure 6. google tag manager preview for testing tags and triggers.

the preview feature is available at the container level. a pop-up window appears after clicking the "preview" button, where you can enter the url of the primo instance for testing. after the connection is established, you can go to the primo website, click the unpaywall link, and check whether the tag fires as expected in google tag manager. the browser's developer console is your best tool for troubleshooting; for example, we used chrome's console to confirm that the label text of the unpaywall link was the same as the text entered in the trigger condition. after finalizing the configuration with the preview, publish the changes by clicking the "submit" button to the right of "preview". you will be asked to create a version of your container and then publish it. deployment is reasonably quick; in our case, click data appeared in google analytics a few minutes after we published the change in google tag manager.

generating a usage report with google analytics

you can use the many report functions in google analytics to review and analyze the data collected by google tag manager. in our example, the statistics for unpaywall link usage appear under behavior > events, in the "click" event category.

figure 7. click event report in google analytics.
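the matching rule behind the trigger can be reproduced in plain javascript when troubleshooting in the console (a hedged sketch of the comparison only, not gtm's internal code; the label string is the one from our primo customization):

```javascript
// Sketch of the trigger's condition: fire only when the clicked
// element's text equals the label of the customized unpaywall link.
const TRIGGER_TEXT = "open access(via unpaywall)";

function shouldFireTag(clickText) {
  // Normalize surrounding whitespace before comparing, since link
  // text read from the DOM often carries stray spaces.
  return clickText.trim() === TRIGGER_TEXT;
}
```

feeding a link's textContent to a check like this confirms that the label matches the trigger condition exactly.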
the result is promising: it shows that patrons are drawn to open access content, and there is a clear trend of more patrons using the unpaywall links. for instance, the unpaywall link was clicked 53,361 times during the 2021 calendar year. that number jumped to 60,534 in the first six months of 2022, before osu migrated its discovery interface to primo ve.

conclusion

in this article, we describe our work tracking customized unpaywall links in primo with google tag manager. we outline the motivation for the project, introduce google tag manager, and provide details on defining tags for tracking unpaywall link clicks. we have used the data collected by google tag manager for decision-making; for example, based on the usage statistics of unpaywall links collected in primo, we decided to continue providing open access and unpaywall links in primo ve. by integrating google tag manager and google analytics, we can also gain more insight into patrons' activities, such as which open access databases are popular and which subjects patrons are most interested in. we hope the code and screenshots are helpful and that readers can refer to them in their own work.

notes

[1] unpaywall: https://unpaywall.org/
[2] 1search: https://search.library.oregonstate.edu/
[3] primo service platform: https://exlibrisgroup.com/products/primo-discovery-service/
[4] google tag manager: https://tagmanager.google.com/

references

[1] bulock, c. (2021). finding open content in the library is surprisingly hard. serials review, 47(2), 68-70. doi: 10.1080/00987913.2021.1936416
[2] how to utilize the unpaywall api for open access content and resources in discovery. (2021). ex libris knowledge center. https://knowledge.exlibrisgroup.com/alma/knowledge_articles/how_to_utilize_the_unpaywall_api_for_open_access_content_and_resources_in_discovery
[3] oadoi link. (2022). orbis cascade alliance.
https://www.orbiscascade.org/programs/systems/pcsg/primo-ve-toolkit/oadoi-link/
[4] veldhuisen, k. (2020). unpaywall in alma, oadoi customization in primo (and other open access). https://docs.google.com/document/d/1rbz7l4_ktra7psxfxatpyjev-og3sizw1qjrpxc5ebk/edit
[5] osulp/1search-ui-package. (2021). github. https://github.com/osulp/1search-ui-package
[6] primo ve 2021 release notes. (2022, may 6). ex libris knowledge center. https://knowledge.exlibrisgroup.com/primo/release_notes/002primo_ve/2021/010primo_ve_2021_release_notes
[7] primo ve overview. (2022, september 18). ex libris knowledge center. https://knowledge.exlibrisgroup.com/primo/product_documentation/020primo_ve/primo_ve_(english)/010getting_started_with_primo_ve/005primo_ve_overview
[8] overview. (2022). google developers. https://developers.google.com/tag-platform/devguides

about the author

hui zhang (hui.zhang@oregonstate.edu) is the digital services librarian at oregon state university.

the code4lib journal – drying our library's libguides-based webpage by introducing vue.js (issue 55, 2023-1-20)

at the kingsborough community college library, we recently decided to bring the library's website more in line with dry principles (don't repeat yourself). we felt this could improve the site by creating more concise and maintainable code: dryer code would be easier to read, understand, and edit.
we adopted the vue.js framework in order to replace repetitive, hand-coded dropdown menus with programmatically generated markup. using vue allowed us to greatly simplify the html documents while also improving maintainability.

by mark e. eaton

keeping it dry

a common goal among programmers is to write code that is dry, in other words, code where you don't repeat yourself. this is usually motivated by the insight that computers can often automate repetitive tasks effectively, making it unnecessary to repeat yourself in code. taking advantage of the efficiencies of automation is widely regarded as a best practice among programmers. however, html, when written by hand, is unfortunately not well suited to dry practices. html is thoroughly declarative: all elements of the page are explicitly laid out by the programmer, so as to fully describe its structure. this means that hand-written webpages are often not very dry; even pages of relatively modest complexity can quickly grow into very long html documents. this can be problematic for a few reasons:

- it can become difficult to conceptualize the structure of a whole page when it stretches over hundreds of lines.
- even relatively trivial aspects of coding, such as indentation, can become difficult with the deeply nested html structures of a large page.
- it is easy to introduce syntax errors or formatting problems into long html documents, because typos are easily overlooked. this is especially problematic where there is no built-in linting or validation.[1]

at our college

these challenges were familiar to us at kingsborough community college, a college of the city university of new york. our homepage, built on libguides cms, ran to over 500 lines, not including the <head> or <footer> sections.
much of this was owing to repetitive dropdown menus: our page relies heavily on bootstrap-based dropdown navigation to provide easy access to many of our services from the library homepage. these hand-coded menus, structured as lists of links, accounted for much of the length of the page's source code. included below is the original code for our hamburger menu, which, despite its length, was in fact the shortest and simplest dropdown menu on our page:

<div class="dropdown" id="hamburger-container">
  <button class="btn btn-default dropdown-toggle" type="button"
      data-toggle="dropdown" aria-haspopup="true" aria-expanded="true"
      id="hamburger">
    <i class="fas fa-bars" style="font-size: 2em;"></i>
  </button>
  <ul class="dropdown-menu fade" aria-labelledby="hamburger" id="hamburger-ul">
    <li>
      <a class="searchmenu" aria-label="onesearch"
          href="https://library.kbcc.cuny.edu/onesearch">
        <div class="highlight-menu-item bigger-fancy-text">
          <i class="fas fa-search fa-fw bigger-icon" aria-hidden="true"></i>
          <strong>onesearch</strong>
        </div>
      </a>
      <a class="searchmenu" aria-label="databases a to z"
          href="https://library.kbcc.cuny.edu/az.php">
        <div class="highlight-menu-item bigger-fancy-text">
          <i class="fas fa-database fa-fw bigger-icon" aria-hidden="true"></i>
          <strong>databases a-z</strong>
        </div>
      </a>
      <a class="searchmenu" aria-label="research guides"
          href="https://library.kbcc.cuny.edu/guides">
        <div class="highlight-menu-item bigger-fancy-text">
          <i class="fas fa-telescope fa-fw bigger-icon" aria-hidden="true"></i>
          <strong>research guides</strong>
        </div>
      </a>
      <a class="searchmenu" aria-label="faq"
          href="https://library.kbcc.cuny.edu/faq">
        <div class="highlight-menu-item bigger-fancy-text">
          <i class="fas fa-question-circle fa-fw bigger-icon" aria-hidden="true"></i>
          <strong>faq</strong>
        </div>
      </a>
      <a class="searchmenu" aria-label="hours"
          href="https://library.kbcc.cuny.edu/calendar">
        <div class="highlight-menu-item bigger-fancy-text">
          <i class="fas fa-clock fa-fw bigger-icon" aria-hidden="true"></i>
          <strong>library hours</strong>
        </div>
      </a>
      <a class="searchmenu" aria-label="site map"
          href="https://library.kbcc.cuny.edu/sitemap">
        <div class="highlight-menu-item bigger-fancy-text">
          <i class="fas fa-location-arrow fa-fw bigger-icon" aria-hidden="true"></i>
          <strong>site map</strong>
        </div>
      </a>
    </li>
  </ul>
</div>

abstracting away some of that repetition was, in important ways, an obvious win for the maintainers of the library webpage. there were clear benefits to abstraction. specifically, drying the page would:

- provide increased simplicity and maintainability;
- align us more closely with contemporary best practices in web development;
- allow us to write more aesthetically pleasing code;
- allow us to adopt and learn a modern javascript framework;
- raise the technical bar for what we are attempting to accomplish with our webpage.

in brief, it would make the site better and make life easier for the maintainers. these improvements were not undertaken without some hesitation. our library has non-technical librarians who work with libguides daily and who may also want to edit our webpage. we worried that adding another layer of abstraction might confuse them: they would no longer be able to "see" the full html document object model (dom) to their satisfaction, and therefore no longer be able to properly understand and manipulate it themselves. this was an important concern. on the other hand, reducing the html devoted to dropdowns might in fact make other parts of the website more legible to our non-technical colleagues, because it would reduce the amount of noise that a non-expert user needs to filter through to accomplish their goals. in this sense, simplifying is also a way to improve access to the code. we decided to proceed because we felt that, on balance, the benefits outweighed the drawbacks. the tradeoff is that the page becomes more maintainable for some, while being of mixed benefit to others.
this project was the best way we could find to address these issues in a balanced way, while making sustained progress on the further development of the site.

selecting and using vue.js

the tool we chose for this work was vue.js (referred to as vue in the text that follows). vue is what is called a "progressive" javascript framework, in that it aims to scale up as well as scale down. scaling down was important to us: our use case was not complicated, and we did not need the overhead of the complex build systems common to many javascript frameworks. we wanted something we could use within our cms, and helpfully, vue can be used that way. we were able to import vue as a library with a simple call to a content delivery network (cdn), which let us use it much as we would use other common libraries like bootstrap or jquery. we had access to many of vue's abstractions simply by including <script src="https://cdn.jsdelivr.net/npm/vue@2.7.8"></script> in our page, without any other complex overhead. vue provides a very useful templating system for building html programmatically. we were familiar with html templating from previous work with python's flask framework and its templating engine, jinja. jinja is conceptually somewhat similar to vue's templating system, which helped us wrap our heads around parts of vue. however, in our opinion, vue provides functionality beyond what is possible with jinja, such as two-way binding and broader control of the dom. vue allowed us to write directives such as v-for, which is essentially a for loop for constructing part of the dom. constructing a long list of links with a v-for loop is a huge improvement over typing out many lines of html. the content of the individual items, such as links, text, icons, and so on, can be stored as a javascript object in our vue constructor.
vue's directives let us draw content from this object while structuring the html with the templating syntax. this approach gives us the full power of javascript and vue when building our html, and it is far more powerful and accurate than patiently typing out html by hand. here is the html for the same hamburger menu shown in the previous code snippet, now dryed with vue:

<div class="dropdown" id="hamburger-container">
  <button class="btn btn-default dropdown-toggle" type="button"
      data-toggle="dropdown" aria-haspopup="true" aria-expanded="true"
      aria-label="hamburger menu" id="hamburger">
    <i class="fas fa-bars" style="font-size: 2em;"></i>
  </button>
  <ul class="dropdown-menu fade" aria-labelledby="hamburger" id="hamburger-ul">
    <li>
      <a v-for="item in hamburger" v-bind:key="item.id" class="searchmenu"
          v-bind:aria-label="item.description" v-bind:href="item.link"
          v-bind:target="item.target_blank">
        <menus-component v-bind:item="item"></menus-component>
      </a>
    </li>
  </ul>
</div>

the corresponding vue component that makes this v-for loop happen looks like this:

const menusComponent = {
  template: `<div class="bigger-fancy-text">
    <i v-bind:class="item.icon" aria-hidden="true"></i>
    <strong>{{ item.description }}</strong>
  </div>`,
  props: ['item']
}

and the vue constructor:

new Vue({
  el: '#app',
  components: {
    'menus-component': menusComponent
    …
  },
  data() {
    return {
      hamburger: [
        {
          link: "https://cuny-kb.primo.exlibrisgroup.com/discovery/search?tab=everything&vid=01cuny_kb:cuny_kb&lang=en",
          icon: "fas fa-search fa-fw bigger-icon",
          description: "onesearch",
          target_blank: "",
          id: 1
        },
        …

we uploaded our vue code as a single file to the "upload customization files" section of the libguides cms admin interface. while this was quite effective, there were a couple of notable downsides. the "upload customization files" section of libguides cms is a bit hidden away in the admin interface.
this is not our preference, but a design decision by springshare, the maker of libguides; the result is that someone new to the project, or new to libguides, might not immediately know where to look for the configuration files that are essential to rendering the page. it is also very important that the uploaded javascript be valid, since formatting errors mean the data won't load when the page loads, causing major problems; indeed, bad javascript often results in parts of the page not being rendered at all. the solution we adopted is simply to remember to validate our code before uploading it. we used babeljs (https://babeljs.io) for this purpose. babeljs allows us to paste in our code – for example, our vue constructor, including the data object – and it will flag any syntax errors. this is clearly not the most automated workflow, but it was simple enough to be an effective strategy for us.[2]

assessing our approach

this project was not an entire rewrite of our library webpage in the idiom of vue.js; that was not our goal, and it was far beyond the scope of this project. we simply took parts of the page that were easily dry-able and used vue to render them. for the most part, this process consisted of replacing the numerous hand-coded lists of links that produce our site's menus. we focused on these lists because they were the principal offenders making our page source too long and unmanageable. we tested this approach ahead of time by implementing vue's v-for directive on the web librarian's personal projects page. this trial run worked surprisingly seamlessly, with essentially no problems of any significance, which gave us the confidence to move ahead with implementing vue on the library homepage. and if there were major problems, we knew we could always roll back to a previous version using version control (we use git and github). nonetheless, the transition was not entirely without problems.
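as an aside on the validation step described above: the javascript engine's own parser can provide a similar, lightweight syntax check (a generic sketch, not our actual babeljs workflow):

```javascript
// Generic sketch: detect syntax errors before uploading a file.
// new Function() parses the source and throws a SyntaxError on
// malformed input, without executing the code body.
function isValidJs(source) {
  try {
    new Function(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected at parse time
  }
}
```

a check like this catches typos such as a missing value or bracket before they can break the rendered page.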
our initial, most naive approach introduced new <div>s to the page, which broke the existing css. we initially (and mistakenly) thought these problems were caused by vue, but they were in fact due to our css not behaving as expected with the new dom structure. in hindsight, it should have been obvious to us that changing the page structure would break the css. the good news was that our vue code was fine and working more or less as expected. tweaking the html created by vue – so as not to break our css – was entirely doable. we solved the problem by configuring our v-for loops to faithfully recreate the original dom structure, which allowed the css to work properly and as expected; in this way, the project was completed without needing to rewrite any of our css. moreover, we found that besides iterating through individual menus with vue, we could dry the next level of our webpage's hierarchy by having vue iterate through the list of menus (in other words, through the entire nav bar), creating another layer of abstraction, automation, and benefit to the maintainers. while there appears to be more than one way to tackle this problem, we settled on creating a second vue component to handle the second-order drying logic. this higher-order abstraction is new to us, and we expect to make it better and more efficient over the coming weeks and months. the limited goals and constrained scope of this project meant that the changes we attempted were not too overwhelming to implement. we rolled out our improvements within a few weeks, without disrupting other, existing workflows. we did the work in our "sandbox" libguides group before moving the code over to our production group, to avoid breaking, even temporarily, the production homepage. one especially pleasing part of this project was that, from the perspective of our users, there was no change at all to our website.
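a second-order component of the kind described above might look something like this (a hypothetical sketch with invented names, not our production code):

```javascript
// Hypothetical sketch: instead of one v-for per menu, iterate over a
// list of menus and let the existing item component render each entry.
const navbarComponent = {
  template: `<nav>
    <ul v-for="menu in menus" v-bind:key="menu.name" class="dropdown-menu">
      <li v-for="item in menu.items" v-bind:key="item.id">
        <a class="searchmenu" v-bind:href="item.link"
            v-bind:aria-label="item.description">
          <menus-component v-bind:item="item"></menus-component>
        </a>
      </li>
    </ul>
  </nav>`,
  props: ['menus']
}
```

registered alongside the item component, a component like this moves the whole nav bar's structure into data, so adding a menu means adding an object rather than another block of html.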
from a user's point of view, the site remained identical. we were able to increase the maintainability of the site with no end-user effects, which was a big win. to look at these changes quantitatively: our homepage was initially 501 lines of html, and after applying our drying techniques it was 279 lines, a reduction of 44%. while our goal is not to play code golf, we felt this alone made the project a worthwhile endeavor. of course, this appealing headline number is somewhat offset by the added cognitive load (and lines of code) of the vue components and constructor, but it is nonetheless fair to say that the page is more concise. the new logic builds much of the dom automatically, with less room for human error.

conclusion

the result of our work is a more manageable and concise html document, which is easier to maintain. adopting vue for our limited use case turned out to be a good decision, and we hope to find compelling reasons to go further with vue's more advanced features in the future. for example, as a next step, we intend to move more of the page's logic into vue methods and computed properties. one possible outcome of this project is that it may eventually lead to a full rewrite of the webpage that more fully embraces the vue idiom. adopting vue as the principal organizing framework for our page offers many exciting possibilities beyond our current bootstrap-oriented setup. vue could expand the functionality available to us and truly move the needle when it eventually comes time to fully redesign our webpage. drying our page is a small first step in that direction.

endnotes

[1] the platform we use, libguides cms, does not provide built-in linting or validation.
[2] interestingly, this is a case where a more sophisticated build system would be an advantage, as linting and validation could be included as part of the build.
This is something for us to consider in the future, if we decide to adopt Vue further.

About the author

Mark E. Eaton is a reader services librarian (associate professor) at Kingsborough Community College (City University of New York).

ISSN 1940-5758. This work is licensed under a Creative Commons Attribution 3.0 United States License.

The Code4Lib Journal – Data Preparation for Fairseq and Machine-Learning Using a Neural Network

Issue 55, 2023-1-20

This article aims to demystify data preparation and machine-learning software for sequence-to-sequence models in the field of computational linguistics. The tools, however, may be used in many different applications. In this article we detail what sequence-to-sequence learning looks like, using code and results from different projects: predicting pronunciation in Esperanto, predicting the placement of stress in Russian, and how open data like WikiPron (pronunciation data mined from Wiktionary) makes projects like these possible. With scraped data, projects can be started in automatic speech recognition, text-to-speech tasks, and computer-assisted language learning for under-resourced and under-researched languages. We will explain why and how datasets are split into training, development, and test sets. The article will also discuss how to add features (i.e., properties of the target word that may or may not help in prediction).
By scaffolding the tasks and using code and results from these projects, it's our hope that the article will demystify some of the technical jargon and methods.

By John Schriner

Introduction

There are many tools in the field of natural language processing (NLP) and computational linguistics that: help us to understand language better; find patterns that we cannot perceive; find word collocations (i.e., words that commonly appear near other words); improve text-to-speech; perform text summarization; perform information extraction; provide sentiment analysis; and perform machine translation. Some of these tools are the user-friendly, web-based Voyant Tools,[1] the Python software platform Natural Language Toolkit (NLTK),[2] and the Praat[3] phonetics software for examining sound. The NLP tool Linguistic Inquiry and Word Count (LIWC)[4] is a psycholinguistic black-box[5] tool that can provide sentiment analysis, language style matching, and many other metrics using over 100 dimensions of text. LIWC has been widely used for decades, is dictionary-based, and does not involve machine learning. Although we may not see much conspicuous use of machine learning in libraries at present, any project in library and information science that maps an input sequence to an output sequence could be improved with this technology; indeed, our discovery services and search engines embrace techniques identified as early as 1995 that can "analyze user queries, identify users' information needs, and suggest alternatives for search" (Chen, 1995, p. 1). Moving to the present day, in Zhu & Lei (2022) we see machine learning being used to classify research topics in COVID-19 research. They extract noun phrases from an experimental corpus of full-text articles indexed in Web of Science; these noun phrases numbered 19,240, with a minimum frequency of 10 per million words.
Zhu & Lei (2022) identify research topics whose subject matter was increasing; these are labeled hot topics and categorized into larger categories such as biochemical terms, public health measures, symptoms and diseases, etc. Their methods are robust: they work with six different classification models, finding that a random forest classifier[6] yields the best results. In a similar vein, and apropos of information literacy, Sanaullah et al. (2022) offer a systematic review of COVID-19 misinformation research involving machine learning and deep learning. In their review they selected 43 research articles and categorized them by misinformation type: fake news, conspiracy theory, rumor, misleading information, and disinformation (deceptive information, as opposed to merely inaccurate information in the case of misinformation) (Sanaullah et al., 2022). After a thorough discussion of methods, this survey finds that deep-learning methods are more efficacious than traditional machine-learning methods. With known datasets, or datasets created from scraped web data, we can use modern machine-learning tools for any number of projects in different subfields of linguistics, such as phonology (the study of linguistic sound), morphology (the study of words and how they are formed and used together), and even historical linguistics (the study of languages over time, including language families). This paper focuses on sequence-to-sequence models: the conversion of a sequence from one domain into a sequence of another domain. This could be, for example, Polish words converted to their pronunciation in International Phonetic Alphabet (IPA) format: e.g., osłu ‘donkey’ converted to ɔswu. Such a model would effectively aid text-to-speech systems.
Another example of sequence-to-sequence modeling is predicting the correct inflection and placement of a stress marker given a word and its part of speech: training a model that, when given the Russian word эйфорически ‘euphorically,’ must successfully place the stress on the middle «и́», as in эйфори́чески. The idea is that we will use 80% of the data to train on, 10% as a development set with which to choose the best parameters and model, and 10% as the test set. It's easy to feel overwhelmed by these tools and their architectures. The aim of this paper is to help demystify this particular type of machine learning with a well-prepared dataset and clear project goals.

Importance of open data

Open data is essential for original research and replication studies. SPARC states that "despite its tremendous importance, today, research data remains largely fragmented—isolated across millions of individual computers, blocked by disparate technical, legal and financial restrictions" ("Open data," n.d.). To combat this fragmentation, a call for open data would require that research data: "(1) is freely available on the internet, (2) permits any user to download, copy, analyze, re-process, pass to software or use for any other purpose; and (3) is without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself" ("Open data," n.d.). Open data can be found worldwide in GLAM labs such as the Data Foundry[7] at the National Library of Scotland, and in linguistics repositories such as the Tromsø Repository of Language and Linguistics (TROLLing).[8] The Registry of Research Data Repositories[9] indexes nearly 3,000 research data repositories that provide databases, corpora, tools, and statistical and audiovisual data. With open and well-described data alongside open access papers, our research lives on in repositories, waiting to be replicated, rebutted, added to, or improved.
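The 80/10/10 split described above is easy to sketch in plain Python. This is only an illustration of the idea (the projects in this article use `shuf` and a dedicated splitting script); the function name and the example data are our own:

```python
import random

def split_dataset(rows, seed=103, train=0.8, dev=0.1):
    """Shuffle the rows with a fixed seed, then split them into
    training (80%), development (10%), and test (10%) sets."""
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)
    n = len(rows)
    n_train = int(n * train)
    n_dev = int(n * dev)
    return (rows[:n_train],
            rows[n_train:n_train + n_dev],
            rows[n_train + n_dev:])

# Hypothetical (grapheme, phoneme) pairs standing in for a real TSV.
pairs = [(f"word{i}", f"ipa{i}") for i in range(100)]
train_set, dev_set, test_set = split_dataset(pairs)
print(len(train_set), len(dev_set), len(test_set))  # 80 10 10
```

Fixing the seed matters: it makes the split reproducible, so an experiment can be rerun on exactly the same partitions.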
Projects like the one we describe below rely on data scraped by the WikiPron project (Lee et al., 2020), which provides phonological and morphological datasets coupled with frequency data, all regularly updated and open. The WikiPron project contains 1.7 million pronunciations from 165 languages. Better still, the project released its mining software so that anyone may mine the data themselves, so that researchers "no longer depend on ossified snapshots of an ever-growing, ever-changing collaborative resource" (Lee et al., 2020, p. 4223). Under-researched languages like Adyghe or Urak Lawoi’, or an endangered/moribund language like Wiyot, can benefit from projects that have access to open phonological data for language revitalization or preservation efforts. The WikiPron project even has 452 words from Old French (842 CE – ca. 1400 CE) that could be used to track sound change into modern French. Repositories and applications like WikiPron provide invaluable data that can be used in countless ways.

Projects with fairseq

Each project requires preparing the data in a way that can be used by the software. In this paper we use fairseq (Ott et al., 2019), a "Facebook AI Research Sequence-to-Sequence Toolkit written in Python."[10] The toolkit requires that characters be separated with a space if characters are what we're trying to sequence.[11] A WikiPron dataset may be downloaded as a tab-separated values (TSV) file. In this article we'll look at two projects and how we'd manipulate the data for fairseq.

Esperanto

Esperanto is a constructed language (conlang) created to be a universal auxiliary/second language to aid in international communication.[12] From the WikiPron project we first download the TSV file for Esperanto.[13] In Esperanto, each letter has only one pronunciation, so it should be trivial to convert characters to the IPA pronunciation, and our machine should be able to do this with great accuracy.
Stress is not marked in the dataset, but in Esperanto stress always falls on the penultimate syllable. The data is in two tab-separated columns, with the grapheme (the written word) in the first column and the phonemes (the IPA representation of the pronunciation) in the second column:

Table 1. Example data from the TSV file from WikiPron.

aarono	a a r o n o
abadono	a b a d o n o
abateco	a b a t e t͡s o
abelmanĝulo	a b e l m a n d͡ʒ u l o
abortitaĵo	a b o r t i t a ʒ o

The TSV is shuffled using shuf and then split into three TSV files (an 80% training set, a 10% development set, and a 10% test set) using a Python script:[14]

python3 split.py \
    --seed 103 \
    --input_path epo.tsv \
    --train_path epo_train.tsv \
    --dev_path epo_dev.tsv \
    --test_path epo_test.tsv

To prepare the data for fairseq, the important part of the code to note is that each of the three TSV files is then split into .g (for grapheme) and .p (for phoneme) files for training, dev, and test:

import contextlib
import csv

# Data was shuffled using `shuf` and split 80-10-10 using `split.py`.
train = "epo_train.tsv"
train_g = "train.epo.g"
train_p = "train.epo.p"
dev = "epo_dev.tsv"
dev_g = "dev.epo.g"
dev_p = "dev.epo.p"
test = "epo_test.tsv"
test_g = "test.epo.g"
test_p = "test.epo.p"

# Processes training data.
with contextlib.ExitStack() as stack:
    source = csv.reader(stack.enter_context(open(train, "r")), delimiter="\t")
    g = stack.enter_context(open(train_g, "w"))
    p = stack.enter_context(open(train_p, "w"))
    for graphemes, phones in source:
        print(" ".join(graphemes), file=g)
        print(phones, file=p)

# Processes development data.
with contextlib.ExitStack() as stack:
    source = csv.reader(stack.enter_context(open(dev, "r")), delimiter="\t")
    g = stack.enter_context(open(dev_g, "w"))
    p = stack.enter_context(open(dev_p, "w"))
    for graphemes, phones in source:
        print(" ".join(graphemes), file=g)
        print(phones, file=p)

# Processes test data.
with contextlib.ExitStack() as stack:
    source = csv.reader(stack.enter_context(open(test, "r")), delimiter="\t")
    g = stack.enter_context(open(test_g, "w"))
    p = stack.enter_context(open(test_p, "w"))
    for graphemes, phones in source:
        print(" ".join(graphemes), file=g)
        print(phones, file=p)

As shown above in Table 1, the characters in the second column were already spaced correctly, so we needed to add spaces only to the first column. The result is two files for each set, with spaced characters:

Table 2. Example of data ready for fairseq.

train.epo.g	train.epo.p
s t a c i o	s t a t͡s i o
o m a ĝ o	o m a d͡ʒ o
ĉ i r k a ŭ f l a t a d i	t͡ʃ i r k a w f l a t a d i

The generated files are now ready for pre-processing in fairseq:

fairseq-preprocess \
    --source-lang epo.g \
    --target-lang epo.p \
    --trainpref train \
    --validpref dev \
    --testpref test \
    --tokenizer space \
    --thresholdsrc 2 \
    --thresholdtgt 2

This pre-processing creates a folder called data-bin containing binaries and a log file that reports the number of tokens found. We can now start the training:

fairseq-train \
    data-bin \
    --source-lang epo.g \
    --target-lang epo.p \
    --encoder-bidirectional \
    --seed {choose a random whole numeral} \
    --arch lstm \
    --dropout 0.2 \
    --lr .001 \
    --max-update 800 \
    --no-epoch-checkpoints \
    --batch-size 3000 \
    --clip-norm 1 \
    --label-smoothing .1 \
    --optimizer adam \
    --criterion label_smoothed_cross_entropy \
    --encoder-embed-dim 128 \
    --decoder-embed-dim 128 \
    --encoder-layers 1 \
    --decoder-layers 1

With these parameters it took my machine[15] half an hour to train. Tweaking the max-updates, the number of encoder layers, the architecture (e.g., transformer instead of lstm), or the optimizer will produce different, and perhaps better, results. Doubling the encoder and decoder layers, or doubling the encoder and decoder embedding dimensions to 256, slowed the processing time significantly without improving the model in this case.
The training part of these experiments is meant to help us decide which parameters, out of many different options, we hope will yield the best results.[16] We'll run this training several times with different parameters and choose three models. The dev part (10%) of the experiment is used to choose the model that performs best on the dev set. Lastly, confident in our model, we'll use that model on the test set: as yet unseen data. To determine how well each model is doing, we use fairseq-generate, which provides an error analysis detailing where our model came up short:

fairseq-generate \
    data-bin \
    --source-lang epo.g \
    --target-lang epo.p \
    --path checkpoints/checkpoint_best.pt \
    --gen-subset valid \
    --beam 8 \
    > predictions.txt

The generated error analysis in predictions.txt is quite readable and shows where the hypothesis differs from its target sequence:

S-17	i k t i o s a ŭ r o
T-17	i k t i o s a w r o
H-17	-0.14448021352291107	i k t i o s a w r o
D-17	-0.14448021352291107	i k t i o s a w r o
S-824	e k s i ĝ o n t a j
T-824	e k s i d͡ʒ o n t a j
H-824	-0.12416490912437439	e k s i d͡ʒ o n t a j
D-824	-0.12416490912437439	e k s i d͡ʒ o n t a j
S-1085	k a p t o ŝ n u r o
T-1085	k a p t o ʃ n u r o
H-1085	-0.15732990205287933	k a p t o ʃ n u r o
D-1085	-0.15732990205287933	k a p t o ʃ n u r o

The rows in predictions.txt are source (S), target (T), hypothesis (H; tokenized, meaning any punctuation symbols in a project with sentences would be space-separated), and detokenized (D; not broken into separate linguistic units). The number before the hypothesis is the log-probability of that hypothesis. For our project, if the target matches the hypothesis, the model has predicted correctly. We use a script written by Dr. Kyle Gorman to parse the output of fairseq-generate and compute a word error rate (WER). Under this script, if any character in a word is incorrect, the whole word counts against the word error rate.
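To make the evaluation step concrete, here is a minimal, illustrative way to compute a whole-word error rate from output in the fairseq-generate format shown above. This is a simplified sketch, not Dr. Gorman's script; it assumes tab-separated T (target) and H (hypothesis) lines and counts a word as an error whenever its hypothesis differs at all from its target:

```python
def word_error_rate(lines):
    """Compute a simple whole-word error rate from fairseq-generate
    output: a word counts as an error if its hypothesis (H) line does
    not exactly match its target (T) line."""
    targets, hyps = {}, {}
    for line in lines:
        if line.startswith("T-"):
            idx, text = line.split("\t", 1)
            targets[idx[2:]] = text.strip()
        elif line.startswith("H-"):
            idx, _score, text = line.split("\t", 2)
            hyps[idx[2:]] = text.strip()
    errors = sum(1 for k in targets if hyps.get(k) != targets[k])
    return errors / len(targets)

# Two entries: one correct, one with a missing final character.
sample = [
    "S-17\ti k t i o s a ŭ r o",
    "T-17\ti k t i o s a w r o",
    "H-17\t-0.1444\ti k t i o s a w r o",
    "T-824\te k s i d͡ʒ o n t a j",
    "H-824\t-0.1241\te k s i d͡ʒ o n t a",
]
print(word_error_rate(sample))  # 0.5
```

Because any single-character mismatch fails the whole word, this metric is stricter than a character error rate would be.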
As there should be no ambiguity in pronunciation or in the conversion of a character to a sound, we expected our model to perform near-perfectly. Choosing the model that performed best, we can now give the model the test data. Predictably, on the test data the word error rate was 0.00: a perfect score.

Russian stress

To explore how to add features to the model, we can look at experiments in Russian stress. Features are properties of the target that may or may not help in prediction. Features could include part of speech, frequency, animacy (whether a noun is sentient or not), or many other characteristics. As in the Esperanto project above, we have columns of data in a TSV file.

Table 3. Example data from the TSV file from Schriner (2022).

ямбам	я́мбам	1	ямб	N;MSC;INAN;PL;DAT
шихтовее	шихтове́е	1	шихтовой	A;CMPAR;PRED
щелкануть	щелкану́ть	0	щелкануть	V;PERF;INF
иноки	и́ноки	2	инока	N;FEM;ANIM;PL;NOM
стёсанном	стёсанном	2	стесать	V;PERF;DER;DER/PSTPSS;A;NEU;ANIN;SG;LOC

The first column is the word with no stress marker. The second column is the word with stress marked. The third column is a stress code derived from the placement of the stress in the word: reversing the text in place and counting from 0 at the end of the word, each word was given a stress code, which was added to the TSV as a column. Only vowels in Russian may carry stress, so deriving the stress code was simply a matter of counting vowels until a stress marker occurred. «ё» is always stressed, so the script stops and assigns a code when an «ё» is encountered. The fourth column is the word's lemma (its root). The fifth column contains the full morphology of the word, including its part of speech, the tense for verbs, animacy, gender, grammatical number (whether a noun is singular or plural), and Russian case (e.g., nominative (NOM) case for the subject of a sentence, or dative (DAT) case for an indirect object).
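The stress-code derivation described above can be sketched in a few lines of Python. This is our reconstruction, not the original script; it assumes the stress marker is a combining acute accent (U+0301) following the stressed vowel, and that «ё» is always stressed:

```python
VOWELS = set("аеёиоуыэюя")
ACUTE = "\u0301"  # combining acute accent marks stress

def stress_code(stressed_word):
    """Walk the word from its end, counting vowels from 0, and return
    the count when the stressed vowel is reached; ё stops the count
    immediately because it is always stressed."""
    count = -1
    saw_acute = False
    for ch in reversed(stressed_word.lower()):
        if ch == ACUTE:
            # The next vowel we reach (its base character) is stressed.
            saw_acute = True
            continue
        if ch in VOWELS:
            count += 1
            if saw_acute or ch == "ё":
                return count
    raise ValueError("no stress mark found")

print(stress_code("я\u0301мбам"))   # я́мбам -> 1
print(stress_code("и\u0301ноки"))   # и́ноки -> 2
print(stress_code("стёсанном"))     # stops at ё -> 2
```

Writing the accent as an explicit `\u0301` escape avoids ambiguity between precomposed and decomposed Unicode forms of the same word.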
For the adjective (A) in Table 3, the word is comparative (CMPAR, as in more) and functions as an adjectival predicate (PRED), linked to the subject of the sentence. In this paper we will not be processing this data with fairseq, but some promising results may be found in Schriner (2022). This project is already significantly different from our Esperanto example in that stress in Russian follows complicated patterns and ambiguous rules that will challenge a machine to place the stress correctly. Incorrectly stressed words may be unintelligible, and stress homographs such as óрган ‘organ of the body’ and оргáн ‘organ (musical instrument)’ make correct placement even harder (Wade & Gillespie, 2011). As in the Esperanto example, we have to format our text for fairseq and sequence-to-sequence modeling. To do this we'll again have space-separated characters that we'll convert to other space-separated characters. From Table 3, the word иноки ‘others’ will be converted to и́ноки, so our TSV file should have spaces: и н о к и will convert to и́ н о к и. We want our machine to learn that, given certain features, we can expect a certain outcome in training. The features in Table 3 are: the stress code, the lemma (the root of the word), and the full morphology including part of speech.
We can create several experiments from this data, including:

Given the word and its lemma, predict the stress code:
и н о к и инока  ← the feature added to the spaced characters
2  ← the target: the stress code, the third vowel from the end counting from 0

Given the word and its part of speech, predict the stress code:
и н о к и noun  ← the feature added to the spaced characters
2  ← the target: the stress code, the third vowel from the end counting from 0

Given the word and all of its morphological properties, predict the stress code:
и н о к и N;FEM;ANIM;PL;NOM  ← the feature added to the spaced characters
2  ← the target: the stress code, the third vowel from the end counting from 0

For the first experiment, the data in the TSV would be formatted like so, with the feature appended to the end of the first column, itself with no internal spaces:

Table 4. Formatting the TSV data.

Source (column 1)	Target (column 2)
я м б а м ямб	1
ш и х т о в е е шихтовой	1
щ е л к а н у т ь щелкануть	0
и н о к и инока	2
с т ё с а н н о м стесать	2

The same methods used in the Esperanto example apply: we would train the model using fairseq on 80% of the data, so the model can learn that words like иноки with the root инока have a stress code of 2. Once trained, we choose the model that performs best on the dev set (10%). Then we use that model on the completely unseen data in the test set (10%). By examining and contrasting different experiments, we can see whether knowing the word's root helps in placing the stress, whether adjectives tend to be stressed in particular places, or possibly even that the ambiguity in stress placement cannot be helped by this type of machine learning. Experiments similar to these were conducted in Schriner (2022), showing that knowing the word's root led to the best predictions and the lowest word error rate, while adding the part-of-speech feature led to the worst results and the highest word error rate.
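Producing the source column in Table 4 (space-separated characters followed by an unspaced feature) is a one-line transformation; a minimal sketch, with a function name of our own choosing:

```python
def format_source(word, feature):
    """Space-separate the word's characters and append the feature
    (lemma, part-of-speech tag, etc.) as a single unspaced token."""
    return " ".join(word) + " " + feature

# Rows as (word, lemma, stress code), following Table 3.
rows = [
    ("иноки", "инока", "2"),
    ("щелкануть", "щелкануть", "0"),
]
for word, lemma, code in rows:
    print(format_source(word, lemma) + "\t" + code)
```

Swapping the `feature` argument for a part-of-speech tag or a full morphology string yields the inputs for the other two experiments without touching the rest of the pipeline.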
Conclusion

Preparing for experiments like those above requires hypotheses, planning, and formatting the data for the software. We used fairseq and found that, with our WikiPron data, the model we chose made no errors in predicting Esperanto pronunciation, even on unseen data. In the Russian stress experiment we looked at how to prepare data in the same way, but added features to the model's training. The fairseq framework makes it astonishingly easy to toggle and experiment with different parameters from the terminal and to work on experiments like those described above. With continued, collaborative, and open data, we can expect invaluable further research in this area.

About the author

John Schriner is the e-resources and digital initiatives librarian at NYU Law School. His research tends to coalesce at the intersection of linguistics, cybersecurity, and librarianship.

References

Chen, H. (1995). Machine learning for information retrieval: Neural networks, symbolic learning, and genetic algorithms. Journal of the American Society for Information Science, 46(3), 194–216. https://doi.org/10.1002/(sici)1097-4571(199504)46:3<194::aid-asi4>3.0.co;2-s

Lee, J. L., Ashby, L., Garza, E., Lee-Sikka, Y., Miller, S., Wong, A., McCarthy, A., & Gorman, K. (2020). Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference (pp. 4223–4228).

Open data. (n.d.). SPARC. Retrieved November 29, 2022, from https://sparcopen.org/open-data/

Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., & Auli, M. (2019). fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations) (pp. 48–53). Association for Computational Linguistics.

Sanaullah, A. R., Das, A., Das, A., Kabir, M. A., & Shu, K. (2022).
Applications of machine learning for COVID-19 misinformation: A systematic review. Social Network Analysis and Mining, 12(1), 94. https://doi.org/10.1007/s13278-022-00921-9

Schriner, J. (2022). Predicting stress in Russian using modern machine-learning tools. https://academicworks.cuny.edu/gc_etds/4974/

Wade, T., & Gillespie, D. (2011). A comprehensive Russian grammar. Wiley-Blackwell.

Zhu, H., & Lei, L. (2022). A dependency-based machine learning approach to the identification of research topics: A case in COVID-19 studies. Library Hi Tech, 40(2), 495–515. https://doi.org/10.1108/lht-01-2021-0051

Endnotes

[1] https://voyant-tools.org/
[2] https://www.nltk.org/
[3] https://www.fon.hum.uva.nl/praat/
[4] https://www.liwc.app/
[5] Meaning simply that the input and output are visible but the inner workings and source code are closed.
[6] https://towardsdatascience.com/understanding-random-forest-58381e0602d2
[7] https://data.nls.uk/
[8] https://dataverse.no/dataverse/trolling
[9] https://www.re3data.org/
[10] fairseq can be installed via pip from https://pypi.org/project/fairseq/
[11] This is specified in the preprocessing below.
[12] For a fascinating history of Esperanto from its beginnings through the early Soviet Union, see Brigid O'Keeffe's Esperanto and Languages of Internationalism in Revolutionary Russia, 2021, Bloomsbury Academic.
[13] https://github.com/cuny-cl/wikipron/blob/master/data/scrape/tsv/epo_latn_narrow.tsv
[14] This script is agnostic to the data format and was written by Kyle Gorman and Jackson Lee.
The script can be found here: https://github.com/cuny-cl/wikipron-modeling/blob/master/scripts/split.py
[15] Intel Core i7-6700 CPU @ 3.40GHz × 8 with 32GB RAM
[16] For all available parameters for training, please see: https://fairseq.readthedocs.io/en/latest/command_line_tools.html#fairseq-train

The Code4Lib Journal – Creating a Custom Queueing System for a Makerspace Using Web Technologies

Issue 55, 2023-1-20

This article details the changes made to the queueing system used by Virginia Tech University Libraries' 3D Design Studio as the space was decommissioned and reabsorbed into the new Prototyping Studio makerspace. This new service, with its greatly expanded machine and tool offerings, required a revamp of the underlying data structure and was an opportunity to rethink the React and Electron app used previously in order to make the queue more maintainable and easier to deploy moving forward. The new prototyping queue application utilizes modular design and auto-building forms and queues in order to improve the upgradeability of the app. We also moved away from React and Electron and made a web app that loads from the local filesystem of the computer in the studio and runs on the Svelte framework, using IBM's Carbon Design components to build out the frontend functionality.
The deployment process was also streamlined, now relying on Git and Windows batch scripts to automate updating the app as changes are committed to the repository.

By Jonathan Bradley

The challenges with the old system

The 3D Design Studio at the University Libraries at Virginia Tech used a battle-tested queueing system for years. Created by Jonathan Bradley in 2017, the queueing system was a React app packaged and installed using Electron, with a backend API created using DreamFactory on a DigitalOcean VPS cloud instance. The queue was composed of a form filled out by student workers with the help of patrons, with all of the data dumped into multiple filterable, sortable, and searchable tables that allowed for updating queue entries. The system accommodated multiple queues: one for our standard print requests and a "special request" queue that handled particularly challenging patron requests requiring input from the studio manager. The system also had a few bonus comforts, such as automatically emailing users when their prints were marked as completed to let them know they could come pick them up, connecting to a small receipt printer in the room used to label prints for pickup and organize the physical objects, collecting reference question statistics, and checking user requests to make sure that a particular patron had only one job in the queue at a time.

Figure 1. A screen capture of the queueing system for the 3D Design Studio, made using React and Electron.

But the system wasn't without its difficulties. The Electron system itself presented many of these challenges, chief among them the difficulty of updating. Each time a new build needed to be made, a long build process had to be run. The software that builds Electron apps, called Electron Forge, was constantly changing, and it was quite common for our app to no longer build just from the changes that had happened between updates to the app.
This usually meant that what should have been a small and quick update to the interface became hours of work as the build process was reestablished, dependency versions were adjusted, and conflicts were cleared. This made updating the app onerous, which resulted in fewer updates in general. And even after a build was completed, the install process meant physically loading the installer onto a flash drive and running down to the space to install it on the studio computer. Even though Electron has a Squirrel installer option that can be run on a server to provide updates centrally, from our research such a server could not easily be set up privately for software not intended to be distributed to the public. The React app itself was also a problem. In general, over the years, we've found that React just isn't a good fit for most of our software projects, as it is far too involved for the small projects we are creating, and the framework itself requires more overhead than we really need. And finally, our DigitalOcean VPS presented an ongoing problem: a change in university policy meant the payment method we were using for DigitalOcean was no longer an option. With no alternative way to pay them, we needed to move our backend to a different service. When we received confirmation in fall 2020 that our proposal to build the Prototyping Studio, a large makerspace that would absorb the 3D Design Studio and greatly expand its offerings and service model, was moving into construction, we knew this would be the time to fix many of the challenges we were facing. The new service model demanded a change in how we handle patron requests.

Fixing our mistakes

The first major challenge we wanted to take on was the change to the service model and, ultimately, our data structure.
In the 3D Design Studio's queue, the database had a single table called "queue" containing a field "type" where we stored whether a print was intended to be a resin print, a special request, or just a standard print. When we made API calls to DreamFactory, a server-side software that generates, documents, and manages REST APIs based on a database's structure, we would filter on this field in order to generate the individual queues displayed to our student workers. That worked fine for a service offering only 3D printers, but with the Prototyping Studio, in addition to all of those queue types, we would also have people coming in to use our CNC machines, laser cutter, vacuum former, pick-and-place machine, etc. Additionally, patrons might come in to use multiple machines, or have multiple jobs on a single machine, all contributing to a single project. We wanted to capture how these projects come together and the various tools needed to finish them. We changed the primary table in our database from "queue" to "project" and decided that all entries in the system would be projects, and every project would hold the machine jobs needed to complete it. This meant our database now had many tables, starting with the "project" table and adding a table for each type of queued machine, including "cnc", "laser", "resin", etc. The database also contains many-to-many join tables for each machine, allowing each project to contain multiple entries for a machine job, and multiple types of machine job entries; a patron can now have a single project with entries for a laser cutting job, two resin 3D printing jobs, and a CNC machine job.

Figure 2. The main screen of the new Prototyping Studio queue.

While we kept the individual tabbed queues from the old system, each one now filters jobs out of a single "project" API call to our DreamFactory instance.
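To illustrate the idea of deriving per-machine queues from a single project-centric API response, here is a hypothetical sketch; the field names and record shapes are our own illustration, not the actual DreamFactory schema:

```python
def build_queues(projects, machine_types):
    """Group the machine jobs nested inside project records into
    per-machine queues, tagging each job with its parent project."""
    queues = {machine: [] for machine in machine_types}
    for project in projects:
        for machine in machine_types:
            for job in project.get(machine, []):
                queues[machine].append({"project": project["name"], **job})
    return queues

# A hypothetical response from a single "project" API call.
projects = [
    {"name": "Drone frame",
     "cnc": [{"material": "aluminum"}],
     "resin": [{"printer": "A"}, {"printer": "B"}]},
    {"name": "Sign",
     "laser": [{"material": "acrylic"}]},
]
queues = build_queues(projects, ["cnc", "laser", "resin"])
print(len(queues["resin"]))  # 2
```

The point of the design is visible here: one request returns whole projects, and each tabbed queue is just a different grouping of the same data.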
But in order to accommodate this, our form for adding projects needed to grow in complexity over the old system. Since a project can contain any number of job types and individual jobs, our form now has checkboxes for selecting the queues needed for the project and templated sub-forms that our student workers can add to any project.

Figure 3. A sample of the form used to gather project information into the queue.

With a new data structure in mind, we tackled the problem of our VPS, which we solved by moving our DreamFactory instance out of the cloud and back on-premises, into a VM owned and secured by the University Libraries. By doing this, we were also able to add security by limiting access to the DreamFactory instance to Virginia Tech's internal network, drastically reducing the number of potential attack vectors on the API server. After the backend was sorted, the question became how to handle front-end concerns for the application. We wanted to avoid using Electron for the reasons listed above. Tauri, a Rust-based alternative to Electron, and Proton Native were both considered as well, but Tauri was still in early beta and both seemed to present concerns similar to Electron's. We stepped back and thought about the need in more detail. The app needed to be present on three computers, all within the physical building of the library; all the data would come from API calls; the ancillary functions, like sending emails and printing receipts, were all accomplished via API calls to various endpoints as well; even saving files, an option present in the previous version of the queue but underutilized, was handled via API calls to the DreamFactory server. We didn't need our students to log in using any federated system, because we can find out which student updated the software based on update timestamps and work schedules if need be, and we neither needed nor wanted the public to be able to access the app.
at the core of it, the app didn’t need to do anything outside the capabilities of a simple web application, and i had been playing around more and more with local web apps that run in the browser from the computer’s file system instead of via a server as a means of deploying one-off apps that needed to be on kiosks or other controlled points with no access from the greater internet. this sort of deployment had worked well for the patron satisfaction kiosks we have spread throughout the various studios in newman library, and it seemed like a viable option for bypassing some of our greatest hiccups with the previous version of the queue.

the next hurdle was the choice of front-end tooling. i had moved away from using react months prior for new projects, given that it was simply far more infrastructure than was needed and often made our projects less maintainable, since not many people in our library were working within the react ecosystem. since moving away from react, i had been forgoing a front-end framework altogether for new projects and coding in vanilla javascript, but i’m a proponent of using the right tool for the job and no more. the prototyping studio queue was going to be far more complicated than a feedback kiosk or a dynamic upcoming events page for digital signage, and the nature of the project, handling large amounts of data that would need to be loaded, displayed, sorted, and filtered frequently and in a user-friendly manner, could actually benefit from some of the features many frameworks offer out-of-the-box, such as two-way data binding, templating, and component-based layouts. in the end, i decided to go with svelte; i had used it on a couple of previous projects, and i really appreciated the way it implements data-binding and how little overhead it adds, leaving a project that doesn’t require much additional knowledge beyond standard html/css/js.
its handling of app state via the built-in stores implementation was worlds ahead of the react + redux solutions i had needed to use previously in terms of simplicity, and its build and bundle up-front strategy for websites was a benefit for a locally deployed site, since websites loaded using the filesystem instead of a server have some additional hurdles, particularly related to fetching additional files, which are treated as having opaque origins by modern browsers regardless of actual location. the other reason svelte was chosen was that it had a good community that had already built some component libraries for the framework based on various popular design languages, like material design by google or carbon by ibm. i often look to component libraries when developing larger projects, as they can take care of much of the implementation of common web elements like forms with validation, date/time calendar pop-ups, tooltips, etc. and in general take the process of design off the shoulders of the person doing the coding and place it on someone with more explicit training in design, which usually results in a more user-friendly experience that follows best-practices in the field of ux. it also frees up more of the developer’s time to work on the unique components that will be required for their particular app, which is always a bonus. looking through the potential options for a component library, we eventually landed on using the svelte implementation of carbon by ibm for two main reasons. the first reason was that it had a fully-featured data table component, which is the main focal point of a queueing app. their implementation included built-in search components, sortable headers, customizable cell views, collapsible sub-rows, and many other small quality-of-life touches that can make a big difference in an app where a user is primarily interacting with data tables. 
the second reason was that, out of the options available, carbon was closest to virginia tech’s branding style, meaning with only a few tweaks to the css i could get the components to meet branding guidelines for the university. i would note that this application, given its niche use and the lack of any public-facing distribution, didn’t actually have to meet our university’s branding guidelines. i always try to meet them regardless because i find that 1) it is good practice to be in the habit of always trying to meet your university or organization’s defined style, 2) in a component-based development process, the things you build can be reused elsewhere in the future, and 3) it lends legitimacy to the things you build, ensuring your student workers and any patrons who see the application view it as a cohesive part of the ecosystem established by your organization and not some random or potentially sketchy software.

the structure and maintainability of the app

a number of features were implemented in the development of this app from the beginning with the goal of making it a simple task to update. the first was the modularization of the major functions.
all the named functions within the app are contained in their own files, each named after its function, in a folder called “lib,” and the functions are exposed to the rest of the app via an index file in the folder that imports and exports each function:

```javascript
export { searchTable } from "./searchTable.js";
export { getData } from "./getData.js";
export { resetSchema } from "./resetSchema.js";
export { sendReceipt } from "./sendReceipt.js";
export { sendEmail } from "./sendEmail.js";
export { filterQueues } from "./filterQueues.js";
export { editEntry } from "./editEntry.js";
export { buildSubRows } from "./buildSubRows.js";
export { saveEntry } from "./saveEntry.js";
export { getArchive } from "./getArchive.js";
export { addJob } from "./addJob.js";
export { makeActive } from "./makeActive.js";
export { closeProject } from "./closeProject.js";
export { checkProjectStatus } from "./checkProjectStatus.js";
```

figure 4. code snippet from index.js that imports and exports all of the functions in the lib folder

this sort of modularization is common in many programming ecosystems and may not seem like much, but i strongly recommend the practice to anyone building apps, especially in javascript. it allows for easy debugging of your functions and makes them reusable not only within the app but within other projects as well. i’ve written a filtering function for data tables that has made its way into half a dozen different projects, which was made possible by designing all of the functions to be modular.

the second concern was the nature of the space this queuing app will serve. given our previous experience with the 3d design studio, we know to expect this service to evolve, which almost assuredly means both new machines in need of their own queue becoming part of the service and changes to the nature of the forms and the information we need to run a job on a given machine.
in the past, making substantial changes to the forms or adding a new queue meant a lot of editing code and customizing the solution for that particular scenario, which was something i wanted to avoid moving forward. this app contains a config.js file that exports a json object to the app with definitions for all of the queues in an array of objects:

```javascript
queues: [
  {
    label: "projects",
    machine_id: false,
    headers: [
      { key: "project_name", value: "name" },
      { key: "email", value: "email" },
      { key: "user_name", value: "user" },
      { key: "complexity", value: "complexity" },
      { key: "timestamp_created", value: "date submitted" },
      { key: "timestamp_updated", value: "last updated" },
      { key: "machines", value: "machines involved" },
    ],
  },
  {
    label: "extrusion printing",
    machine_id: "extrusion",
    form: {
      active: "extrusion_jobs",
      definition: extrusionDefinition,
      buttonText: "extrusion print",
    },
    headers: [
      { key: "user_name", value: "user" },
      { key: "email", value: "email" },
      { key: "activeExtrusion.currently_on", value: "currently on" },
      { key: "activeExtrusion.timestamp_created", value: "date submitted" },
      { key: "activeExtrusion.filename", value: "filename" },
      { key: "activeExtrusion.comments", value: "comments/notes" },
      { key: "activeExtrusion.printer_size", value: "print size" },
      { key: "activeExtrusion.material", value: "material" },
      { key: "activeExtrusion.print_weight", value: "estimated filament" },
      { key: "activeExtrusion.print_time", value: "estimated time" },
    ],
  },
```

figure 5. snippet of code from config.js illustrating the structure of the array of objects that define two example queues

instead of coding a ui for each queue, the system loads this array and loops through it, building the queue ui tabs based on the data provided. this allows for the creation of a new queue by simply adding a definition to this config file instead of coding an interface for it.
```svelte
<Tabs>
  {#each config.queues as tab}
    <Tab key={tab.label} label={tab.label} />
  {/each}
  <Tab key="archive" label="archive" />
  <main slot="content">
    {#each config.queues as tab}
      <TabContent key={tab.label}>
        {#await $allData}
          <Loading />
        {:then}
          {#key $allData}
            <Table
              fullData={filterQueues($allData, tab.machine_id)}
              headers={tab.headers}
              title={tab.label}
              id={tab.machine_id}
            />
          {/key}
        {/await}
      </TabContent>
    {/each}
    <Archive />
  </main>
</Tabs>
```

figure 6. snippet of code from tabbar.svelte that builds the queues

similarly, the project contains a folder called forms, which contains numerous definition files, one for each form. these definition files, too, export a json object that contains an array of objects with the required data to build the form, including the type of form question, be it text input, number input, dropdowns, etc., the data the form answer should be bound to, and any restrictions on the input:

```javascript
export default [
  {
    type: "text",
    id: "filename",
    label: "filename",
    placeholder: "should follow naming conventions",
    bind: "filename",
  },
  {
    type: "select",
    id: "status",
    label: "status",
    bind: "status",
    email: true,
    options: [
      { value: "in queue", text: "in queue" },
      { value: "completed successfully", text: "completed successfully" },
      { value: "failed", text: "failed" },
      { value: "remove", text: "remove" },
    ],
  },
  {
    type: "select",
    id: "currently_on",
    label: "currently on",
    bind: "currently_on",
    receipt: true,
    options: [
      { value: "not manufacturing", text: "not manufacturing" },
      { value: "lazer face", text: "lazer face" },
    ],
  },
  {
    type: "select",
    id: "source",
    label: "file source",
    bind: "source",
    options: [
      { value: "downloaded it", text: "downloaded it" },
      { value: "made it myself", text: "made it myself" },
      { value: "edited a download", text: "edited a download" },
    ],
  },
  {
    type: "select",
    id: "job_type",
    label: "job type",
    bind: "job_type",
    options: [
      { value: "cut", text: "cut" },
      { value: "engrave", text: "engrave" },
      { value: "both", text: "both" },
    ],
  },
  {
    type: "select",
    id: "material",
    label: "material",
    bind: "material",
    options: [
      { value: "wood", text: "wood" },
      { value: "acrylic", text: "acrylic" },
      { value: "hardboard", text: "hardboard" },
      { value: "other/patron provided", text: "other/patron provided" },
    ],
  },
  {
    type: "number",
    id: "length",
    label: "length of material",
    invalidText: "this won't fit on the machine.",
    helperText: "in inches",
    bind: "length",
    min: 1,
    max: 36,
  },
  {
    type: "number",
    id: "width",
    label: "width of material",
    invalidText: "this won't fit on the machine.",
    helperText: "in inches",
    bind: "width",
    min: 1,
    max: 24,
  },
  {
    type: "textarea",
    id: "comments",
    label: "comments/notes",
    placeholder: "enter comments here...",
    bind: "comments",
  },
];
```

figure 7. code from laserDefinition.js that defines the form for our laser cutter

the folder also contains formtemplate.svelte, which takes the data provided in the array and loops over it in order to build each form. this approach means that adding a whole new form means only adding the definition file, and updating an existing form is as simple as changing the values in a json file instead of re-coding an interface. it also means bugs, like one we encountered where dropdowns weren’t resetting after an entry was saved, need only be fixed in a single place for all forms.

```svelte
<div
  class="queue-container"
  style={$currentProject[active] ? "" : "display:none;"}
>
  {#each formDefinition as question}
    <div class="question-container">
      <FormGroup>
        {#if question.type === "text"}
          <TextInput
            id={question.id}
            labelText={question.label}
            placeholder={question.placeholder}
            bind:value={$data[index][question.bind]}
            required={true}
          />
        {:else if question.type === "select"}
          <Select
            id={question.id}
            labelText={question.label}
            bind:selected={$data[index][question.bind]}
            on:input={(e) => {
              question.receipt
                ? receiptCheck(e, index)
                : question.email
                ? emailCheck(e, index)
                : null;
            }}
          >
            {#each question.options as option}
              <SelectItem value={option.value} text={option.text} />
            {/each}
          </Select>
        {:else if question.type === "number"}
          <NumberInput
            min={question.min}
            bind:value={$data[index][question.bind]}
            max={question.max}
            invalidText={question.invalidText}
            helperText={question.helperText}
            label={question.label}
          />
        {:else if question.type === "textarea"}
          <TextArea
            id={question.id}
            labelText={question.label}
            bind:value={$data[index][question.bind]}
            placeholder={question.placeholder}
          />
        {/if}
      </FormGroup>
    </div>
  {/each}
</div>
```

figure 8. code snippet from formtemplate.svelte illustrating how the form is generated using the definition file

one of the other major concerns to arise concerning maintainability surrounded the global state of the app. using svelte already solved many of the headaches related to store-based global state, primarily because it doesn’t require writing additional control functions for saving and retrieving data, and it provides some nice shortcuts in the form of the $ operator before a variable, which allows you to access the store value instead of having to first assign it to a local variable. but the nature of this queue, where every project is potentially composed of multiple sub-queues, and those sub-queues may have multiple entries in them, meant that reconciling this complex series of forms prior to saving updates or creating new entries in the database required multiple custom functions. however, we were able to replace these functions with svelte’s derived store functionality, drastically improving the readability of the code and removing that logic from the inner workings of the app, placing it in the stores.js file where it can be easily found.
```javascript
export const finalProject = derived(
  [
    currentProject,
    currentExtrusion,
    currentResin,
    currentLaser,
    currentMetal,
    currentMill,
    currentPcb,
  ],
  ([
    $currentProject,
    $currentExtrusion,
    $currentResin,
    $currentLaser,
    $currentMetal,
    $currentMill,
    $currentPcb,
  ]) => {
    let project = { ...$currentProject };
    delete project.cells;
    delete project.timestamp_created;
    delete project.timestamp_updated;
    project.extrusion_jobs
      ? ((project.extrusion = $currentExtrusion), delete project.extrusion_jobs)
      : (delete project.extrusion_jobs, delete project.extrusion);
    project.resin_jobs
      ? ((project.resin = $currentResin), delete project.resin_jobs)
      : (delete project.resin, delete project.resin_jobs);
    project.laser_jobs
      ? ((project.laser = $currentLaser), delete project.laser_jobs)
      : (delete project.laser, delete project.laser_jobs);
    project.metal_jobs
      ? ((project.metal = $currentMetal), delete project.metal_jobs)
      : (delete project.metal, delete project.metal_jobs);
    project.mill_jobs
      ? ((project.mill = $currentMill), delete project.mill_jobs)
      : (delete project.mill, delete project.mill_jobs);
    project.pcb_jobs
      ? ((project.pcb = $currentPcb), delete project.pcb_jobs)
      : (delete project.pcb, delete project.pcb_jobs);
    return project;
  }
);
```

figure 9. code snippet from stores.js of the derived store that contains all queues

app deployment

with the app itself working and ready for use, the question of how to improve the deployment process over the old queue became front and center. three major concerns stood out, the first being the need for the process to be as easy as possible. apps that are easy to update are updated more frequently; that’s just a fact of the development world, and it’s a truth i didn’t want to fight against. the second concern was removing the need for manual deployment to the physical computers where the app would live.
carrying flash drives around the building, interrupting service points, and running from floor to floor to each computer all amounted to an annoyance that had grown over time. and finally, the third concern was the need for a development version of the application to be available as well. in the past, we’ve had to train student workers using the production queue since it was the only version that our manager had access to, and training using production data is never a good idea.

eventually it became clear that git was the solution that would address the first two concerns. git was already part of the workflow for development of the app, and we had used git pulls as a way of deploying other apps, such as the aforementioned feedback kiosks, so it seemed like a reliable solution here as well. we also decided to simply bundle the development version of the app in with the deployment of the production app and differentiate the versions in both name and visual style.

figure 10. the header for the development version of the app

figure 11. the header for the production version of the app

the eventual workflow began to look like this: the developer makes an update to the app and, once it is tested and ready for production, uses a custom npm script command called “build-all” that builds two versions of the app, one that is passed a “dev” variable and another that is passed the “prod” variable. these variables are used in the build process to decide which api endpoints and keys to use and which template to use to style the app header. each separate build, after completing, is copied by the script into its own folder, one called “production” and the other “training.” those two folders are part of their own git repository, separate from the repo that contains the source code, and after the build script completes, the repo is committed and pushed to virginia tech’s install of gitlab.
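a hypothetical sketch of what such a “build-all” arrangement could look like in package.json; the script names, bundler invocation (rollup was svelte’s default at the time), environment variable, and folder paths here are all assumptions, not the project’s actual configuration:

```json
{
  "scripts": {
    "build:prod": "rollup -c --environment TARGET:prod && cp -r public deploy/production",
    "build:dev": "rollup -c --environment TARGET:dev && cp -r public deploy/training",
    "build-all": "npm run build:prod && npm run build:dev"
  }
}
```

rollup’s `--environment` flag exposes the value to rollup.config.js via `process.env.TARGET`, which is one way the build could select api endpoints and the header template.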
meanwhile, each computer in the building that should have access to the queue, which includes the manager’s computer, the service point computer, and the developer’s, goes through a one-time setup. on each computer, a windows batch script is placed in the windows documents folder. this script has the information needed to pull the repository down from gitlab and place it in the current directory. the script is then added to the windows task scheduler to run every night at 2:00am, when it will do a “git pull” and get any updated code from the repository. finally, two shortcuts are placed on the desktop of the computer, one called “prototyping queue” and the other “training queue,” each of which points back to the index.html file in its respective folder. when clicked, the app is opened in the default browser, and the user can go about managing queue functions as normal.

this structure means that updates are as simple as running a single command to build the app and then committing it to the repository. from there, the app will automatically be updated at 2:00am, or in the case that something isn’t working quite right or the update needs to happen immediately, the user on the computer can simply open their documents folder and double-click the windows batch file, which will immediately trigger a “git pull” against the repository. using this deployment structure, we’ve been able to manage the app in a much more timely manner, including identifying bugs and then fixing and deploying those changes in a matter of minutes as opposed to the far more involved development pipeline that existed for the previous application.
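the one-time setup described above could look something like the following batch script; the repository url, file names, and task name are illustrative placeholders, not the actual virginia tech configuration:

```bat
@echo off
rem hypothetical update-queue.bat: clone the deploy repo on first run,
rem pull updates on every run after that
cd /d "%USERPROFILE%\Documents"
if exist queue-deploy (
    cd queue-deploy
    git pull
) else (
    git clone https://gitlab.example.edu/studios/queue-deploy.git
)

rem one-time registration with task scheduler for the nightly 2:00am run:
rem schtasks /create /tn "queue update" /tr "%USERPROFILE%\Documents\update-queue.bat" /sc daily /st 02:00
```

the desktop shortcuts then simply point at queue-deploy\production\index.html and queue-deploy\training\index.html.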
looking ahead

the changes to our queueing system put into place for the new prototyping studio have proven effective and have saved a great deal of developer time and effort, and the overall gains that could accumulate over the lifetime of the application speak to the benefits of this simple deployment structure for applications that don’t require public-facing servers. the biggest challenge another organization might face in producing similar applications is likely the dreamfactory server space for api handling. we are lucky enough to have infrastructure in place to fill that gap in the workflow, but that may not be the case for everyone. in those situations, a local sqlite3 database bundled with the app might be a usable stand-in. it is an alternative i have been considering more frequently as of late, but having a readily available and self-generating api on top of said database is a problem i have not investigated fully. i feel confident that a solution could be put together using deno and its ability to compile scripts to binary executables, but it is possible that an even simpler solution exists as well.

as for our own prototyping queue, the issue tracker has no current outstanding bugs reported from our student workers, mostly due to the ability to quickly make updates and deploy them, but we do have some feature enhancements we are looking to implement in the near future. the one currently at the top of our priority list is to enable file uploads from our student workers, which will allow for better tracking of the files needed to complete a project and potentially allow for the collection of “evidence” in the situation in which a patron wishes to add their project to their eportfolio and needs photos of the prototyping process or other materials related to their work to upload later.
the other major enhancement we are currently strongly considering is the addition of a data visualization tab that would allow for a quick glance at some anonymous real-time statistics about the various projects in the queue. both of these, however, are luxuries, not requirements, that we are able to pursue now that we have a queue redesigned with a focus on simplicity and maintainability instead of just having something that works.

about the author

jonathan bradley is the assistant director for learning environments and innovative technologies at the university libraries at virginia tech. in this position, he oversees the libraries’ six studio spaces, which focus on technologies ranging from virtual reality to media production to maker tools. he also does web development for the studios, creating various tools to gather feedback from patrons and improve the service point experience. he earned his doctorate from middle tennessee state university in 2013.

issn 1940-5758. this work is licensed under a creative commons attribution 3.0 united states license.

the code4lib journal – revamping metadata maker for ‘linked data editor’: thinking out loud

issue 55, 2023-1-20

with the development of linked data technologies and the launch of the bibliographic framework initiative (bibframe), the library community has conducted several experiments to design and build linked data editors.
while efforts have been made to create original linked data ‘records’ from scratch, less attention has been given to copy cataloging workflows in a linked data environment. developed and released as an open-source application in 2015, metadata maker is a cataloging creation tool that allows users to create bibliographic metadata without previous knowledge of cataloging. metadata maker might have the potential to be adopted by paraprofessional catalogers in practice with new linked data sources added, including auto suggestion of virtual international authority file (viaf) personal names and library of congress subject heading (lcsh) recommendations based on the users’ text input. this article introduces those new features, shares the user testing results, and discusses possible future steps.

by greta heng, myung-ja han

introduction

libraries have been using machine readable cataloging (marc) as a tool to create bibliographic and authority data since the 1960s. while marc brought libraries a new way to organize information in the past, the evolving information landscape asks libraries to explore other means of information organization that can connect library collections with resources on the web. as a successor to marc, the bibliographic framework (bibframe) initiative was launched by the library of congress (lc) in 2012.[1] it is expressed in the resource description framework (rdf, a data model for structured data)[2] and based on three categories of abstraction (work, instance, item). as the library’s new entity relation data model, bibframe is grounded in linked data techniques, which allow metadata creators to build relationships with web resources by facilitating shared structured data and uniform resource identifiers (uris). many national and research libraries have been exploring the possibility of converting marc format metadata to bibframe and, even further, creating metadata as linked data using a linked data/bibframe editor.
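the three categories of abstraction can be illustrated with a minimal, hypothetical bibframe description in rdf turtle; the example.org uris and the title value are placeholders, while the bf: classes and properties come from the bibframe 2.0 ontology:

```turtle
@prefix bf: <http://id.loc.gov/ontologies/bibframe/> .
@prefix ex: <http://example.org/> .

# the work: the conceptual essence of the resource
ex:work1 a bf:Work ;
    bf:title [ a bf:Title ; bf:mainTitle "an example title" ] .

# the instance: a material embodiment of the work (e.g., a print edition)
ex:instance1 a bf:Instance ;
    bf:instanceOf ex:work1 .

# the item: an actual copy held by a library
ex:item1 a bf:Item ;
    bf:itemOf ex:instance1 .
```

linking through uris rather than repeating text strings is what lets descriptions like this connect to shared entities on the web.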
libraries such as the swedish national library,[3] the french national library,[4] the german national library (dnb),[5] and the library of congress[6] have been involved in marc to linked data conversion, linked data based new discovery services, and linked data editor experiments. in addition, some external linked data management platforms are gaining popularity among glam (galleries, libraries, archives, and museums) institutions. wikidata,[7] an open, collaborative, and multilingual global linked data repository, is being used by libraries as an alternate source of name and subject data for bibliographic description. however, since wikidata is designed to represent all domains of knowledge and is not specific to library use, concerns about its capacity and suitability for describing library resources have been raised by the wikidata community.[8] while there has been much discussion on and development of tools for creating full original linked data, less attention has been given to copy cataloging workflows (creating new short records by deriving from other records or creating minimum records) in linked data environments. developed and released as an open-source application in 2015, metadata maker[9] is a metadata creation tool that can be used by anyone regardless of their cataloging experience and knowledge, allowing them to create a minimum level catalog record. metadata maker has been updated in several areas since then, including supporting different formats of resources (currently in ten modules) and a bibframe output service for monographs.[10] as more and more cataloging and metadata creation work relies on paraprofessional catalogers or non-catalogers[11] with language or subject expertise, the authors tried to revamp metadata maker with linked data authority services to test whether this tool and the updated functions can facilitate minimum record creation in a linked data cataloging environment.
this paper shares the revamping process and issues found in linked data sources and their services, and discusses the user testing results of metadata maker and a bibframe editor.

changing landscape

the development of linked data technologies brought about a systematic change in libraries’ cataloging production practice. as van der werf said, “libraries used to be knowledge organizations and library professionals were trained in bibliographic description and authority control. now, authorities are called entities and the new description logic is about creating a ‘knowledge graph of entities.’”[12] it is noticeable that the focus of metadata creation has gradually shifted from the curation of text strings to the management of entities (works, persons, corporate bodies, places, events, etc.), i.e., linking resources using uris and managing uris instead of name strings.[13] this revolution has triggered a discussion on linked data cataloging models, standards, and tools in the library.

changing library cataloging production practice

libraries have carried out several initiatives to re-design cataloging workflows and devise transition plans from traditional cataloging to linked data cataloging, for example, the development of marc to bibframe conversion tools and bibframe editors. notably, the linked data for libraries (ld4l)[14] community made a series of significant efforts on linked data cataloging from 2014 to 2022, including linked data for libraries labs (ld4l labs),[15] linked data for production (ld4p),[16] linked data for production: pathway to implementation (ld4p2),[17] and linked data for production: closing the loop (ld4p3).[18] despite these new linked data cataloging tools, catalogers need to be versed in new linked data related knowledge and exercise new skills, such as rdf, sparql, the bibframe ontology, and more, to create library data as linked data.
in addition, as linked data implementations in libraries are still under development, it is hard to keep up to date with the most current linked data application developments, e.g., bibframe editors, and it is challenging to identify the types of skills that catalogers need to be developing. as a result, catalogers may feel overwhelmed by the new linked data technology, and administrators are experiencing challenges in designing and providing training for the ever-growing skill set and emerging linked data tools for catalogers.[19]

the shifting roles of librarians and staff in technical services are an additional challenge in linked data training and planning. libraries used to depend on professional cataloging librarians to do original cataloging, while copy cataloging was usually performed by paraprofessional catalogers. however, this is no longer true. with shrinking budgets, organizational restructuring, and changes in cataloging software and workflows, more paraprofessional staff are responsible for both original and copy cataloging tasks (el-sherbini & klim, 1997; zhu, 2012).[20] as van der werf articulated, the number of professional librarians is decreasing while the number of paraprofessional staff is increasing in cataloging departments.[21] in fact, not only is the number of professional librarians decreasing, but the whole cataloging team is also shrinking. while there are several options that can ease the staffing shortage, such as outsourcing to vendors, cooperative cataloging programs, and more productive cataloging workflows, libraries still lack staff with the expertise to catalog special collections and/or foreign language materials. the need for foreign language and special collection cataloging will not go away in a linked data environment, as libraries keep purchasing resources from foreign countries and work with perpetual backlogs.
bibframe editors and copy cataloging

currently, there are three bibframe editors that are widely known and used: lc’s bibframe editor,[22] marva,[23] and ld4p’s sinopia.[24] all three editors seem to target experienced catalogers as their user group, not paraprofessional catalogers or non-catalogers. for one, they use resource description and access (rda) terms[25] as field names and bibframe’s three categories of abstraction, work, instance, and item, as record/data types. those cataloging terms, though commonly used by professional catalogers, may present a learning curve for paraprofessional catalogers. for example, “parallel title” is not a common phrase, and the differences between work and instance are not self-explanatory for many. for another, some abbreviations that appear in the user interface as controlled vocabularies, including getty_aat, lcgft, and gac, are not familiar to paraprofessional catalogers. using the editors and adding appropriate values to those data fields requires training on rda, the bibframe ontology, authority control, and the editors themselves at the very least.

another challenge is the lack of a clear definition of what makes full-level and brief bibframe data. the core bibframe data fields are still under discussion by the program for cooperative cataloging (pcc) bibframe interoperability group (big).[26] as there are no clear guidelines, some bibframe editors mark required fields while some do not. for catalogers or users of bibframe editors, it seems that one needs to fill out all fields to create full-level bibframe data and provide values for the required fields, if any, to generate brief bibframe data. as there is no quick way of filling out the minimum data fields to produce brief bibframe data, the cataloging workflow used in the current bibframe editors might not meet libraries’ needs for cataloging large volumes of perpetual backlogs with a shrinking cataloging team.
Lorimer (2022) stated that the notion of copy cataloging has broadened and expanded in a linked data environment,[27] which emphasizes reusing metadata rather than creating completely new metadata from scratch. Some BIBFRAME editors, like Sinopia, indeed allow catalogers to search, load, and copy or clone existing BIBFRAME data, revising and reusing those descriptions by sharing URIs. This workflow would help reduce duplicate Work-level bibliographic records and increase cataloging efficiency. Yet, considering the reality and looking into the future, libraries, short on professional catalogers and language/subject experts, will have to rely on non-catalogers and paraprofessional catalogers with limited linked data and cataloging knowledge to create records in BIBFRAME editors. Should users adapt to the BIBFRAME editors, or should the editors be designed to be more friendly to their users? This dilemma raises a question: is it possible to build a linked data editor without cataloging jargon in the application interface? Given the above-mentioned issues, this project is an attempt to build a straightforward linked data editor that does not use RDA terms, aimed at non-catalogers doing copy cataloging. Libraries may benefit from adopting Metadata Maker, as it does not require new hiring or training for catalogers and allows non-catalogers with the needed language/subject knowledge to create minimum-level cataloging records. The authors also conducted a small-scale survey to learn catalogers' opinions about Metadata Maker and linked data editors.

Revamping Metadata Maker

Metadata Maker enables any user to create catalog records that are "good enough" (providing sufficient information to identify a bibliographic item and generate a basic bibliographic description)[28] in various formats, including MARC, regardless of one's knowledge of or experience with cataloging standards, integrated library systems, or OCLC.
It now has ten different modules or templates: datasets,[29] monographs,[30] monographs (LD),[31] ebooks,[32] government documents,[33] maps,[34] microfilms,[35] scores,[36] serials,[37] and theses and dissertations.[38] Users can select a module based on the resource type, fill out basic information about the resource, and choose the download format, including MARC binary, MARCXML, Metadata Object Description Schema (MODS), HTML, and BIBFRAME.[39] For this phase, two new linked data features, Virtual International Authority File (VIAF) personal name suggestions and Library of Congress Subject Heading (LCSH) suggestions, were added to the monographs (LD) module in Metadata Maker. The new functions support search and autocompletion of personal names in VIAF, and LCSH (keyword) generation based on user-provided text. URIs of the controlled terms are added to the output metadata.

Figure 1. Metadata Maker interface screenshot.

Linked data input

VIAF name search

The VIAF personal name autocomplete dropdown list in Fig. 2 uses the VIAF AutoSuggest API[40] to retrieve the personal name's label, VIAF URI, and Library of Congress Name Authority File (LCNAF) URI. When a name is selected, the links to both URIs, if they are available in VIAF, are presented in Metadata Maker. Users have the option to verify the name entity's information on either the VIAF or LCNAF page if so desired. The application then retrieves the values of the 100 field subfields a to d from LCNAF whenever they are available. If no LCNAF URI is provided in VIAF, the preferred label from DNB[41] is the alternative option if it can be found in VIAF. The LCNAF and VIAF URIs are added to subfields 0 and 1, respectively, in the MARC and MARCXML 100 or 700 field, based on the name's role. For other supported output formats, the URIs and the label/preferred name are also inserted into the appropriate elements.
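For illustration only, the same AutoSuggest lookup can be sketched server-side in a few lines of Python. This is not part of Metadata Maker's codebase (which performs the lookup client-side in jQuery); the helper names `parse_suggestions` and `suggest` are hypothetical, while the endpoint and the result keys (`term`, `nametype`, `viafid`, `lc`) match those used in the article's own JavaScript.

```python
import json
import urllib.request
from urllib.parse import quote

VIAF_AUTOSUGGEST = "https://viaf.org/viaf/AutoSuggest?query={}"

def parse_suggestions(payload):
    """Extract personal-name suggestions (label, VIAF URI, optional LCNAF URI)
    from a VIAF AutoSuggest response, mirroring what the jQuery widget keeps."""
    names = []
    for item in payload.get("result") or []:
        if item.get("nametype") != "personal":
            continue
        names.append({
            "label": item["term"],
            "viaf_uri": "http://viaf.org/viaf/" + item["viafid"],
            # The "lc" key is only present when VIAF links the cluster to LCNAF.
            "lcnaf_uri": ("http://id.loc.gov/authorities/names/" + item["lc"])
                         if item.get("lc") else None,
        })
    return names

def suggest(term):
    """Query VIAF AutoSuggest for a name string (performs a network call)."""
    with urllib.request.urlopen(VIAF_AUTOSUGGEST.format(quote(term))) as resp:
        return parse_suggestions(json.load(resp))
```

Separating the parsing step from the network call makes the lookup easy to test against a canned response without touching the live API.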
If there is no satisfactory result in the autocomplete dropdown list, users can also manually input the name strings. The code is available online.[42]

Figure 2. VIAF AutoSuggest dropdown list.

// Use the VIAF AutoSuggest API to fetch personal names
(function ($) {
  $.widget("oclc.viafAuto", $.ui.autocomplete, {
    options: {
      select: function (event, ui) {
        alert("Selected!");
        return this._super(event, ui);
      },
      source: function (request, response) {
        var term = $.trim(request.term);
        var url = "https://viaf.org/viaf/AutoSuggest?query=" + term;
        var me = this;
        $.ajax({
          url: url,
          dataType: "jsonp",
          success: function (data) {
            if (data.result) {
              response($.map(data.result, function (item) {
                if (item.nametype == "personal") {
                  var retLbl = item.term + " [" + item.nametype + "]";
                  var uri = "http://viaf.org/viaf/" + item.viafid;
                  if (item.lc) {
                    return {
                      label: retLbl,
                      value: item.term,
                      id: item.viafid,
                      viafUri: uri,
                      lcUri: "http://id.loc.gov/authorities/names/" + item.lc,
                      nametype: item.nametype
                    };
                  } else {
                    return {
                      label: retLbl,
                      value: item.term,
                      id: item.viafid,
                      viafUri: uri,
                      lcUri: "NoLC",
                      nametype: item.nametype
                    };
                  }
                }
              }));
            } else {
              me._trigger("nomatch", null, { term: term });
            }
          }
        });
      }
    },
    _create: function () {
      return this._super();
    },
    _setOption: function (key, value) {
      this._super(key, value);
    },
    _setOptions: function (options) {
      this._super(options);
    }
  });
})(jQuery);

// Get information for the user-selected name in the author input field
$(function () {
  $(".author").viafAuto({
    select: function (event, ui) { var item = ui.item; }
  });
});

LCSH and FAST suggest

The second function that was added to Metadata Maker is the LCSH suggestion using the Annif API.[43] Annif (http://annif.org/) is a subject suggestion tool for documents, originally developed by the National Library of Finland.[44] According to its webpage, Annif can be trained through natural language processing and machine learning algorithms to support any kind of subject headings.
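As a rough server-side sketch (the article's actual integration is client-side JavaScript), the Annif suggest endpoint can be called from Python as below. The endpoint path and the project id `upenn-omikuji-bonsai-en-gen` are taken from the article's own code and may have changed since; the helper names `top_subjects` and `suggest_lcsh` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode

ANNIF_SUGGEST = "http://annif.info/v1/projects/{}/suggest"

def top_subjects(response, limit=10, min_score=0.0):
    """Sort Annif suggestions by predicted score (highest first) and
    return (label, URI) pairs, optionally filtering by a score floor."""
    results = [r for r in response.get("results", [])
               if r.get("score", 0) >= min_score]
    results.sort(key=lambda r: r["score"], reverse=True)
    return [(r["label"], r["uri"]) for r in results[:limit]]

def suggest_lcsh(summary, project="upenn-omikuji-bonsai-en-gen"):
    """POST a free-text summary to Annif and return suggested LCSH terms
    (performs a network call)."""
    data = urlencode({"text": summary}).encode()
    req = urllib.request.Request(
        ANNIF_SUGGEST.format(project), data=data,
        headers={"Content-Type": "application/x-www-form-urlencoded",
                 "Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return top_subjects(json.load(resp))
```

As with the VIAF sketch, keeping the ranking logic in a pure function means the score-ordering behavior can be verified without a live Annif instance.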
To make Annif support LCSH, the LD4P group used Annif's built-in algorithms and training corpora from the IvyPlus Platform for Open Data (POD)[45] and Share-VDE (Virtual Discovery Environment)[46] to train Annif (Hahn, 2022;[47] Khan, 2020[48]).[49] Upon request, the Annif LCSH API returns a list of suggested LCSH labels, URIs, and predicted scores. The list is sorted by predicted score from high to low: the higher the score, the more relevant the subject heading is.

// Annif LCSH API response
[
  { "label": "Clothing and dress--China--History", "notation": null, "score": 0.06058865785598755, "uri": "http://id.loc.gov/authorities/subjects/sh2003012066" },
  { "label": "Costume--China", "notation": null, "score": 0.014286939986050129, "uri": "http://id.loc.gov/authorities/subjects/sh85033251" },
  { "label": "Costume--China--History", "notation": null, "score": 0.014127381145954132, "uri": "http://id.loc.gov/authorities/subjects/sh85033252" },
  { "label": "Clothing and dress--History", "notation": null, "score": 0.011828765273094177, "uri": "http://id.loc.gov/authorities/subjects/sh2003012061" },
  { "label": "Clothing and dress--Social aspects", "notation": null, "score": 0.008354970254004002, "uri": "http://id.loc.gov/authorities/subjects/sh85027167" },
  { "label": "Fashion--History", "notation": null, "score": 0.008040583692491055, "uri": "http://id.loc.gov/authorities/subjects/sh2008103592" },
  { "label": "Fashion--History--20th century", "notation": null, "score": 0.007795797660946846, "uri": "http://id.loc.gov/authorities/subjects/sh2008103594" },
  { "label": "Chinese poetry--Translations into English", "notation": null, "score": 0.007471516728401184, "uri": "http://id.loc.gov/authorities/subjects/sh2008100615" },
  { "label": "Medicine, Chinese", "notation": null, "score": 0.0065437802113592625, "uri": "http://id.loc.gov/authorities/subjects/sh85083125" },
  { "label": "Clothing and dress in literature", "notation": null, "score": 0.005863940808922052, "uri": "http://id.loc.gov/authorities/subjects/sh85033275" }
]

Using the Annif LCSH API, Metadata Maker can recommend ten LCSH terms given a book summary in any Romance language. Users can select from zero to ten LCSH terms by checking the provided checkboxes. It is also possible to re-run the suggestion function by updating the summary in the input box and clicking the suggest button. If a user is not satisfied with the recommended keywords or is uncomfortable using LCSH, they can still use an autocomplete Faceted Application of Subject Terminology (FAST) heading search box to add keywords.

Figure 3. Keyword (summary suggest and keyword search box) screenshot.

// If a user clicks the #lcshSuggest button, LCSH suggestions based on the user's
// text input in the #summary box are generated in the #lcshResponse div container
$(function () {
  document.getElementById("lcshSuggest").onclick = function () {
    document.getElementById("lcshResponse").innerHTML = "";
    var summary = document.getElementById("summary").value;
    if (summary != null) {
      var requests = "text=" + summary;
      var url = "http://annif.info/v1/projects/upenn-omikuji-bonsai-en-gen/suggest";
      var xhr = new XMLHttpRequest();
      xhr.open("POST", url, false);
      xhr.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
      xhr.setRequestHeader("Accept", "application/json");
      xhr.onreadystatechange = function () {
        if (xhr.readyState === 4) {
          var data = xhr.responseText;
          var jsonResponse = JSON.parse(data);
          console.log(jsonResponse);
          if (jsonResponse["results"] && jsonResponse["results"].length) {
            for (var i = 0; i < jsonResponse["results"].length; i++) {
              var lcshLabel = jsonResponse["results"][i]["label"];
              var lcshUrl = jsonResponse["results"][i]["uri"];
              document.getElementById("lcshResponse").innerHTML +=
                '<input type="checkbox" name="lcsh" class="lcshCheckbox" uri="' +
                lcshUrl + '" value="' + lcshLabel + '">' + lcshLabel + "<br>";
            }
          }
        }
      };
      xhr.send(requests);
    }
  };
});

BIBFRAME output

With recent updates, the BIBFRAME output data now includes URIs of personal names,
LCSH, and FAST headings in the monographs (LD) module. Below is an example of a bf:contribution. The LCNAF URI of "Shakespeare" is added to the agent node. Both the VIAF and LCNAF URIs of "Shakespeare" are added as the values of identifiers.

<!-- Example of a bf:contribution -->
<bf:contribution>
  <bf:Contribution>
    <bf:role>
      <bf:Role rdf:about="http://id.loc.gov/vocabulary/relators/aut"/>
    </bf:role>
    <bf:agent>
      <bf:Agent rdf:about="http://id.loc.gov/authorities/names/n78095332">
        <rdf:type rdf:resource="http://id.loc.gov/ontologies/bibframe/Person"/>
        <rdfs:label>Shakespeare, William, 1564-1616</rdfs:label>
        <bf:identifiedBy>
          <bf:Identifier>
            <rdf:value rdf:resource="http://viaf.org/viaf/96994048"/>
          </bf:Identifier>
        </bf:identifiedBy>
        <bf:identifiedBy>
          <bf:Identifier>
            <rdf:value rdf:resource="http://id.loc.gov/authorities/names/n78095332"/>
          </bf:Identifier>
        </bf:identifiedBy>
      </bf:Agent>
    </bf:agent>
  </bf:Contribution>
</bf:contribution>

The second is an example of a bf:subject. The FAST heading URI is added to the topic node. These are represented in BIBFRAME metadata as below.

<!-- Example of a bf:subject -->
<bf:subject>
  <bf:Topic rdf:about="http://id.worldcat.org/fast/1921567">
    <rdfs:label>Comedy plays [form/genre]</rdfs:label>
    <rdf:type rdf:resource="http://www.loc.gov/mads/rdf/v1#Topic"/>
    <bf:source>
      <bf:Source rdf:about="http://id.loc.gov/vocabulary/identifiers/fast"/>
    </bf:source>
  </bf:Topic>
</bf:subject>

Some considerations

While developing the new features for Metadata Maker, the authors found some issues with the APIs and linked data sources.

Encoding

VIAF provides a single name authority file that combines name authority files from more than 40 organizations,[50] making it convenient for libraries to take advantage of linked data and obtain information about name entities from one source. Yet the aggregation process may cause encoding issues in VIAF records.
For example, when one searches for "Greta Reyghere,"[51] the name includes empty boxes in the dropdown list returned by the API. The same issue also appeared in the VIAF JSON record: the source of the name with empty boxes was DNB, according to the VIAF JSON record (see Fig. 5).[52] However, the DNB record did not have anything anomalous.[53] It seems that the empty boxes in the name label exist only in the VIAF record; the aggregation setting in VIAF might be the reason.

Figure 4. Empty boxes in VIAF.

Figure 5. Empty boxes in VIAF JSON.

Name entity search scope

When describing resources in BIBFRAME editors, cataloging experts tend to use name authority files like LCNAF. However, non-catalogers or paraprofessional catalogers may not be aware of those sources and are more likely to rely on the linked data editor itself. BIBFRAME editors should accommodate the different name entity search behaviors of experienced and nascent catalogers. Specifically, there are two expectations for the name entity search function in linked data editors: (1) no constraints on the order of a name; and (2) support for variant name searching. As many non-professional catalogers may not have received identity management (authority) training, it is not intuitive for them to search names following the MARC 100 field format of "last name, first name." It is also important to connect BIBFRAME editors to various linked data sources for name entities on the web and to collect name variants from as many sources as possible. To meet the two expectations, Metadata Maker adopts the VIAF AutoSuggest API for personal name searching. The VIAF AutoSuggest API supports both preferred name and variant name searching without any name format or name order constraints. This flexibility allows non-catalogers to find the desired personal name in different ways. One BIBFRAME editor that was tested for this project supports only preferred name label search.
A Korean author, Han, Shin-kap,[54] has the name variants "한신갑" and "Shin-kap Han" in his authority record. The BIBFRAME editor only returned a result when the term "Han, Shin-kap" was searched, as it matched the 100 field of the existing LCNAF record. The two variant names did not return any results, as the selected editor does not support variant name search. Such failed searches may drive non-catalogers to create duplicate name entity records or to use strings instead of URIs to represent the person.

Figure 6. Searching "Han, Shin-kap" in a linked data editor.

Figure 7. Searching "한신갑" in a linked data editor.

Figure 8. Searching "Shin-kap Han" in a linked data editor.

Quality of authority data

VIAF authority data provided in JSON-Linked Data (JSON-LD) format does not always have detailed and granular information. The VIAF authority cluster endpoint allows catalogers to retrieve authority data in various formats.[55] The name-related elements in the JSON-LD representation of VIAF authority records include family name, given name, alternative name, and name (full name). More complicated names may contain a title, numeration, and other information about the entity. Take "John Paul II, Pope" as an example.[56] "Pope" is the title of "John Paul II." "John Paul" is the papal name and "II" is the numeration. However, in his VIAF JSON-LD record (see below), "John Paul II" is treated as the family name and "Pope" is treated as the given name, which is not correct. While this would not be a problem for data models that do not require name part information, like BIBFRAME, it could be a problem for schemas that have fields or attributes specifically designated for name parts, e.g., first name and last name.

// JSON-LD description of John Paul II, Pope, in VIAF
{
...
"familyName": [ "janis", "john paul ii", "juan pablo ii.", "jawién", "joannes paulus ii.", "ioannis pauli ii", "yūhạnnā būlus at-tanī", "ioann pavel ii", "wojytla", "jean paul ii.", "ויטילה", "vajtyla", "wojtila", "ii", "jean paul ii", "jean-paul ii.", "voitilah", "ján pavol ii.", "jános pál ii.", "ivan pavao ii.", "yuhạnnā-būlus at-tanī", "jawieň", "ṿoiṭilah", "juan pablo ii", "vojtyla", "ivan pavlo ii.", "ян павел ii", "johannes paulus ii.", "giovanni paolo ii", "voityla", "jasien", "jasień", "yoḥanan paʾulus ha-sheni", "voitila", "xoán paulo ii", "ṿoiṭilah", "jasień", "ואיטילה", "gruda", "giovanni paolos ii.", "wojtyla", "johano paŭlo la dua", "войтыла", "jawień", "wojtiła", "johannes paul ii", "paulus", "yuḥannā-būlus at-tānī", "johannes paul ii.", "john paul ii.", "wojtyła", "보이티야", "アンジェイ", "jan paweł ii", "jean-paul ii", "yuhạnnā-būlus at-tanī", "보이티와", "janez", "jan paweł ii.", "jawien", "jan paweł", "jawień", "yūḥannā būlus at-tānī", "giovanni paolo ii", "janez pavel ii.", "ioannis paulus ii.", "yūḥannā būlus at-tānī", "vojtila", "iohannes paulus pp. ii", "yūhạnnā būlus at-tanī", "jan pavel druhý", "ioannes paulus ii.", "jānis pāvils ii.", "yuḥannā-būlus at-tānī", "joannes paulus ii" ],
"gender": "http://www.wikidata.org/entity/Q6581097",
"givenName": [ "karal'", "pape", "papież", "karol józef", "al-bābā", "stanislaw andrzej", "carlo", "karols", "karol'", "stanisław a.", "ḳarol", "pope", "папа рымскі", "кароль", "קארול", "papa", "karol joźef", "johannes", "andrzej", "pāvests", "papież", "ḳarol", "carol", "ヤヴィエニ", "karol j.", "카롤", "piotr", "saint", "lolek", "stanisław", "k.", "stanisław andrzej", "papa", "heiliger", "santo", "karolis", "karol jozef", "pavils", "pápa", "papa", "카롤 유제프", "karol józef", "karolʹ", "papst, heiliger", "papst", "al-bābā", "ii", "karol", "karel", "pavel", "pape", "john paul", "paus", "קרול" ],
...
}

Testing

After adding the VIAF API to Metadata Maker, the authors conducted small-scale, informal usability testing at the University of Illinois with eleven participants: five paraprofessional catalogers who create original cataloging records as part of their responsibilities; two hourly catalogers who did not have cataloging knowledge but did have language and subject knowledge; two graduate assistants; and two cataloging and metadata librarians. They were asked to create a record for a monograph in Sinopia and in Metadata Maker and to share their thoughts on two things: ease of use and the knowledge/skills required to use each tool. The survey also had a section where testers could add their thoughts.[57]

Ease of use

For the first question, testers could choose one answer from the following options:

Extremely hard
Hard, but can follow through it
Easy
Very easy

Figure 9. Survey result: ease of use.

Eight participants said that Metadata Maker is easy to use (five chose "very easy" and three chose "easy"), while ten people said that Sinopia is hard to use (five chose "extremely hard" and another five chose "hard, but can follow through it"). The survey reveals that the majority of participants prefer the simple interface of Metadata Maker to the relatively complex and verbose interface of Sinopia. One person said that Metadata Maker is "extremely hard" to use, and two chose "hard, but can follow through it." Those who answered that Metadata Maker is hard to use are paraprofessional catalogers who create original records in OCLC. During the follow-up interviews, they expressed that they do not like the simple interface of Metadata Maker or the notion of creating short/minimum records. They want BIBFRAME editors to be similar to OCLC Connexion, the tool they are familiar with and that allows them to create full-level cataloging records. An undergraduate student with language skills answered that Sinopia is easy to use.
The student added that while there is a lot to learn and it takes time, they can follow through Sinopia by reading the information provided for each element. While Sinopia allows users to view the output data in JSON-LD, Turtle, N-Triples, RDF table, and interface view formats, three participants commented that it is hard to check the outcome of their work in Sinopia. It might be because those participants have not learned RDF data models and linked data serialization formats. Metadata Maker, however, allows records to be downloaded and viewed locally. Those participants also added that it would be helpful to know the dataflow once a record is created in either editor.

Knowledge and skills required to use the editors

The second multiple-choice question asked participants what kinds of skills they thought were needed for the two BIBFRAME editors, such as Functional Requirements for Bibliographic Records (FRBR),[58] RDA, BIBFRAME, LCSH and other controlled vocabularies, name authority, linked data, and MARC. However, the authors quickly realized that the jargon and acronyms in this question caused misunderstandings for many participants, as they did not know some or all of the options, especially the two non-catalogers who do not have cataloging knowledge/education. The staff members who routinely create original records are also unfamiliar with FRBR, BIBFRAME, and linked data. As a result, the answers to this question were scattered, as shown below:

Table 1. Answers from 11 participants: knowledge and skills required to use the editors.
Sinopia | Metadata Maker
Unsure | None
BIBFRAME, MARC, LCSH and other controlled vocabularies, name authority, linked data | MARC, none
RDA, BIBFRAME, FRBR, LCSH and other controlled vocabularies, linked data; need an extreme understanding of FRBR terms and RDA standards just to read/understand the interface | I feel like you don't actually need to know anything about cataloging standards to use this interface
MARC, LCSH and other controlled vocabularies, name authority, linked data | MARC
MARC, LCSH and other controlled vocabularies, name authority, linked data | None
BIBFRAME | None
RDA, BIBFRAME, FRBR, linked data; I did not use it enough to know all that one needs to know, but this is meant for experienced (and very technically savvy) catalogers | None; if applicable, a non-English language
MARC, LCSH and other controlled vocabularies | I do not know?
RDA | Basic book information
Everything | Basic book information
None | None

However, one thing is clear: while many participants said there are things one must learn in order to use the BIBFRAME editor, the majority of participants said no knowledge is needed to use Metadata Maker.

Discussion and next steps

The process of revamping Metadata Maker with linked data sources and BIBFRAME output demonstrated the possibility of building a linked data editor, usable by anyone, that avoids cataloging terminology. The intuitive design, self-explanatory wording, and one-page web form lower the learning barriers of BIBFRAME cataloging and allow non-professional catalogers and language/subject experts to get involved in linked data metadata creation. As Metadata Maker is designed for generating "good enough" records, it can also serve as a quick BIBFRAME generation tool for paraprofessional catalogers. However, the authors have heard some concerns from catalogers with regard to using this tool in practice, such as an oversimplified interface and unclear dataflow.
The authors were struck by the varying degrees of acceptance of Metadata Maker among survey participants. Paraprofessional catalogers are inclined to use Connexion-like editors with the option to describe detailed information about resources, whereas nascent catalogers might be more comfortable using linked data editors that do not require such prerequisite knowledge. The developers of linked data editors will need to balance those two needs. While the library domain has made significant progress in the development of and experimentation with linked data and BIBFRAME production, there are still many things that the library community has to think through and work together on to find solutions. First, a clear dataflow needs to be established. As of now, BIBFRAME linked data created in the current BIBFRAME editors is not automatically ingested into any integrated library system.[59] This was brought up by several staff members who tested Sinopia. In addition, most vendors do not support BIBFRAME import as of this writing. The authors acknowledge that the dataflow may require a new integrated library system that can work with metadata in different formats and with a different ontology. Second, libraries may have a completely different data sharing method in the linked data environment compared with the current centralized shared database.[60] If that is the case, what would a data sharing model look like? If it is still possible to have a centralized linked data database, then who is going to manage it, and how? Third, a discussion of the distribution of work between human catalogers and machines needs to start. As machines can do MARC-to-BIBFRAME conversion and authority reconciliation work rather effectively, libraries might want to think about what machines can do and what cataloging and metadata professionals should do.
If there are tasks that machines can do better, then it would be better to leave those to the machines and to identify what cataloging and metadata professionals should focus on in terms of linked data creation and workflows. Fourth, according to Fortier, Pretty, and Scott (2022),[61] the understanding and knowledge of BIBFRAME among Canadian libraries is still low after close to two decades of ongoing discussion and development efforts. While it is important to understand the underlying structure of BIBFRAME and linked data, it would be worthwhile to think about how much training is adequate for cataloging professionals and how much integration of RDA terms into BIBFRAME editors is necessary for the transition to linked data creation. Or maybe what libraries really need is a linked data editor rather than a BIBFRAME editor. If there are problems in understanding BIBFRAME and RDA among ourselves, it would be much more difficult for users on the web to understand what kind of data we are sharing.

About the authors

Greta Heng (ORCID: 0000-0002-3606-6357) is Cataloging and Metadata Strategies Librarian at San Diego State University.

Myung-Ja (MJ) K. Han (ORCID: 0000-0001-5891-6466) is a professor and Metadata Librarian at the University of Illinois at Urbana-Champaign.

Bibliography

[1] Library of Congress. Bibliographic Framework Initiative. https://www.loc.gov/bibframe/.

[2] World Wide Web Consortium (W3C). RDF. https://www.w3.org/RDF/.

[3] Wennerlund, B., & Berggren, A. (2017). Leaving comfort behind: A national union catalogue transition to linked data. Paper presented at: IFLA WLIC 2019 – Athens, Greece – Libraries: dialogue for change, in Session S15 – Big Data. In: Data intelligence in libraries: the actual and artificial perspectives, 22-23 August 2019, Frankfurt, Germany.

[4] French National Library. Semantic web and data model. https://data.bnf.fr/en/semanticweb.

[5] German National Library. Linked Data Service.
https://www.dnb.de/en/professionell/metadatendienste/datenbezug/lds/lds_node.html.

[6] Library of Congress. Marva editor. https://bibframe.org/marva/editor/.

[7] Wikidata. Wikidata main page. https://www.wikidata.org/wiki/Wikidata:Main_Page.

[8] Godby, J., Smith-Yoshimura, K., Washburn, B., Davis, K., Detling, K., Eslao, C., Folsom, S., Li, X., McGee, M., Miller, K., Moody, H., Thomas, C., & Tomren, H. (2019). Creating library linked data with Wikibase: Lessons learned from Project Passage (p. 70). OCLC Research. https://doi.org/10.25333/faq3-ax08.

[9] Han, M. K., Ream-Sotomayor, N. E., Lampron, P., & Kudeki, D. (2016). Making Metadata Maker: A web application for metadata production. Library Resources & Technical Services, 60(2), 89–98. All the source code is available on GitHub: https://github.com/dkudeki/metadata-maker. Metadata Maker is still in the exploratory phase and currently only supports linked data cataloging for monographs.

[10] Michael, B., & Han, M. J. K. (2019). Assessing BIBFRAME 2.0: Exploratory implementation in Metadata Maker. Proceedings of the International Conference on Dublin Core and Metadata Applications, 26-31.

[11] Non-catalogers refer to people who do cataloging work but do not have adequate cataloging experience, or may not need it as they do not pursue a career in cataloging.

[12] van der Werf, T. (2021, March 4). Next generation metadata… it's getting real! Hanging Together, OCLC Research blog. https://hangingtogether.org/next-generation-metadata-it-is-getting-real/.

[13] Dalgord, C. Shared Entity Management Infrastructure project update. OCLC. https://www.loc.gov/bibframe/news/source/bibframe-from-home-oclc-update.pptx.

[14] Linked Data for Libraries. https://wiki.lyrasis.org/pages/viewpage.action?pageId=41354028.

[15] Linked Data for Libraries Labs. https://wiki.lyrasis.org/pages/viewpage.action?pageId=77447730.

[16] Linked Data for Production. https://wiki.lyrasis.org/pages/viewpage.action?pageId=74515029.
[17] Linked Data for Production: Pathway to Implementation. https://wiki.lyrasis.org/display/LD4P2.

[18] Linked Data for Production: Closing the Loop. https://wiki.lyrasis.org/display/LD4P3.

[19] Lnenicka, M., Kopackova, H., Machova, R., & Komarkova, J. (2020). Big and open linked data analytics: A study on changing roles and skills in the higher educational process. International Journal of Educational Technology in Higher Education, 17(1), 1-30.

[20] El-Sherbini, M., & Klim, G. (1997). Changes in technical services and their effect on the role of catalogers and staff education: An overview. Cataloging & Classification Quarterly, 24(1-2), 23-33; Zhu, L. (2012). The role of paraprofessionals in technical services in academic libraries. Library Resources & Technical Services, 56(3), 127-154.

[21] van der Werf, Next generation metadata… it's getting real!

[22] Library of Congress, BIBFRAME Editor. https://bibframe.org/bfe/index.html.

[23] Library of Congress, Marva. https://bibframe.org/marva/editor/.

[24] Linked Data for Production: Pathway to Implementation. Sinopia. https://sinopia.io/.

[25] RDA Toolkit: https://www.rdatoolkit.org/.

[26] BIBFRAME Interoperability Group. (2022, April 15). Terms of reference. https://www.loc.gov/aba/pcc/bibframe/taskgroups/big/big-tor.pdf.

[27] Lorimer, N. (2022, March 8). Re-use or copy? Redefining copy cataloging in a linked data environment. ALA Copy Cataloging IG, online. https://docs.google.com/presentation/d/1ukxcdjea-cwmxnfixfibdpbcvn_jxvmymoojgzibi9o/edit?usp=sharing.

[28] Library of Congress. Appendix C – Minimal level record examples. https://www.loc.gov/marc/bibliographic/bdapndxc.html.

[29] http://quest.library.illinois.edu/marcmaker/dataset/.

[30] http://quest.library.illinois.edu/marcmaker/.

[31] a.k.a. monograph (linked data), http://quest.library.illinois.edu/marcmaker/monoviaf/.

[32] http://quest.library.illinois.edu/marcmaker/ebooks/.

[33] http://quest.library.illinois.edu/marcmaker/govdocs/.
[34] http://quest.library.illinois.edu/marcmaker/maps/.

[35] http://quest.library.illinois.edu/marcmaker/microfilms/.

[36] http://quest.library.illinois.edu/marcmaker/scores/.

[37] http://quest.library.illinois.edu/marcmaker/serials/.

[38] http://quest.library.illinois.edu/marcmaker/theses/.

[39] BIBFRAME is only added to the two monograph modules for now.

[40] OCLC Developer Network. Authority Cluster resource. https://www.oclc.org/developer/api/oclc-apis/viaf/authority-cluster.en.html.

[41] DNB was selected as an alternative name label source because (1) it provides a linked data service; and (2) it is the national library of a non-native-English-speaking country, which may complement LCNAF.

[42] https://github.com/dkudeki/metadata-maker/blob/monoviaf/lcsh/lcshsearch.js.

[43] Suominen, O., Inkinen, J., Virolainen, T., Fürneisen, M., Kinoshita, B. P., Veldhoen, S., Sjöberg, M., Zumstein, P., Neatherway, R., & Lehtinen, M. (2022). Annif (version 0.60.0-dev) [Computer software]. https://doi.org/10.5281/zenodo.2578948; https://api.annif.org/v1/ui/.

[44] Annif GitHub repository. https://github.com/natlibfi/annif.

[45] IvyPlus Platform for Open Data. https://pod.stanford.edu/.

[46] Share-VDE (Virtual Discovery Environment). https://www.svde.org/.

[47] Hahn, J. (2022, June 20). Cataloger acceptance and use of semiautomated subject recommendations for web scale linked data systems. 87th IFLA World Library and Information Congress (WLIC) 2022 in Dublin, Ireland. https://repository.ifla.org/handle/123456789/1955.

[48] Khan, H. (2020, March 10). Annif use and explanation. Linked Data for Production: Pathway to Implementation. https://wiki.lyrasis.org/display/LD4P2/Annif+Use+and+Explanation.
[49] when accessed http://lcsh.annif.info/ in october 2022, annif lcsh api project updated its vocabulary sources: “ivyplus-tfidf” was changed to “penn-fasttext-en” (penn (lcsh english) conference papers and proceedings), “upenn-omikuji-bonsai-en-gen” (upenn (lcsh english) all genres), and “upenn-omikuji-bonsai-spa-gen” (upenn (lcsh spanish) all genres). [50] virtual international authority file. https://www.oclc.org/en/viaf.html. [51] viaf authority record for de reyghère, greta. retrieved on september 11, 2022, from http://viaf.org/viaf/69118441. [52] viaf authority record in json for de reyghère, greta. retrieved on september 11, 2022, from https://viaf.org/viaf/69118441/viaf.json. [53] dnb authority record for de reyghère, greta. retrieved on september 11, 2022, from https://hub.culturegraph.org/entityfacts/134496175, and https://d-nb.info/gnd/134496175. [54] viaf authority record for han, shin-kap. retrieved on september 11, 2022, from http://viaf.org/viaf/198153409742041581752. [55] oclc authority cluster resource. https://www.oclc.org/developer/api/oclc-apis/viaf/authority-cluster.en.html. retrieved on october 5, 2022. [56] viaf authority record in json-ld for john paul ii, pope. retrieved on september 20, 2022, from https://viaf.org/viaf/35605/viaf.jsonld. [57] we chose sinopia over other bibframe editors because it is created for the community and has pcc templates that have been tested out by many catalogers. we also understand that the purpose of the bibframe editor and metadata maker are different. [58] the international federation of library associations and institutions. functional requirements for bibliographic records (frbr). https://www.loc.gov/marc/bibliographic/bdapndxc.html. [59] there are some unofficial statements that folio and ex libris have been working on bibframe data import. but as of october 6, 2022, there has not been a bibframe data import function released by them. [60] library of congress. bibframe and the pcc. 
The Code4Lib Journal – Designing Digital Discovery and Access Systems for Archival Description

Issue 55, 2023-1-20

Designing Digital Discovery and Access Systems for Archival Description

Archival description is often misunderstood by librarians, administrators, and technologists in ways that have seriously hindered the development of access and discovery systems. It is not widely understood that there is currently no off-the-shelf system that provides discovery and access to digital materials using archival methods. This article is an overview of the core differences between archival and bibliographic description, and discusses how to design access systems for born-digital and digitized materials using the affordances of archival metadata. It offers a custom indexer as a working example that adds the full text of digital content to an ArcLight instance, and argues that the extensibility of archival description makes it a perfect match for automated description. Finally, it argues that building archives-first discovery systems allows us to use our descriptive labor more thoughtfully, better enable digitization on demand, and overall make a larger volume of cultural heritage materials available online.
By Gregory Wiedeman

Introduction

Archives are weird. Or at least that seems to be the perception of many library technologists. While archives are often part of larger research libraries, archival methodologies are often misunderstood by our administrator, technologist, and librarian peers. This confusion has become more problematic as archives continue to need and develop more complex access systems to make description, digitized materials, and born-digital objects available over the web. Implementing these systems requires cross-domain partnerships, and the misunderstandings and miscommunications around archival description in particular have severely hindered the development of discovery and access systems for archives. Archives access systems do not work like library catalogs or really anything else on the web, and currently have major usability barriers. To those who work mostly with the bibliographic description used by libraries and most of the web, it can be unclear why archives cannot just use the same systems, or why archives systems and practices seem so limiting for users. Archival methodology and its reasoning can be easily obscured among the more esoteric traditions of archives, like the celebration of famous men to demonstrate value to donors, Hollinger boxes, or finding aids. It is often hard to differentiate between the value and the dogma. Archivists themselves often find it hard to articulate why their needs are different from those of their librarian peers. It can be challenging even for many archival practitioners to acquire strong expertise in archival description. In the United States, archival training is a concentration within a library credential, which can mean merely one or two archives-specific courses. You might only get one single class that discusses archival description, and even that is often taught by a faculty member with a research focus rather than extensive practitioner experience.
Archival description skills often need to be learned on the job and seem to be passed on most effectively through peer groups, mentorship, or other types of informal professional development that not everyone has access to. Even archivists who do have strong knowledge of archival description may not have a detailed understanding of how web applications or other technologies are designed or work in practice. While many archivists see firsthand the constant friction in current access systems, they often struggle to articulate how these systems could be designed better as web applications. The divide in domain knowledge between discovery systems and archival description is a challenging one to bridge. I hope to clarify the core differences between archival and bibliographic description and outline a path toward more effective discovery systems. While bibliographic description is much more intuitive and commonplace in our web applications, archival methods free us to apply the valuable descriptive labor that is the main bottleneck in our digitization and born-digital acquisitions programs more thoughtfully and appropriately.[1] If used properly, archival description could enable us to better provide digitization services on user request at scale and make these materials available online for future users. The extensibility of archival metadata also makes it a perfect fit for using automated description, such as optical character recognition (OCR), entity extraction, or automated transcription, to enhance discovery, as it combines imprecise output with human-created records. I try to make it much clearer why archival metadata makes discovery so peculiar, highlight the cases where it can be advantageous, outline a path forward to increase the usability of archives access systems, and make the case for privileging archival description when planning and designing discovery systems.
The misunderstandings around archival description have hidden an enormous problem: there are no available off-the-shelf systems that provide access to digital materials using archival description. Every digital repository, digital asset management system (DAMS), or institutional repository (IR) uses bibliographic description as an unrecognized design assumption. To illustrate this, I provide a case study of UAlbany's existing Hyrax and ArcLight implementations, which use archival description for discovery by linking data from these systems over APIs. This approach works functionally but has substantial usability and maintenance issues. In working to combine these systems into a single archives discovery system, I wrote a custom indexer that adds digital materials, full-text OCR, and extracted text content to ArcLight as a proof-of-concept example that I hope can illustrate a path forward towards designing access systems that work directly with archival methods. Finally, I will point to some ways we can experiment with how archival inheritance is indexed to potentially mimic bibliographic usability.

Archival vs. Bibliographic Description

By bibliographic description, I mean the creation of individual metadata records for each object with a set of descriptive fields. This has been the intuitive method of managing information going back beyond our relevant professional history. I'm sure you could go back thousands of years and find library workers creating some kind of discrete bibliographic record describing an individual item. Library catalog cards and online public access catalogs (OPACs) are canonical examples of bibliographic description. Each record has a set of descriptive fields and is self-contained – all of the available information is contained within the record.
Dublin Core states this explicitly in its “one-to-one principle,” where it declares that each discrete entity “should be described by conceptually distinct descriptions.”[2] While linked data adds some complexity by potentially breaking up records into statements, data structures and descriptive practices usually remain the same. Most of the information on the web is displayed to users in a way that looks like bibliographic description. A search engine, a major e-commerce site, or Wikipedia will display records of objects to users that contain all the available information. These records often link to other records, but each record still describes an isolated object and is fully comprehensible by itself. The ubiquity of this format proves its intuitiveness and usability. I am sure that this is to some degree an oversimplified caricature of bibliographic practices, but it is a useful contrast to help us better understand the impact of archival description. While archives may appear to be just a specialized type of library, they have a fundamentally different methodology for managing and providing access to materials. Why did early archivists reinvent the wheel and develop incompatible practices that are less intuitive for both professionals and users? The answer is very practical: they simply had too much stuff. The early development of archival description in the United States illustrates how usability was a conscious and necessary tradeoff to be able to adequately manage the scale of records they were working with. The American National Archives was first created in the 1930s, and, since the government had been functioning and creating records for over 150 years, records had previously been managed by individual departments and offices, often with a variety of different methods and techniques.
By 1941 archivists had accessioned 302,114 cubic feet of records from seventy-two different agencies.[3] These early American archivists actually wanted to use bibliographic methods to make all these records easily accessible in familiar ways. They made multiple attempts to use various forms of card catalogs to describe materials and established a classification division devoted to somehow providing subject-based discovery. However, “…given the diverse mass of materials in the National Archives, classification demanded vastly more time and expense than the agency could afford,” and the division was disbanded in 1941.[4] With truck after truck moving more and more records to the archives, all archivists could feasibly do was document the source of records and their existing arrangement. The provenance of each set of records was important because each source had a different arrangement system and discovery process. A user would have to use the “preliminary inventories” created by the archivists to find what office created the records they were seeking, and how that office arranged or maintained them, in order to navigate that file series or records component.[5] These “preliminary inventories” evolved into paper and online finding aids over time.[6] Of course, it would be simpler for users if all the records had a single discovery process, but to early American archivists, that was obviously (if regretfully) infeasible. Usability was a conscious tradeoff to make the enormous volume of materials even somewhat accessible. As a rule of thumb, the approaches used by archivists are useful primarily because of the scale of the materials they manage. Got a large but manageable amount of stuff? Use bibliographic description. Got a seemingly never-ending vast mountain of materials? Use archival description.
This is an oversimplification, as archival methods are also very good at retaining the context of materials, but scale alone is a useful distinction to show how archival systems are meaningfully different.[7] The reality is that in our current unlimited information environment, archives and libraries have larger collecting scopes and volumes of materials than they have descriptive resources, much like the early National Archives. Even with the additional catalogers and archivists that should be hired to address this, archival methods should be reassessed in order to make the line between the available and the inaccessible more gradual, and to make a larger body of materials open for use.

Archival Description in Practice

Most of our librarian and technologist peers understand that archival data is structured differently: archival data is hierarchical, with a tree structure of “components,” such as collections, file series, folders, and perhaps items. However, the way archival description inherits is not widely understood and has important implications for system design. Even archivists do not often articulate how the relationships between components of description work. For example, a repository might hold a folder called “Meeting minutes, 1989 July 26.” This component only has a title and a date, which alone are not very helpful to users. Who was meeting? What were they discussing? Unlike a bibliographic record, not all of the available information is contained within the record, and the relationships to other records are very meaningful. In this example, the file is part of a series titled “New Jersey Proportionality Review Project Records,” which is part of a collection titled the “Leigh B. Bienen Papers.” Both higher-order components have fields where a user might learn the purpose of the meeting and its potential participants and outcomes. Image 1.
The file is part of a series titled “New Jersey Proportionality Review Project Records,” which is part of a collection titled the “Leigh B. Bienen Papers.” https://archives.albany.edu/description/catalog/apap312aspace_c264f5e1f93f9d58e5b60483c32d76e9

Here is where we have to get into the weeds a bit. At all levels, components may use twenty-five elements that are described in the archival content standard, Describing Archives: A Content Standard (DACS). Eleven of these elements are required fields. The standard also outlines a set of requirements for multilevel descriptions that articulates rules for the relationships between multilevel archival components like the above example. This section of DACS is particularly impactful, but it is challenging for non-experts to fully appreciate its meaning. What is not often understood here is that, while most of the eleven required elements are often only used at the collection level in practice, each component is required to be described by every one of these fields. Even the above “Meeting minutes, 1989 July 26” example needs to have a name of creator(s), a scope and content note, an access conditions note, etc. This example is actually described by those elements; they are just stored outside of the record in higher-order components. Lower-level components only use DACS elements if they supersede or are more granular than the higher-order component. If this is not the case, the element from the higher-level component applies. Thus, the scope and content element from the New Jersey Proportionality Review Project Records series component and several elements from the Leigh B. Bienen Papers collection component also describe the “Meeting minutes” file component. When archival repositories used paper finding aids, inherited elements were implicitly displayed using front matter, indents, and other design features that conveyed this relationship, but our current discovery systems do not account for this.
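The inheritance rules described above can be sketched in a few lines of Python. This is a minimal illustration only: the component class, field names, and example values are assumptions for demonstration, not code from any production system.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Component:
    """One node of archival description (collection, series, or file)."""
    title: str
    date: Optional[str] = None
    # Only elements that supersede the parent's are stored locally.
    elements: dict = field(default_factory=dict)
    parent: Optional["Component"] = None

    def resolve(self, element: str) -> Optional[str]:
        """Walk up the hierarchy until some component supplies the element."""
        node = self
        while node is not None:
            if element in node.elements:
                return node.elements[element]
            node = node.parent
        return None

collection = Component(
    "Leigh B. Bienen Papers",
    elements={"creator": "Bienen, Leigh B.",
              "conditions_governing_access": "Open for research."})
series = Component(
    "New Jersey Proportionality Review Project Records",
    elements={"scope_and_content": "Records of a study of proportionality "
                                   "in New Jersey capital sentencing."},
    parent=collection)
folder = Component("Meeting minutes", date="1989 July 26", parent=series)

# The folder record holds only a title and date, yet it is still described
# by the required elements stored in its ancestor components.
print(folder.resolve("creator"))            # inherited from the collection
print(folder.resolve("scope_and_content"))  # inherited from the series
```

The point of the sketch is that a "record" at the file level is incomplete by design; a discovery system must perform this upward walk (or index its results) for users to see the required DACS elements.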
Archival description also provides us with a tremendous amount of flexibility, allowing for the discovery of full text, bibliographic records, and description automatically derived from digital materials within a single descriptive schema and discovery system. DACS allows archivists to use bibliographic metadata, such as Dublin Core fields, to further describe materials when there is a user-driven reason to do so. It just requires a clear and explicit relationship between these records and the archival component that describes them. This allows archivists to create high-quality descriptive records when appropriate. An archival collection can easily contain one series of lower-value or rarely used materials that are only generally described by the series description, and another series of high-value items containing the detailed, high-quality metadata for each item that you would expect in a library catalog. Instead of allocating a similar amount of descriptive labor to all materials, as bibliographic description often does, archival description empowers archivists to use their appraisal skills and spend their valuable time in proportion to the value of the materials they are working with. For rarely used items with less value, materials can still be accessible, just with a higher usability cost.[8] Because archival methodology accommodates lower-quality descriptive records, this also makes it a perfect fit for automated approaches that derive description from digital materials. This includes full text, technical metadata, or the output of computational techniques such as entities extracted using natural language processing (NLP). Archival description is also a perfect fit for using emerging machine learning techniques for extracting meaningful information from digital images and documents for discovery, if these tools can be used without causing harm. There have been some experiments that have used automated approaches to describe special collections materials.
However, no matter how sophisticated, automated methods alone produce lower-quality records that limit discoverability and usability in bibliographic systems.[9] In archival description, these records would always be linked with higher-level metadata created by a human professional. The flexibility of archival description also makes it easy to manually enhance automated description when needed. For lower-value materials that would not receive detailed description, automated description can also be better than nothing. Archivists are also welcome to use automated description at first while assessing its use, potentially enhancing the description later as appropriate. Yet, as I'll discuss later, while archival description encourages these practices, the current systems available for managing digital materials are designed only to work with bibliographic description, and thus they block the use of automated approaches in practice. Archival methods do have significant drawbacks. This is an idealized vision of archival description. Systems that support the creation of quality archival description are a relatively new phenomenon, and a lack of training and support can mean that archival methods are sometimes inconsistently mixed with bibliographic approaches, or just poorly applied. Additionally, even if we design discovery systems that make use of archival description, there is a usability cost that may be unavoidable when we compare it to the simplicity of bibliographic description. When you compare catalog cards to finding aids, or OPACs to ArchivesSpace, bibliographic description is often more familiar and comfortable for most users. The usability problems of online finding aids and archives access systems are very well-documented.[10] The more complex relationships in archival data are simply more challenging to navigate and display intuitively.
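The pattern argued for above — automated output such as OCR text made searchable at a low level while remaining linked to human-created higher-level description — might look like this as a Solr-style index document. The helper function and field names (which mimic Solr dynamic-field suffix conventions) are illustrative assumptions, not ArcLight's actual schema.

```python
def build_solr_doc(component_id, collection_id, parent_ids, title, ocr_text):
    """Assemble one index document for a digitized folder-level component.

    The automated full text is stored on the child document so it is
    searchable; the human-written notes live on the parent documents
    the child links to, and are never conflated with machine output.
    """
    return {
        "id": component_id,
        "collection_ssi": collection_id,
        # Order matters: collection first, then series, then subseries.
        "parent_ids_ssim": parent_ids,
        "title_ssm": title,
        # Lower-quality automated description: indexed for discovery,
        # not displayed as if it were professional metadata.
        "full_text_tesim": ocr_text,
    }

doc = build_solr_doc(
    component_id="aspace_c264f5e1",
    collection_id="apap312",
    parent_ids=["apap312", "aspace_ref_series1"],
    title="Meeting minutes, 1989 July 26",
    ocr_text="Minutes of the proportionality review meeting ...")
```

Keeping the machine-derived field separate from the linked human description is what lets an archivist later enhance or replace the automated text without touching the collection- and series-level records.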
Yet there are paths forward: if we design digital repositories to match the affordances of archival description, we may be able to improve the usability of discovery systems to where the advantages are well worth the costs to users.

The Current Landscape of Digital Repositories

When we apply a strong understanding of archival description to the current landscape of digital repositories, we see that there are several digital repositories available, but no system allows for the discovery of digital material using archival description. This is true across both open source and proprietary systems. Using archival methods for discovery is simply not currently possible without substantial customization. Most repositories are designed as digital asset management systems (DAMS), like CONTENTdm, for the upload and discovery of digital objects, or designed as institutional repositories (IRs), like Islandora, Samvera-based applications, or bepress Digital Commons, that have built-in multi-user submission workflows. Every single one of these systems is designed with bibliographic description in mind. Each assumes that librarians or archivists will enter a set of descriptive metadata fields when uploading digital objects. Each tool also envisions itself as a self-contained system for this description. No complete off-the-shelf system expects description to be managed and made discoverable outside of its interface. Remember that if an item is described by an archival component, DACS requires a clear and explicit relationship between that item and its higher-level components so that users can use those inherited descriptive fields, and it is reasonable for a user to expect at least a navigable link here.
A common workflow is for archivists to digitize an item that is already described by an archival component, but since all DAMS and IRs assume they are self-contained, the archivist then has to spend additional time and labor to create a separate set of Dublin Core or other bibliographic elements for a digital repository. This both duplicates effort and creates an obvious usability barrier. Users often must navigate both a system for archival description and a separate system for digital content. This problem is particularly acute for small repositories: to make digital content available, they are incentivized to change their local descriptive practices to match the system used by whatever consortial repository is available to them. It is probably correct to say that none of the current tools, including CONTENTdm, Islandora, DSpace, bepress Digital Commons, or Samvera-based systems like Hyrax or Hyku, are compliant with DACS. Archivists have no options. This is a major use case that is simply not being met with available tools, likely because of the divide in domain knowledge between archivists and administrators, librarians, and technologists. There is no off-the-shelf product that provides access and discovery for digital materials using archival repositories' existing description methods and systems. Over the last decade or so, there has been a lot of progress in designing and developing systems to manage archival description, with the development of ArchivesSpace being a major success. However, ArchivesSpace, Access to Memory (AtoM), and ArcLight all only manage and/or provide access to description, not digital content. While these tools all provide us with an important piece of the access puzzle, users want to access materials, not just descriptive records.
In-person research will always be a key part of archival repositories, but more and more archival research is being done primarily or solely online, with the COVID-19 pandemic possibly being a major turning point. The closure of reading rooms finally forced many archives to regularly accommodate digitization requests on demand. This is a major advancement in user services, yet many of these materials are often sent directly to users and not uploaded into digital repositories for future use. This is because these systems are not able to accommodate items without additional descriptive labor, despite the items already having archival description and the fact that they were already discovered by a user.[11] Archival repositories need systems that manage digital content to do less – focus on asset management, file serving, and interoperability. Archivists are already able to create and manage complex archival description in tools like ArchivesSpace or AtoM. Archives need digital repositories to manage digital content but be interoperable with and rely on their existing description systems. The International Image Interoperability Framework (IIIF) is a great way to make these connections. There are some important roles that repositories should take on, such as processing or ingest workflows and technical metadata, but digital repositories as currently constituted cannot serve as the primary end-user discovery system for archival materials. It also could be advantageous to designate digital repositories and discovery systems as separate concerns, as repositories can better serve as “back-end” systems that may better provide or be more interoperable with preservation functions.
In the future this may help us avoid design problems like the Samvera architecture, which too tightly coupled preservation and access functionality through ActiveFedora.[12] This separation may also make it easier for systems to manage access restrictions, as archivists need to manage and preserve digital materials that cannot currently be made publicly available; “virtual reading rooms” or limited or controlled access systems are another important piece of the access puzzle.[13] But most importantly, separating discovery from asset management may also provide us with the space and flexibility to design access systems that allow end users to discover and navigate that content using archival description.

UAlbany Case Study

A case study of the Espy papers from UAlbany illustrates both the potential for using archival description to manage digital objects, particularly by enabling digitization on demand, as well as the practical challenges that arose attempting this with current systems. M. Watt Espy spent most of his life documenting capital punishment in the United States. He dug up information for every death row inmate he could find from corrections records, county histories, court proceedings, and popular publications, and summarized each case on index cards – colorfully documenting victims, alleged perpetrators, and circumstances. At his height he had a large network of collaborators that sent him documentation sourced from all over the country. This collection represented the most complete documentation of executions in what is now the United States, dating back to European colonization. In 1984 the National Science Foundation (NSF) awarded a grant to the University of Alabama to create a computational dataset based on the materials, which was first released as Executions in the United States, 1608-1987: The Espy File.
on espy’s death, the original source materials along with other papers were donated to ualbany’s national death penalty archive and in 2010 it received detailed folder-level processing with funding from the national historical publications and records commission (nhprc).[15] while the espy file dataset became a canonical source for criminal justice researchers, abstracting the stories of these thousands of individuals onto a spreadsheet took away a lot of meaning and serviced only certain types of research. some researchers had found issues with the dataset and reference staff had heard a number of anecdotes from users about discrepancies they found between the index cards and the espy file data. seeing so many users willing to travel to see the index cards, along with the potential of leveraging the existing metadata from the dataset made it a strong candidate for digitization and in 2016, ualbany was awarded a council on library and information resources (clir) hidden collections grant to digitize two file series and make them openly available online.[16] since the collection had previously received detailed folder-level processing and the materials were the source for an existing dataset, it seemed wasteful and duplicative to create additional item-level records with bibliographic metadata for what would be about 125,000 digital objects. the espy file dataset was not created as descriptive metadata to our current standards and did not map to the paper materials in a machine-actionable way, so it was not useful as a drop-in replacement for bibliographic metadata in a dams. thus, the collection seemed like an excellent candidate for using existing archival description to provide access to the digital scans, as it could make practical use of the problematic espy file data. our existing systems provided no way to use the existing description to provide discovery and access to digital scans. 
We had recently completed migrating our archival description to ArchivesSpace and were using the eXtensible Text Framework (XTF) and the Luna DAMS for access, but neither XTF nor Luna was interoperable or sustainably customizable, and no digital repository was available that used archival description for discovery out of the box. The ArchivesSpace REST API provided the potential to use archival description in new ways, and we were eager to fully leverage the descriptive labor already dedicated to the collection to benefit users and make our work more impactful. We decided to implement an open-source digital repository that would be more customizable to use folder-level description from ArchivesSpace along with the Espy File dataset. For much of the source materials series, we thought that the quality folder-level description that already existed should be sufficient to provide access. Also, if we could implement a successful process for using existing archival description for digitization, we hoped that we could do the same for other collections, and potentially even provide digitization services on request for single folders without having to create detailed bibliographic metadata. We decided to implement a lightly customized Hyrax repository, which uses the Samvera framework. Hyrax is not a “turnkey” system, but a fully featured set of open components that can be implemented into a digital repository. We hoped the openness of Hyrax would make it easily adaptable to our existing archival description. Over the course of the project, the ArcLight MVP project made ArcLight into a viable option for providing access to archival description. Because it uses a similar Ruby on Rails stack as Hyrax, it became easier to implement ArcLight and integrate it with Hyrax than to do a similar level of customization with the ArchivesSpace public user interface (PUI).
We needed data to be passed both ways, from Hyrax to ArcLight and from ArcLight to Hyrax, and both systems exposed JSON metadata with REST APIs, an invaluable feature we could not have done without. Since both systems used the same technology, much of what we learned customizing one system could also be applied to the other. We did not quite know what we were getting into. The project was significantly under-resourced in both outside funding and internal expertise. However, despite some delays, data problems, and the challenges of learning new technologies, the systems we implemented were a major success. The Espy Project Execution Records website provides open access and discovery to the Espy papers. Our University Libraries also gained a lot of skills and capacity to implement and host open-source applications that would be applicable to other projects, we developed a more productive relationship with the university-level Information Technology Services division, and we are better able to utilize our on-campus virtualized data center. The need to support these systems was successfully used in 2019 to justify filling a vacant technologist position that otherwise was not likely to have received university-level approval. The project enabled us to use existing archival description for digitized and born-digital items and allowed us to provide online access to a much greater volume of materials. On the Hyrax side, we had to develop multiple custom data models to handle both legacy materials from our existing DAMS as well as objects that would rely only on a link to a component of archival description.
it was relatively straightforward to create image and av models to handle the schemas used in our existing dams, but hyrax’s use of linked data uris was a barrier to creating a sensible digital archival object (dao) model for archival description.[17] to make connections between digital objects and components of archival description, we used the 32-character ref_id generated by archivesspace and indexed into arclight. each folder-level component would have a ref_id for itself, could have multiple ref_ids for higher level series and subseries components, and always had a collection identifier for the top-level collection. we thus needed three identifier fields, one containing multiple ids where the order mattered, and each having a separate meaning. it also made sense to store the name of each component in the model as a string. this was challenging to model using linked data uris since hyrax requires a unique uri for each field. once we got a set of uris that hyrax accepted, we essentially ignored the uris downstream and relied on local meanings for the fields. i am skeptical that even a perfectly designed or customized ontology would have provided any value to this project, and trying to use any form of the records in context (ric-o) ontology currently being designed by the international council on archives experts group on archival description (ica egad) would have been a nightmare.[18] once the dao model was complete, we customized the workflow page where an archivist would upload and describe a digital object. this worker would enter the ref_id for the component of archival description, the collection identifier, and click a “load record” button. this button would make a javascript ajax call to the arclight json api and automatically fill most of the descriptive fields. the worker would then only be required to add a resource type and a license or rights statement before uploading the object. image 2. dao model. 
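the “load record” lookup described above can be sketched in python. this is a minimal sketch, not the actual ualbany code: it assumes a blacklight-style `/catalog/<id>.json` endpoint, a hypothetical base url, a locally-assumed document id pattern (collection identifier plus ref_id), and illustrative solr field names.

```python
import json
from urllib.request import urlopen

# hypothetical base url for the arclight instance
ARCLIGHT_URL = "https://archives.example.edu/description"

def map_component(doc: dict) -> dict:
    """map an arclight solr document onto the hyrax dao form fields.
    the field names are modeled on arclight indexing conventions but
    are assumptions here, not the actual ualbany schema."""
    def first(key):
        values = doc.get(key) or [""]
        return values[0]
    return {
        "title": first("normalized_title_ssm"),
        "date": first("normalized_date_ssm"),
        "parents": doc.get("parent_ssm", []),  # ordered ids of higher-level components
        "collection": first("collection_ssm"),
    }

def load_record(ref_id: str, collection_id: str) -> dict:
    """fetch one component from the arclight json api; the document id
    pattern (collection id + ref_id) is a local assumption."""
    with urlopen(f"{ARCLIGHT_URL}/catalog/{collection_id}{ref_id}.json") as resp:
        payload = json.load(resp)
    return map_component(payload.get("response", {}).get("document", {}))
```

in the actual workflow this call happens client-side via ajax, but the data flow is the same: one request by identifier, then mapping the returned fields into the upload form.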
we also customized the display page for each object to pull relevant archival description from arclight, again using client-side javascript calls. when an object page loads, it uses the ref_id and collection identifier to query the archival description component and all of its parent components. the page then displays the names and links for all higher-level components as well as any scope and content notes. the use of client-side ajax calls is imperfect but allowed us to integrate the two systems without much more complex customization within the rails applications. if a worker was digitizing an item, they would just have to find the ref_id and collection number for the folder in archivesspace or arclight, and enter those fields in hyrax with a resource type and rights statement. for descriptive metadata, hyrax would then contain only a title (example: skandalon, vol. 3, no. 9) and date (example: 1965 march 10), which by themselves would not be very helpful to users. when a user accesses the item, hyrax will query and display scope and content notes for the skandalon and the university publications collection. a user could then read that this is a single issue from a bi-weekly journal of news and opinion published by campus christian council, which was part of an artificial collection of student publications. this minimal descriptive workflow, along with rapid lower-quality scanning, allowed for digitization on user request. we later implemented a new digital reproduction fee schedule that charged by the time required for digitization rather than page counts.[19] since we were using existing archival description for metadata and avoiding page count estimates with back-and-forth emails, in many cases we were able to digitize an item in about the same time as a traditional reference request and make requests that take under 30 minutes free to users. 
this practice improved user experience, allowed us to digitize a much larger volume of materials and make them accessible online, and had the added benefit of making our digitization labor more transparent to users. in this example, i received a request for one issue and digitized the whole run of 42 issues in an afternoon merely because i had some extra time and thought the materials were interesting and worth digitizing. in addition to digitizing individual items on request, we developed a batch upload workflow for large sets of items sent to an outside vendor for digitization. the process relied mostly on spreadsheets. here we also used existing archival description so the materials did not require item-level bibliographic metadata. this proved to be really useful for university publications, for example, where we had existing volume and issue lists. we had an existing tool for exporting this metadata from the archivesspace api, so we added a process where an archivist could paste in the corresponding access file for each issue and a script would generate another spreadsheet that could be uploaded into hyrax using a rake task. this workflow enabled us to rapidly digitize large collections or file series that were really valuable for reference use, such as student newspapers, university publications, commencement programs, university organizational charts, press releases, and university senate legislation. while, as always, additional descriptive care would have improved discoverability, making these materials discoverable using existing archival description plus full text ocr and extracted text was a major advancement. while our arclight and hyrax implementations were very successful in providing access to digital materials using archival description, they also have a number of practical limitations. the most obvious problem is that users still must navigate two separate systems, one for archival description and another for digital materials. 
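the batch spreadsheet step described above can be sketched as a small script that pairs exported archivesspace description with the access file matched to each item. the column names are illustrative assumptions, not the actual schema consumed by the ualbany rake task.

```python
import csv

# illustrative columns for a bulk hyrax ingest sheet (assumed, not the real schema)
FIELDNAMES = ["ref_id", "collection_id", "title", "date", "resource_type", "rights", "file"]

def build_upload_sheet(components, out_path):
    """write a spreadsheet for a bulk hyrax ingest task. each dict in
    `components` pairs existing archivesspace description (exported via
    the api) with the access file an archivist matched to it; missing
    columns are written as empty strings."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDNAMES)
        writer.writeheader()
        for component in components:
            writer.writerow({key: component.get(key, "") for key in FIELDNAMES})
```

the key design point is that title, date, and hierarchy come from existing archival description, so the archivist only supplies the filename pairing plus resource type and rights.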
we implemented a “bento” style discovery layer based on quicksearch to make search results from both arclight and hyrax available from a single search box but found that users still had trouble navigating back and forth between the two systems.[20] a redesign in early 2022 based on the duke university arclight implementation addressed some minor issues with this integration, but the core problem remains.[21] additionally, getting data from hyrax back into arclight is challenging. it was easy to modify arclight templates to point to hyrax for digital materials, but once an archivist uploaded a new object into hyrax, that uri had to be added to a new archivesspace digital object record. we were also storing separate preservation copies for each object outside of hyrax so we needed to download the object, store it as a local archival information package (aip), and add an identifier that references the aip into hyrax. since hyrax does not provide an api for this, we were only able to automate this using a very wonky script that queries the hyrax solr index, adds a new digital object in archivesspace, schedules it to be indexed to arclight, downloads and stores the object as an aip and adds the identifier to hyrax by literally scraping the hyrax login and edit pages and posting data to the edit form using the python requests module. it worked, but it was a hack. this process along with overall support for hyrax creates major sustainability risks. our library systems department has struggled to maintain hyrax without anyone with a strong ruby or rails background on staff. major cuts to library staff in 2020-2022 only minimally impacted applications support, but with overall library staff reduced by about 30% due to unfilled retirements, our long-term support for customized applications should be questioned, particularly when we are adapting systems like hyrax and not using them quite as they are intended. overall, there is a need for this setup to be simplified. 
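the form-scraping hack described above can be illustrated with the python requests module. everything here is a hedged sketch: the hyrax url pattern, the work type in the route, and the form field name are all invented for illustration, and a real rails form post also requires replaying the session cookie from the login step.

```python
import re
import requests

HYRAX_URL = "https://media.example.edu"  # hypothetical hyrax instance

def csrf_token(html: str) -> str:
    """pull the rails authenticity_token out of a scraped form page;
    hyrax (a rails app) requires one on every form post."""
    match = re.search(r'name="authenticity_token" value="([^"]+)"', html)
    return match.group(1) if match else ""

def add_aip_identifier(session: requests.Session, work_id: str, aip_id: str) -> None:
    """record a preservation package (aip) identifier on an existing hyrax
    work by scraping its edit form and posting back, since hyrax exposes
    no write api for this. url pattern and field name are assumptions."""
    edit_page = session.get(f"{HYRAX_URL}/concern/daos/{work_id}/edit")
    session.post(
        f"{HYRAX_URL}/concern/daos/{work_id}",
        data={
            "_method": "patch",  # rails tunnels patch through post
            "authenticity_token": csrf_token(edit_page.text),
            "dao[preservation_package]": aip_id,
        },
    )
```

posting to an html form this way is brittle by design, which is exactly the sustainability risk the paragraph above describes.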
a discovery system designed for archival description

archivists need a discovery system for digital materials that uses archival description. a true archival discovery system would query archival description along with item-level bibliographic metadata and automated description derived from digital materials, such as extracted text, ocr text, and a/v transcripts, in a single search interface. arclight has the potential to be this system. currently, arclight is an access system for archival description based on blacklight. it does not manage digital assets but returns individual components of archival description and lets users navigate through connected records. since arclight merely displays data indexed in solr just like blacklight, it also has the potential to display and return search results for digital objects, including full text. description_indexer is an experimental tool that overrides the default arclight indexing pipeline. out-of-the-box, arclight uses traject to index archival description from ead-xml files, often exported from archivesspace. while traject is set up to be easily configurable to select which xpath to use for each solr field, it is not easily customizable to add the significant logic needed to index archival description or data from other sources. instead, description_indexer is a python library that uses archivessnake and pysolr to index archival description directly from the archivesspace api. this approach is potentially very useful for individual repository instances but may be less so for consortial aggregators because of the high permissions levels currently needed to access the archivesspace api. description_indexer contains two very basic json data models, one for archival description and another for the arclight solr index. this extra layer of abstraction is useful, as any data source that can map to the archival description model would then be automatically indexable into arclight. 
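the indexing approach described above can be sketched as two pieces: a pure mapping from archivesspace json to an arclight-style solr document, and a driver that fetches and posts. in this sketch `aspace` stands in for an archivessnake `ASnakeClient` and `solr` for a `pysolr.Solr` connection; the solr field names mirror arclight conventions but are simplified assumptions, not description_indexer's actual models.

```python
def to_solr_doc(component: dict, collection_id: str) -> dict:
    """map archivesspace archival_object json (as returned by its rest api)
    to a minimal arclight-style solr document. field names are simplified
    assumptions modeled on arclight conventions."""
    scope_notes = [
        subnote.get("content", "")
        for note in component.get("notes", [])
        if note.get("type") == "scopecontent"
        for subnote in note.get("subnotes", [])
    ]
    return {
        "id": f"{collection_id}{component.get('ref_id', '')}",
        "title_ssm": [component.get("display_string") or component.get("title", "")],
        "scopecontent_ssm": scope_notes,
        "collection_ssm": [collection_id],
    }

def index_components(aspace, solr, uris, collection_id):
    """fetch each component by uri and post the batch to solr.
    `aspace` is an archivessnake ASnakeClient, `solr` a pysolr.Solr."""
    docs = [to_solr_doc(aspace.get(uri).json(), collection_id) for uri in uris]
    solr.add(docs)
    return len(docs)
```

keeping the mapping separate from the api clients is what makes the abstraction layer useful: any source that can produce the intermediate shape can feed the same index.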
the archival description model is very much a draft and is likely too simple to be comprehensive, but community consensus around a model like this is key to consistently representing digital materials in the arclight index. the description_indexer main branch is set up to be a “drop-in” replacement for the current traject indexer. the dao-indexing branch is designed to be a more experimental branch that flexibly indexes from digital repositories or other systems that manage digital assets. it is designed to be extensible: since individual implementations will likely need to index asset data from a number of different sources, you can write your own plug-in to index digital assets from your local system. once description_indexer is installed, you can add a custom class in a .py file in your home directory or using an environment variable that will allow for local logic to override how digital objects are indexed. the ualbany example that is included queries json from our hyrax instance to index links to content and other item-level data not managed in archivesspace. description_indexer also contains multiple “extractors” for pulling content from digital files using apache tika and/or tesseract; however, running these during indexing is a challenge, and a better design would be to extract and store this data while processing digital files and make it available to the indexer via a file system or a rest service. this is also where there is potential to experiment with new tools for extracting useful information from documents for discovery using nlp or models generated with machine learning. the data pipeline to the indexer needs further consensus and standardization. in writing description_indexer, i discovered that digital objects, files, and file versions are under-theorized in archival description, and archivists need to better define these objects and their relationships. 
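a local plug-in of the kind described above might look like the following. this is a hypothetical illustration of the pattern (a custom class supplying local logic for digital objects), not description_indexer's actual plug-in interface; the method name and solr field names are invented.

```python
class LocalDaoIndexer:
    """hypothetical local plug-in, illustrating the pattern of loading a
    custom class (from a .py file in the home directory or named in an
    environment variable) to override how digital objects are indexed."""

    def index_dao(self, dao: dict) -> dict:
        """turn a digital object record into extra solr fields for its
        component, preferring a representative file version for the link."""
        versions = dao.get("file_versions", [])
        representative = next(
            (fv for fv in versions if fv.get("is_representative")),
            versions[0] if versions else {},
        )
        return {
            "digital_object_label_ssm": [dao.get("title", "")],
            "digital_object_href_ssm": [representative.get("file_uri", "")],
        }
```

the ualbany example in the repository does the equivalent against hyrax's json, pulling links to content and other item-level data not managed in archivesspace.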
the portland common data model (pcdm) provides helpful definitions of objects and files, and should be incorporated as much as possible, but the relationships between objects and archival components in lieu of pcdm collections are ill-defined, and current practice is inconsistent.[22] archivesspace attaches digital objects to archival components, but allows component attributes such as subjects and note fields to be attached to digital objects as well. digital objects also do not have href or url attributes but contain file versions, which have file uri attributes. both digital objects and file versions also have is_representative boolean attributes that are likely useful for digital objects. overall, it should be clearer that digital objects are an abstraction that do not necessarily correspond to a file, and digital objects should probably have a field for an international image interoperability framework (iiif) manifest, as that also can be an abstraction and should be the preferred method of linking archival description to digital materials. attributes for how files and versions are displayed in the absence of a iiif manifest are also likely necessary; overall, it was challenging to model this, and broader and more complex community use cases are needed.[23] the biggest barriers to enabling the discovery of digital materials in arclight are establishing consensus data models and data pipelines. once content from digital materials is indexed into an arclight solr index, we can display those objects in arclight with only some minor customizations and a iiif-compliant image server. i implemented a simple demonstration application that illustrates what this could look like in practice. this system returns results based both on archival description and full-text content extracted from digital objects. this implementation has data and design limitations, but i hope that this can be a useful model that shows the potential for what arclight can be going forward. 
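the object/file/file-version relationships discussed above can be sketched as a minimal data model. this is a draft illustration of the argument, not a community-agreed model: the class and field names are assumptions, drawing on archivesspace's file_version shape and the suggestion that a iiif manifest be the preferred link.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FileVersion:
    """one stored version of a file, following archivesspace's file_version shape."""
    file_uri: str
    is_representative: bool = False

@dataclass
class DigitalObject:
    """a digital object as an abstraction that does not necessarily
    correspond to a single file."""
    label: str
    iiif_manifest: Optional[str] = None  # preferred link, itself an abstraction
    is_representative: bool = False
    file_versions: List[FileVersion] = field(default_factory=list)

    def preferred_link(self) -> str:
        """prefer the iiif manifest, then a representative file version,
        then the first file version, mirroring a display fallback chain."""
        if self.iiif_manifest:
            return self.iiif_manifest
        for fv in self.file_versions:
            if fv.is_representative:
                return fv.file_uri
        return self.file_versions[0].file_uri if self.file_versions else ""
```

even a toy model like this makes the fallback question explicit, which is exactly where current practice is inconsistent.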
privileging archival description in discovery systems

academic libraries and other cultural heritage institutions also manage digital objects using bibliographic description. to avoid implementing and maintaining multiple discovery systems, archival materials are often forced into off-the-shelf irs and dams designed for bibliographic description. a better understanding of archival description shows that it is actually more appropriate to do the reverse, and index bibliographic records into systems designed for archival materials. here, it might be helpful to see archival description as an organizational schema for managing materials which have many different organizational schemas. in the same way that the early national archives used archival description to manage different descriptive methods used by different government agencies, archival systems can also accommodate bibliographic metadata that provides more usable and familiar access. this provides the best of both worlds. we can have one discovery system that provides a strong user experience for higher-value materials while still providing some level of access for materials that do not receive wide interest and otherwise would not receive detailed descriptive care. this also works from a purely technical perspective. while it is possible to model archival description in digital repositories like hyrax, the more complex structure of archival data makes this very challenging. it is comparably much simpler to model bibliographic metadata in archival systems than the reverse. with well-defined data models, we can easily add bibliographic metadata to an arclight index, just like with a blacklight instance. these records could stand alone or also be linked to archival description. this provides arclight with the potential to unify bibliographic and archival metadata in a single user environment, offering the usability of detailed records with the extensibility of archival hierarchy. 
this would provide us with the full potential of archival description to flexibly allocate our descriptive labor based on the value of materials and user needs. navigating complex archival data structures for lesser-value items may still be challenging for users, but if we can make decisions based on the value of materials rather than systems limitations, this should be an effective allocation of our limited descriptive resources. there are also additional opportunities to improve the usability of archival description. since arclight is just an extension of blacklight, it presents description to users in search results as discrete units, much like bibliographic metadata. what we can do is experiment with how archival tree structures are indexed to better match how dacs envisions inheritance. since dacs expects notes that are usually only applied at collection or series levels—like scope and content or historical notes—to apply to lower-level components as well, we can experiment with indexing these notes as part of lower-level components too and just return them with lower relevancy scores. arclight currently indexes parent access and use notes like this but does not use them to return search results. this has the potential to return better results for minimally-described materials, but would need to be part of an iterative usability testing process so that results are weighted appropriately. these are exciting possibilities, but we cannot do usability testing on archival discovery systems until they exist.

conclusion

archival description takes a very different approach to description than what is commonly used elsewhere – whether that be in library catalogs, digital repositories, or on the web. archival methodology has key strengths that make it very useful for managing the vast quantity of digital materials held by libraries and avoiding a digital divide in an era where pandemics and the emissions costs of travel may limit in-person research. 
our descriptive labor, no matter how extensive it is or should be, has limits. if academic libraries continue to prioritize bibliographic approaches to metadata and apply the same level of descriptive care to objects one by one regardless of value, there will always be a hard line between what is accessible and what is not. archival description provides flexibility that empowers us to apply that valuable descriptive care based on the needs of users and prepares us to experiment with automated metadata approaches and iterative workflows. archival methods simply match our descriptive resources to our materials more accurately and appropriately. unfortunately, it is currently very challenging to use archival description to manage and provide access to digital materials, as current digital repositories are not designed to work with archival description. archivists manage description for materials in systems like archivesspace that are designed for archival description, but dams and similar digital repository systems expect them to create additional bibliographic metadata for any digital material they manage, whether that is an appropriate use of resources or not. there is usually no easy way to link the metadata in these two different systems together. in practice, this means that lower-valued items, or the increasing number of items digitized by archives on user request, are not made available or discoverable for future users because they do not have the value needed to receive detailed bibliographic description. this is silly considering archival description already exists for them. since archives data structures can accommodate bibliographic metadata, but the reverse is very challenging, discovery systems design must privilege archival description. currently, there is no easy way to integrate archival description from systems like archivesspace with digital materials managed in digital repositories into a single discovery point for users. 
ualbany’s approach of using a “bento” style discovery layer on top of these two systems works functionally but has substantial usability limitations and sustainability concerns. the misunderstandings around archival description have marginalized archival systems in academic libraries. because our digital access systems have never worked for archival methods, libraries long took shortcuts by establishing whole separate programs to manage unique digital materials and limiting archives and special collections to a very traditional understanding of their collecting scopes. instead of working with archives, libraries often worked around them – often causing needless duplication in metadata work, digitization, asset management, and digital preservation across different reporting structures. arclight has the potential to unify discovery of archival and bibliographic description and provide a single discovery point for physical and digital materials that allows archivists to fully leverage the affordances of archival description. we need further community consensus on a data model for archival description – most notably for digital objects, files, and file versions. i hope description_indexer can be a helpful example that can be iterated upon, that further work can be done to index digital materials in the arclight index, and that we can experiment more with indexing archival description in general. while not really discussed here, archival description’s focus on agents and functions behind the creation of records has the potential to open new patterns for discovery.[24] overall, we need examples of digital materials in arclight alongside archival and bibliographic description for iterative usability testing.

about the author

gregory wiedeman is the university archivist at the university at albany, suny, where he helps ensure long-term access to the school’s public records. 
he manages the university archives and supports born-digital collecting, web archives, and systems implementation for the department’s outside collecting areas. he currently serves as co-chair of the technical subcommittee for describing archives: a content standard (ts-dacs).

endnotes

[1] joyce chapman, kinza masood, chrissy rissmeyer, dan zelner, “digitization cost calculator raw data,” digital library federation (dlf) assessment interest group (2015). https://dashboard.diglib.org/data/. amanda j. wilson, “toward releasing the metadata bottleneck: a baseline evaluation of contributor-supplied metadata,” library resources & technical services vol. 51, no. 1 (2007). https://journals.ala.org/index.php/lrts/article/view/5384/6604. [2] “dcmi: one-to-one principle,” dublin core metadata innovation. https://web.archive.org/web/20220627093857/https://www.dublincore.org/resources/glossary/one-to-one_principle/ [3] mccoy states that “…the national archives had to deal with the greatest volume of records in the world; the unparalleled diversity of their origins, arrangement, and types; and their widely scattered locations in 1935.” donald r. mccoy, the national archives: america’s ministry of documents 1934-1968 (chapel hill, nc: the university of north carolina press, 1978), 45, 69. [4] mccoy, 78-80. philip m. hamer, “finding mediums in the national archives: an appraisal of six years’ experience,” the american archivist, vol. 5, no. 2 (1942): 86-87. [5] the national archives, guide to the material in the national archives (washington, dc: united states government printing office, 1940), ix. [6] this process is discussed in more depth in gregory wiedeman, “the historical hazards of finding aids,” the american archivist, vol. 82, no. 2 (2019): 381-420. https://doi.org/10.17723/aarc-82-02-20. 
[7] in addition to working well at scale, archival description is also more effective at maintaining contextual relationships between records, their creators, and the activities that created them. this is further discussed in jodi allison-bunnell, maureen cresci callahan, gretchen gueguen, john kunze, krystyna k. matusiak, and gregory wiedeman, “lost without context: representing relationships between archival materials in the digital environment,” the lighting the way handbook: case studies, guidelines, and emergent futures for archival discovery and delivery, eds. m.a. matienzo and dinah handel (stanford, ca: stanford university libraries, 2021). https://doi.org/10.25740/gg453cv6438. [8] this practice is best described in daniel a. santamaria, extensible processing for archives and special collections: reducing processing backlogs (chicago: neal-schuman, 2015). shan c. sutton also discusses the further extension of this to digitization in shan c. sutton, “balancing boutique-level quality and large-scale production: the impact of “more product, less process” on digitization in archives and special collections,” rbm: a journal of rare books, manuscripts, and cultural heritage vol. 13, no. 1 (2012). https://doi.org/10.5860/rbm.13.1.369. [9] paul kelly, “better together: improving the lives of metadata creators with natural language processing,” in code4lib journal issue 51 (june 14, 2021), https://journal.code4lib.org/articles/15946. kaldeli, eirini, orfeas menis-mastromichalakis, spyros bekiaris, maria ralli, vassilis tzouvaras, giorgos stamou, and evaggelos spyrou, “crowdheritage: crowdsourcing for improving the quality of cultural heritage metadata,” information vol. 12, no. 2 (february 2021). [10] christopher j. prom, “user interactions with electronic finding aids in a controlled setting,” american archivist 67, no. 2 (2004): 234–68, https://doi.org/10.17723/aarc.67.2.7317671548328620. anne j. 
gilliland-swetland, “popularizing the finding aid: exploiting ead to enhance online discovery and retrieval in archival information systems by diverse user groups,” journal of internet cataloging 4, nos. 3–4 (2001): 199–225, https://doi.org/10.1300/j141v04n03_12. luanne freund and elaine g. toms, “interacting with archival finding aids,” journal of the association for information science and technology 67, no. 4 (2016): 1007, https://doi.org/10.1002/asi.23436. wendy scheir, “first entry: report on a qualitative exploratory study of novice user experience with online finding aids,” journal of archival organization 3, no. 4 (2005): 49–85, https://doi.org/10.1300/j201v03n04_04. joyce celeste chapman, “observing users: an empirical analysis of user interaction with online finding aids,” journal of archival organization 8 (2010): 4–30, https://doi.org/10.1080/15332748.2010.484361. [11] james e. murphy, carla j. lewis, christena a. mckillop, and marc stoeckle, “expanding digital academic library and archive services at the university of calgary in response to the covid-19 pandemic,” ifla journal vol. 48, no. 1 (2021). https://doi.org/10.1177/03400352211023067. florence sloan, “special collections practice in response to the challenges of covid-19: problems, opportunities, and future implications for digital collections at the louis round wilson library at the university of north carolina at chapel hill,” masters thesis, university of north carolina at chapel hill school of information and library science (april 30, 2021). https://cdr.lib.unc.edu/concern/masters_papers/1z40m3313. the infeasibility of creating item level records is also discussed in stephanie becker, anne kumer, and naomi langer, “access is people: how investing in digital collections labor improves archival discovery & delivery,” the lighting the way handbook: case studies, guidelines, and emergent futures for archival discovery and delivery, eds. m.a. 
matienzo and dinah handel (stanford, ca: stanford university libraries, 2021), 33. https://doi.org/10.25740/gg453cv6438. [12] esmé cowles, “valkyrie, reimagining the samvera community,” https://library.princeton.edu/news/digital-collections/2018-06-05/valkyrie-reimagining-samvera-community. [13] elvia arroyo-ramírez, annalise berdini, shelly black, greg cram, kathryn gronsbell, nick krabbenhoeft, kate lynch, genevieve preston, and heather smedberg, “speeding towards remote access: developing shared recommendations for virtual reading rooms,” the lighting the way handbook: case studies, guidelines, and emergent futures for archival discovery and delivery, eds. m.a. matienzo and dinah handel (stanford, ca: stanford university libraries, 2021). https://doi.org/10.25740/gg453cv6438. [14]  m. watt espy, john ortiz smykla, executions in the united states, 1608-2002: the espy file (icpsr 8451), (ann arbor, mi: inter-university consortium for political and social research (distributor), 2016-07-20). https://doi.org/10.3886/icpsr08451.v5. [15] m. watt espy papers, 1730-2008. m.e. grenander department of special collections and archives, university libraries, university at albany, state university of new york. https://archives.albany.edu/description/catalog/apap301. “commission recommends $7 million in grants,” the u.s. national archives and records administration, 2010 june 1. https://web.archive.org/web/20220307211150/https://www.archives.gov/press/press-releases/2010/nr10-107.html. [16] blackman and mclaughlin summarize the widespread praise for espy’s work, while also highlighting some of the espy file’s limitations and criticizing its use for quantitative analysis. blackman and mclaughlin, “the espy file on american executions: user beware,” homicide studies vol. 15, no. 3 (2011): 209-227. [17] models for the ualbany hyrax instance. https://github.com/ualbanyarchives/hyrax-ualbany/tree/main/app/models. 
[18] egad – expert group on archival description, “records in contexts – ontology,” july 22, 2021. https://www.ica.org/en/records-in-contexts-ontology. [19] “request items for digitization,” m.e. grenander department of special collections & archives, university at albany, suny. https://archives.albany.edu/web/reproductions/. [20] “quicksearch,” north carolina state university libraries. https://www.lib.ncsu.edu/projects/quicksearch. [21] sean aery, “arclight at the end of the tunnel,” november 15th, 2019. https://blogs.library.duke.edu/bitstreams/2019/11/15/arclight-at-the-end-of-the-tunnel/. [22] portland common data model (april 18, 2016), https://web.archive.org/web/20220912065008/https://pcdm.org/2016/04/18/models. [23] description_indexer experimental archival description model, https://github.com/ualbanyarchives/description_indexer/blob/dao-indexing/description_indexer/models/description.py. [24] the rockefeller archive center’s dimes access system is a really interesting step in this direction by emphasizing agent records and requiring users to click through archival components to convey description inheritance. renee pappous, hannah sistrunk, and darren young, “connecting on principles: building and uncovering relationships through a new archival discovery system,” the lighting the way handbook: case studies, guidelines, and emergent futures for archival discovery and delivery, eds. m.a. matienzo and dinah handel (stanford, ca: stanford university libraries, 2021). the records in contexts – conceptual model (ric-cm) also has a very intriguing focus on agents and functions for discovery that deserves further practical exploration. “records in contexts – conceptual model.” expert group on archival description (egad), https://web.archive.org/web/20221007020234/https://www.ica.org/en/records-in-contexts-conceptual-model. 
issn 1940-5758. issue 55, 2023-1-20. this work is licensed under a creative commons attribution 3.0 united states license. 