The Code4Lib Journal, Issue 55 (2023-01-20)

Editorial: Journal Updates and a Call for Editors

Journal updates, recent policies, and a call for editors.

This is my second and last time as coordinating editor for Code4Lib Journal. After serving on the Editorial Committee for seven years, I am rotating off the committee to focus on other research projects.

Code4Lib has played a big part in my career. In 2012, I published my first article in the journal. After attending my first Code4Lib conference at North Carolina State University in 2014, funded by a Code4Lib diversity scholarship, I really wanted to get more involved with this wonderful and supportive community. Since then, I have been co-convener of the local New York City chapter of Code4Lib, presented at two national pre-conferences, and served on a couple of Code4Lib national conference committees. Of all these Code4Lib-related activities, working with the Editorial Committee (EC) has been the most rewarding. I have learned a great deal from my fellow committee members, and for that I am immensely grateful. That learning covers everything from copy editing, the article review process, and communicating and collaborating with authors to, most especially, managing a journal.

I would like to share two recent developments with the journal: a guest editorial policy and a retraction policy.

The EC has implemented a guest editor policy. Editorial members have a wide skill set reflective of library coders and technologists; however, some of the articles we review are beyond our scope of expertise. In those situations, we feel it necessary to consult experts outside of the EC. The guest editor policy makes clear to the author, the guest editor, and the readers what the guest editor's role in the review process is.

A retraction policy has also been implemented. It was developed so the EC can withdraw articles that include work that violates ethical standards or may be unreliable. Retractions are not to be taken lightly, and as such, the journal will inform readers why an article was retracted. This becomes another part of the article's lifecycle post-publication.

Since there is now an opening on the Editorial Committee of Code4Lib Journal, please respond to this call for editors. If you are interested in reading and learning about library information technology, as well as being part of a great team of editors, this is an excellent opportunity. Applicants from diverse communities are highly encouraged to apply.

I believe that every issue of Code4Lib Journal has practical applications for almost any library, archive, museum, or other related space, and this issue is no exception. This issue includes:

A Fast and Full-Text Search Engine for Educational Lecture Archives outlines the development of a search engine for educational videos using Python in India.

Click Tracking with Google Tag Manager for the Primo Discovery Service explores how to track open access content through Unpaywall links.

Creating a Custom Queueing System for a Makerspace Using Web Technologies is a case study on streamlining the queue process of a makerspace.

Data Preparation for Fairseq and Machine-Learning Using a Neural Network details sequence-to-sequence models and how they can be applied to a variety of applications with appropriately formatted datasets.
Designing Digital Discovery and Access Systems for Archival Description compares archival and bibliographic description and the challenges of using discovery systems for born-digital materials.

DRYing Our Library's LibGuides-Based Webpage by Introducing Vue.js investigates how to streamline redundant HTML code in the popular LibGuides web content management system.

Revamping Metadata Maker for 'Linked Data Editor': Thinking Out Loud looks at using and evaluating the catalog record creation tool with linked data sources.

Using Python Scripts to Compare Records from Vendors with Those from ILS examines the use of Python to identify and synchronize out-of-sync vendor and ILS catalog records.


Using Python Scripts to Compare Records from Vendors with Those from ILS

An increasing challenge libraries face is how to maintain and synchronize the electronic resource records from vendors with those in the integrated library system (ILS). Ideally, vendors send record updates to the library frequently. However, this is not a perfect solution, and over time the discrepancies can become severe, with thousands of records out of sync. This is what happened when our acquisitions librarian and our cataloging librarian noticed a large record discrepancy. To efficiently identify the problematic records among tens of thousands of records on both sides, the author developed solutions to analyze the data using Python functions and scripts. This data analysis helps to quickly scale down the issue and reduce the cataloging effort.

By Dan Lou

The Issue of Record Discrepancies

At a certain point our acquisitions librarian noticed there were large discrepancies between the records from our vendors (Axis 360 and OverDrive) and those in our Sierra system. The vendors usually send us regular record updates, but over time it still became a challenge to keep the information perfectly synchronized. To resolve these discrepancies more efficiently, I was asked to write a Python script to compare record details and identify the exact records we needed to modify in our catalog.

Comparing Records with ISBN

As a first step, I tried to identify the discrepancies by comparing ISBN numbers. I chose Python pandas (pandas 2018 [1]) as the data analysis tool for this task. Pandas is an open source Python package that is widely used for data science and machine learning tasks. It is built on top of NumPy, another Python package that provides support for multi-dimensional arrays, and it works well with other Python packages such as Matplotlib, which is widely used for data visualization.
Pandas makes it simple to do many of the time-consuming, repetitive tasks associated with working with data, including but not limited to loading and converting data, data scrubbing, normalization, filling, merging and splitting, visualization, and inspection.

You can install pandas and all the relevant packages on your local machine, but another good option is to start coding in Google Colab (Google 2019 [2]). Google Colab is a cloud-based Python Jupyter notebook service with a free tier, and pandas and Matplotlib come pre-installed. To enable pandas in a Jupyter notebook in Google Colab, we simply import the package and give it a short alias:

import pandas as pd

We exported the relevant MARC records from Sierra to a comma-separated values (CSV) file and loaded it into the Jupyter notebook as a pandas DataFrame object. For example, this is how we read a CSV file of the Axis 360 audiobook records exported from Sierra:

sierra_axis_audio = pd.read_csv(
    "/content/drive/My Drive/library/axis_overdrive/sierra_axis360_audio.csv")

This file typically contains columns like title, author, and ISBN entries. A record can sometimes contain more than two ISBN numbers, and those are stored in the "unnamed" columns towards the end for now:

sierra_axis_audio.columns
Index(['title', 'author', 'isbn1', 'isbn2', 'unnamed: 4', 'unnamed: 5', 'unnamed: 6'], dtype='object')

We also needed to clean up all the ISBN numbers in order to compare them with those from the vendor. A typical MARC 020 field can contain many subfields besides subfield $a for the ISBN, so in our exported file the ISBN field usually contains more information than we need for the comparison, for example "9780804148757 (electronic audio bk.)" or "9780062460110 : $62.99". I wrote a Python function to extract the bare ISBN number from each of the columns after the "author" column and compile all the ISBN numbers into a new DataFrame object. One Sierra record can contain multiple ISBN numbers. In this way, I was able to generate a complete list of all ISBN numbers from the MARC records exported from Sierra:

def extract_isbns(df):
    # Collect every ISBN from the columns after the author column into one DataFrame
    isbns = pd.DataFrame()
    for i in range(2, len(df.columns) - 1):
        new_isbns = df.iloc[:, i].astype(str)
        # Keep only the ISBN itself, dropping trailing notes such as "(electronic audio bk.)"
        new_isbns = new_isbns.str.split(' ', n=1).str[0].str.strip()
        isbns = pd.concat([isbns, new_isbns])
    # Keep the original Sierra row number in an "index" column and drop empty cells
    isbns = isbns.reset_index()
    isbns = isbns[isbns[0] != 'nan']
    return isbns

In comparison, the vendor's CSV file is simple and straightforward. It contains only one ISBN field with a clean ISBN number, along with some other fields like title, author, publication date, date added, format, total quantity, and total checkouts. All that is left to do is to import it as a pandas DataFrame object.

Next I wrote a Python function to compare the ISBN numbers in the cleaned-up data from both sources. I created a new column, "exist," in both DataFrame objects. If an ISBN from a vendor record exists among the Sierra ISBNs, the "exist" column on the vendor record is set to 1; if not, it is 0. The opposite is also true: if an ISBN from a Sierra record matches one in the vendor file, the "exist" column on the Sierra record is set to 1; if not, it is 0. The function also returns two additional DataFrame objects that contain only the discrepant records.
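For illustration, the inputs to this comparison function might be prepared as follows. This is a sketch rather than the article's code: the vendor file name is an assumption, and the Sierra DataFrame is the one loaded earlier.

# Load the vendor's export (file name is illustrative)
vendor_axis_audio = pd.read_csv(
    "/content/drive/My Drive/library/axis_overdrive/vendor_axis360_audio.csv")

# Build the flat list of ISBNs from the Sierra export loaded earlier;
# each row keeps its original Sierra row number in the "index" column
sierra_axis_audio_isbns = extract_isbns(sierra_axis_audio)

With these inputs in hand, the comparison function is as follows: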
def find_mismatch(vendor_df, sierra_df, sierra_isbns_df):
    # Flag vendor records whose ISBN appears in the Sierra ISBN list
    vendor_df["exist"] = vendor_df["isbn"].astype(str).isin(sierra_isbns_df[0])
    not_in_sierra = vendor_df[(vendor_df["exist"] == False) | vendor_df["exist"].isna()]

    # Flag Sierra ISBNs that appear in the vendor file
    sierra_isbns_df["exist"] = sierra_isbns_df[0].astype(str).isin(vendor_df["isbn"].astype(str))
    exists = sierra_isbns_df[sierra_isbns_df['exist'] == 1].drop_duplicates("index")
    not_exists = sierra_isbns_df[sierra_isbns_df['exist'] == 0].drop_duplicates("index")
    # A Sierra record counts as matched if any one of its ISBNs matched
    not_exists = not_exists[~not_exists["index"].isin(exists['index'])]
    isbns = pd.concat([exists, not_exists])

    # Carry the match flag back onto the full Sierra export
    sierra_df = pd.merge(sierra_df, isbns, how="left", left_index=True, right_on="index")
    not_in_vendor = sierra_df[(sierra_df["exist"] == False) | sierra_df["exist"].isna()]
    return vendor_df, sierra_df, not_in_vendor, not_in_sierra

After running this script against our records, I did some data analysis to see how many conflicting records were resolved by comparing ISBN numbers. I generated a list of all MARC records that need to be added to the ILS and another list of records that need to be removed. As shown in the following chart, we were able to determine that the majority of our Axis 360 records were in good shape, but on the OverDrive side about a third of the records were problematic. In total, we still had over 4,600 records in Sierra that were troublesome.

Figure I. Vendor vs. Sierra: total mismatches by ISBN.

Comparing Records with Title and Author Fields

To further scale down the issue, I moved on to align the remaining mismatched records by comparing the similarity of the author and title fields. The list of mismatched records is one of the outputs of the previous step of aligning the ISBN numbers.

First, we needed to clean up and extract the title and author fields. I wrote a function to concatenate the title and author fields for each record from both sources. The following function transforms all text to lower case, removes extra punctuation, extracts the relevant information, and concatenates the data to create a new column, "titleauthor," for all the records:

def extract_title_author(vendor_df, sierra_df, vendor_title_field, vendor_author_field,
                         sierra_title_field="title", sierra_author_field="author"):
    # Lowercase and strip punctuation, then join title and author into one comparison string
    vendor_df['titleauthor'] = (
        vendor_df[vendor_title_field].astype(str).str.lower()
            .apply(remove_punctuations).astype(str)
        + ' '
        + vendor_df[vendor_author_field].astype(str).str.lower()
            .apply(remove_punctuations).astype(str)
    )
    # Sierra titles keep only the text before any bracketed qualifier;
    # Sierra authors keep letters and spaces only
    sierra_df['titleauthor'] = (
        sierra_df[sierra_title_field].str.lower()
            .str.extract(r"([^[]*\w*)", expand=False).astype(str)
            .apply(remove_punctuations).astype(str)
        + ' '
        + sierra_df[sierra_author_field].str.lower().astype(str)
            .apply(remove_punctuations).astype(str)
            .str.extract(r"([ a-z]*)", expand=False).astype(str)
    )

Then I used the Python module difflib to compare the similarity of the "titleauthor" columns from the two sources. The difflib module provides classes and functions for comparing data differences.
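Note that extract_title_author above relies on a remove_punctuations helper that the article does not show. A minimal sketch of such a helper, assuming it only needs to strip ASCII punctuation and collapse whitespace, might look like this:

import string

def remove_punctuations(text):
    # Hypothetical helper assumed by extract_title_author: drop ASCII punctuation
    # and collapse repeated whitespace so titles and authors compare cleanly
    text = str(text).translate(str.maketrans('', '', string.punctuation))
    return ' '.join(text.split())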
Here is the function to compare the two sources and rank their similarity. It is closely modeled on difflib's get_close_matches, but returns the indexes of the closest matches rather than the matched strings:

from difflib import SequenceMatcher
from heapq import nlargest

def get_close_matches_indexes(word, possibilities, n=3, cutoff=0.6):
    if not n > 0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    result = []
    s = SequenceMatcher()
    s.set_seq2(word)
    for idx, x in enumerate(possibilities):
        s.set_seq1(x)
        # Cheap upper-bound checks first, full ratio last
        if s.real_quick_ratio() >= cutoff and \
           s.quick_ratio() >= cutoff and \
           s.ratio() >= cutoff:
            result.append((s.ratio(), idx))
    # Keep the n best matches (as in difflib.get_close_matches) and return their indexes
    result = nlargest(n, result)
    return [idx for score, idx in result]

After comparing the similarity of the title and author fields, I was able to generate new spreadsheet files for all of the mismatched records. Each row in the spreadsheet contains a Sierra record together with the most similar vendor record by title and author. All rows are sorted by similarity level on a scale of 0 to 1: a similarity of 0 means the two records in the row are very different, while 1 means they are nearly identical. Again, I made some charts to see how much further this would scale down the issue.

Figure II. Similarity of title and author for mismatched Axis 360 audio records.

Figure III. Similarity of title and author for mismatched Axis 360 ebook records.

Figure IV. Similarity of title and author for mismatched OverDrive audio records.

Figure V. Similarity of title and author for mismatched OverDrive ebook records.

As shown in the charts, if we take a similarity level of 0.7 as the benchmark, roughly two-thirds of the mismatched records required more cataloging effort. With the Python scripts we successfully narrowed a problem of 30,843 records down to 3,394 within a few days. Only those 3,394 records required further examination by our cataloger, which greatly reduced the cataloging effort needed to fix the record discrepancies.

Figure VI. Comparing 30,843 records from Sierra and the vendors (OverDrive and Axis 360).

Conclusion

It has become an increasing challenge to keep the MARC records of online resources synchronized between the library's ILS and the vendors' provided records because of the frequent changes taking place. The solution described in this article is a good remedy for identifying and fixing record discrepancies accumulated over time. I highly recommend libraries adopt this method and make it a routine task to prevent the issue from snowballing.

It would be even better if we could insert into the vendor-provided MARC records a customized field identifying a valid date range. We could then develop a script that automatically renews or modifies the records that are close to or past their due date by retrieving them from the ILS via record export or an API; a rough sketch of this idea appears below. Many vendors tend not to return a 404 error page when a library loses access to an online resource such as an ebook, so the script would also need to be versatile enough to detect links that stop working properly.

On the other hand, a big change has been taking place in recent years as the record system for online resources slowly moves out of the library's ILS. For example, the library online service provider BiblioCommons has implemented BiblioCloudRecords, which allows libraries to add records for certain online resources to the library's catalog website via API without adding the MARC records to the ILS. While this improves resource access for library patrons, it means patron data and usage statistics have inevitably shifted from libraries to the vendors.
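Picking up the valid-date idea above: assuming the customized MARC field were carried into the exported CSV as a hypothetical valid_until column, a quick pandas pass could flag the records due for renewal. This is only a sketch of the idea, not an implemented workflow:

from datetime import datetime, timedelta

# "valid_until" is a hypothetical column populated from the customized MARC field
vendor_df["valid_until"] = pd.to_datetime(vendor_df["valid_until"], errors="coerce")

# Flag records that are already expired, expiring within 30 days, or missing a date,
# so they can be re-exported from the ILS or refreshed through an API
soon = datetime.today() + timedelta(days=30)
needs_refresh = vendor_df[vendor_df["valid_until"].isna() | (vendor_df["valid_until"] <= soon)]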
Online resources have had a stronger presence in library collections since the pandemic and will become more and more influential in the foreseeable future. It is a good question to ask now how libraries can better position themselves in this new online resources landscape.

References

[1] pandas. 2018. Python Data Analysis Library. pydata.org. https://pandas.pydata.org/
[2] Google Colaboratory. 2019. google.com. https://colab.research.google.com/

About the Author

Dan Lou is a senior librarian at Palo Alto City Library, where she works on web content management and develops pioneering projects on robotics, coding, AR, and the distributed web. Her particular interests include Python, machine learning, and data analytics. Previously, she worked as a technical project manager at Wolfram Alpha, a knowledge base queried by Apple Siri and Amazon Alexa. Before that, she was a systems librarian at Innovative Interfaces. Author email: loudan980@gmail.com


A Fast and Full-Text Search Engine for Educational Lecture Archives

E-lecturing and online learning have become more common and convenient than offline teaching and classroom learning in the academic community since the COVID-19 pandemic. Universities and research institutions are recording the lecture videos delivered by their faculty members and archiving them internally. Most of the lecture videos are hosted on popular video-sharing platforms through private channels, and students access the published lecture videos independent of time and location. Searching large video repositories is difficult for students because search is restricted to metadata. We present the design and development of an open source application that builds an educational lecture archive with fast, full-text search within the video content.

By Arun F. Adrakatti and K.R. Mulla

Introduction

E-lecturing has become increasingly popular over the past decade, and there has been an exponential increase in the amount of lecture video data posted on the internet. Many universities and research institutions are recording their lectures and publishing them online for students to access independently of time and location. In addition, numerous massive open online courses (MOOCs) are popular across the globe for their ability to provide online lectures in a wide variety of fields. The availability of online courses and their ease of access have made them a popular learning tool. Lecture videos and MOOCs are hosted on cloud platforms to make them available to registered users, and most of these resources are intended for the benefit of the public. The majority of lecture videos are available on popular video-sharing platforms such as YouTube, Vimeo, and Dailymotion. Due to a shortage of storage servers and internet bandwidth, academic and research institutions have difficulty maintaining their own video repositories.
A great number of lecture videos uploaded to online platforms are annotated with only a few keywords, which leads search engines to return incomplete results. Only a limited number of keywords are available for accessing lectures, and searching relies primarily on their occurrence or on tags. Videos are retrieved solely based on their metadata, such as title, author information, and annotations, or by user navigation from generic to specific topics, so users can find materials only through these limited options. Irrelevant and random annotations, navigation, and manual annotations made without considering the video content are the major stumbling blocks to retrieving lecture videos.

Existing Approaches and Proposed Models

The fast growth of video data has made efficient video indexing and retrieval technologies one of the most essential concerns in multimedia management. [1] Bolettieri et al. (2007) proposed a system based on MILOS, a general-purpose multimedia content management system developed to aid in the design and implementation of digital library applications. [2] The goal is to show how digital information, such as video documents or PowerPoint presentations, may be reused by utilizing existing technologies for automatic metadata extraction, including OCR, speech recognition, cut detection, and MPEG-7 visual features for content-based video retrieval (CBVR). Yang et al. (2011) provided a method for automated lecture video indexing based on video OCR technology, built a new video segmenter for automated slide video structure analysis, and implemented a new algorithm for slide structure analysis and extraction based on the geometrical information of identified text lines. [3] An approach for automated video indexing and search in large lecture video archives has also been developed. [4] By using video optical character recognition (OCR) technology on key frames and automatic speech recognition (ASR) on lecture audio files, automatic video segmentation and key-frame identification can provide a visual guideline for video content navigation as well as textual metadata. For keyword extraction, both video- and segment-level keywords are recovered for content-based video browsing and search, using OCR and ASR transcripts as well as detected slide text. In short, these authors proposed extracting text from the speech in the videos and extracting text via OCR from the slides used while delivering the lectures. Most of these proposed ideas and developed concepts exist in commercial platforms that academic and research institutions cannot afford.

Motivation for the Project

In the Indian context, the vast majority of lecture video repositories and MOOCs are hosted on the YouTube platform, with the web links embedded in content management systems (CMS) and e-learning applications. Typically, these applications search only the metadata and human annotations associated with specific videos. Some applications lack search engines altogether; the SWAYAM (Study Webs of Active-Learning for Young Aspiring Minds) platform is the best example of this, where the user must browse videos by subject. Commercial video lecture applications focus only on recording and organizing video lectures by subject and topic; retrieval of videos from the repositories is the least of their concerns, and such applications neglect users' needs.
Users often do not find the desired information in lecture video repositories and invest a lot of time browsing and listening to videos to find it. Popular open source institutional repository systems are limited to managing document and image files, and retrieval is restricted to metadata and human annotations. These limitations led to the design and development of an educational lecture archive focused on retrieving videos based on their content.

Development of a Content-Based Video Retrieval System

The capture of e-lectures has become more popular, and the amount of lecture video data on the web is growing rapidly. Universities and research institutions are recording their lectures and publishing them online for students to access independent of time and location. On the other hand, users find it difficult to search for the desired information in these video repositories. To overcome this issue, we designed and developed an application to organize educational lecture videos with a practical approach: searching videos based on the entire speech contained in the video. The application is named CBMIR (Content-Based Multimedia Information Retrieval). The scope of CBMIR covers only educational lecture video repositories maintained and hosted by universities, R&D institutions, and nonprofit organizations of India, and is limited to speech extraction and automatic text indexing of lecture videos available in the English language.

Figure 1. Overview of the CBMIR application.

Technical Requirements

Linux-based OS: Ubuntu
Django: Python-based web framework used as the content management system
Python: programming language
Anaconda: Python distribution
Whoosh: a fast, full-text search engine library written in Python

The CBMIR application has three major modules: the administrator module, the information processing and retrieval module, and the user platform. Figure 2 shows the detailed workflow, from an administrator uploading a video into the repository to a user retrieving the desired videos.

Figure 2. Flow chart of content-based multimedia information retrieval.

Modules of the CBMIR Application

Administrator module

The administrator module allows the admin to upload external videos using hosted web links, or to upload video files from a personal device or external storage device. Figure 3 shows the dashboard of the administrator module. Authorized admins have access to the application to manage and upload data. The admin can view, edit, and delete uploaded videos through the content library management option. The admin can also add new users and assign limited admin roles to distribute the workload of uploading content when building large educational repositories, and has the privilege to add, edit, and delete the transcript text uploaded into the database.

Figure 3. Dashboard of the administrator module.

Information retrieval processing module

The following are the step-by-step processes for information processing and storage of videos in the CBMIR application. Figure 4 shows the status bar of the work process after uploading an external video link into the CBMIR application.

Uploading videos:

Download the video from an external source: the video is downloaded to the application server from an external server, using the YouTube API to access YouTube data and the Google API for login credentials.
Download the video from a device: the video is uploaded from a personal device or external storage device using a Python library.

Video to audio: the video file is converted to audio using the FFmpeg Python library.

Audio to text: the audio file is converted to a text file by speech recognition, using the PocketSphinx Python library with an acoustic model.

Text segmentation: the whole text is broken into multiple segments based on the timings.

Automatic indexing and searching: the text file is auto-indexed using the Whoosh library (a sketch of this step appears after the feature list below).

Figure 4. The status bar for information processing and storage.

User module

The user module acts as a search engine; the page contains a search box, with a dashboard below it that displays the total number of videos and the subject- or topic-wise collections. The user can search for the desired information using keyword terms, and search results are displayed based on word occurrence. The Whoosh library takes the text files converted from the audio, auto-indexes them, and stores them in a fast, searchable format. The keyword term is searched in the database and a list of videos is displayed; a video starts playing with a single click.

Figure 5. Display of search results for the keyword "speech."

The whole text is broken into multiple segments based on the speech timing and auto-indexed into the database. Thanks to this segmentation, the video starts playing at the specific keyword or phrase searched by the user.

Figure 6. Display of search results based on time segmentation.

Unique features of the application

Open source application: the CBMIR application is designed and developed on open source web applications and databases. The source code of the application will be shared on a popular source code distribution platform for further development by the community.

Fast and full-text search engine: searching through a full-text database or text document is referred to as full-text search. A search engine embedded within CBMIR analyses all the words contained within a document and compares them to the search criteria specified by the user. Full-text search over the converted text files ensures fast retrieval of results across large numbers of documents, and the search engine provides accurate and precise results in all fields.

Domain-based lecture video repository: the CBMIR application allows subject-based lecture video archiving for academic institutions. The repository provides consolidated subject-specific lectures and helps users spend less time searching popular video-sharing platforms for lectures.

No video advertisements in the application: videos in the CBMIR application play without advertisements, whereas popular video-sharing platforms typically include advertisements at the beginning and in the middle of their videos.

Lightweight, less storage space, and unlimited video upload: the CBMIR application is very lightweight; it extracts speech, converts it to text, and stores the text files in the database, which usually requires little storage space. There is no limit on uploading videos into the application.

Text segmentation for retrieval of video content: the application divides the text converted from audio into sets of words along with video timestamps. The video is played from the specific timestamp matching the user's search term.
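The CBMIR source is not included in the article, but the Whoosh indexing and search step might look roughly like the following sketch. The schema and field names (video_id, start_time, text) are illustrative assumptions, not the application's actual code:

import os
from whoosh.index import create_in
from whoosh.fields import Schema, ID, NUMERIC, TEXT
from whoosh.qparser import QueryParser

# One document per transcript segment, keyed by video and start time (in seconds)
schema = Schema(video_id=ID(stored=True),
                start_time=NUMERIC(stored=True),
                text=TEXT(stored=True))

os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(video_id="lecture-001", start_time=95,
                    text="speech recognition converts the audio track into text")
writer.commit()

# Search the indexed transcript segments for a keyword
with ix.searcher() as searcher:
    query = QueryParser("text", ix.schema).parse("speech")
    for hit in searcher.search(query, limit=10):
        print(hit["video_id"], hit["start_time"], hit["text"])

Because each segment stores its start time, a matching hit can be used to start playback of the video at the moment the searched phrase is spoken, which is the behavior described above.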
Conclusion

Due to COVID-19, e-lecturing became part of the academic community at all levels of education. Academic institutions find it difficult to archive lecture videos on internal servers and instead host them on popular video-sharing platforms, while users find it difficult to locate the desired video information in these large repositories because search is restricted to the metadata, human annotations, and tags of a particular video. To overcome this problem, the CBMIR application has been developed using open source technology to build subject-based educational video repositories focused on fast, full-text search of the video content. The current CBMIR application is limited to converting speech to text in the English language. Further development includes converting speech to text for Indian regional languages, translating converted transcripts into Indian regional languages, adding voice search, extracting text using OCR technology, finding objects in the videos, creating a dashboard in the user module, and more. The source code of the CBMIR application is being submitted to the authors' university to fulfill degree requirements and will be distributed on GitHub under a Creative Commons CC BY-NC license after completion of the degree.

References

[1] Saoudi, E. M., & Jai-Andaloussi, S. (2021). A distributed content-based video retrieval system for large datasets. Journal of Big Data, 8(1). https://doi.org/10.1186/s40537-021-00479-x
[2] Bolettieri, P., Falchi, F., Gennaro, C., & Rabitti, F. (2007). Automatic metadata extraction and indexing for reusing e-learning multimedia objects. Proceedings of the ACM International Multimedia Conference and Exhibition. https://doi.org/10.1145/1290067.1290072
[3] Yang, H., Siebert, M., Lühne, P., Sack, H., & Meinel, C. (2011). Lecture video indexing and analysis using video OCR technology. Proceedings of the 7th International Conference on Signal Image Technology and Internet-Based Systems, SITIS 2011. https://doi.org/10.1109/sitis.2011.20
[4] Yang, H., & Meinel, C. (2014). Content based lecture video retrieval using speech and video text information. IEEE Transactions on Learning Technologies, 7(2). https://doi.org/10.1109/tlt.2014.2307305

About the Authors

Arun F. Adrakatti (arun@iiserbpr.ac.in) is Assistant Librarian at the Indian Institute of Science Education and Research (IISER) in Berhampur, Odisha, India.

K.R. Mulla (krmulla@vtu.ac.in) is Librarian at Visvesvaraya Technological University in Belagavi, Karnataka, India.


Click Tracking with Google Tag Manager for the Primo Discovery Service

This article introduces practices at the Oregon State University library for tracking the usage of Unpaywall links with Google Tag Manager in the Primo discovery interface.
Unpaywall is an open database of links to full-text scholarly articles from open access sources [1]. The university library adds Unpaywall links to Primo to provide patrons with free and legal full-text access to journal articles and to promote greater usage of open access content. However, usage data for the Unpaywall links was unavailable because Primo does not track customized links. This article details how to set up Google Tag Manager to track the usage of Unpaywall links and create reports in Google Analytics. It provides step-by-step instructions, screenshots, and code snippets so readers can adapt the solution to their own integrated library systems.

By Hui Zhang

Introduction

In 2020, staff at Oregon State University library started a project to provide single-click links to open access content in 1Search [2], the university library's discovery interface built on the Primo service platform [3]. The goal of the project is to provide free and legitimate access to full-text scholarly resources for our patrons. Although Primo already includes open access content in its search results, studies [1] show that Primo's solution has significant flaws in indexing and surfacing open access resources. Ultimately, we decided to use Unpaywall, an open database that harvests and indexes tens of millions of open access scholarly articles, as a source of open access content in addition to Primo. By customizing the user interface (UI) of Primo, we added Unpaywall links to open access content in the search results and the individual item view. The problem is that we cannot get link usage statistics from Primo's analytics tool, because Primo does not track the Unpaywall links. This article details how to track customized links in Primo using Google Tag Manager [4], including testing, troubleshooting, and creating usage reports with Google Analytics. Although the case study is specific to tracking Unpaywall links, the workflow and configuration of Google Tag Manager are general, and readers may adapt the included snippets and tags to track activity on websites beyond library systems.

Adding Open Access Links with Primo Customization

Finding open access resources in Primo

Primo users can find open access content in two ways. The first is to filter search results by selecting "Open Access" in the availability facet.

Figure 1. Open access facet in Primo.

The second is to look for the open access icon that appears for an item identified as open access, both in the search results and in the full item view.

Figure 2. Open access indicator in Primo.

Primo provides open access content for many resource types, such as journal articles, books, and theses.

Why add Unpaywall links to Primo?

The significant flaw in Primo's approach to open access is its preference for subscribed content over open access. One study [1] found that Primo's search results link to subscribed journals even when the articles are open access, undermining the visibility of open access content to readers. With the growing demand to make more open access content available to patrons, Primo's developers added a feature that integrates the Unpaywall API so users can find open access articles that might not otherwise appear or be available to them [2].
However, librarians at Oregon State University (OSU) ultimately decided to add Unpaywall links to Primo by adopting a solution called oadoi Link [3], developed by the Primo Customization Standing Group. The standing group is part of the Orbis Cascade Alliance, of which Oregon State University is a member. One advantage of oadoi Link is that the OSU library has better access to technical support because it is locally developed. More importantly, a recent study [4] found that oadoi links can surface an estimated 30% more open access articles than Primo's Unpaywall feature. We extended the oadoi links in our solution [5] to provide Unpaywall links for open access items in the brief display view, next to the availability status.

Figure 3. Customized Unpaywall link shown in the brief display of an open access item.

Offering Unpaywall as single-click links is a significant usability improvement, because library patrons can access the full-text content without authenticating with their university credentials.

The Challenge of Tracking Unpaywall Link Usage in Primo

Because the Unpaywall links are added to Primo by a UI customization, we cannot get usage statistics for these links from Primo: they are simply not tracked. This is a major problem, as usage data is crucial evidence for assessing the impact and success of the Unpaywall project. To overcome it, we investigated the possibility of using Google Tag Manager to track the customized Unpaywall links.

Because Primo continues to add new features, including link tracking, it is worth describing the current situation so readers understand the motivation and contribution of our approach. Ex Libris, the company that develops Primo, added the capacity to track and report the usage of Unpaywall links in August 2021 [6]. However, that feature is only available in Primo VE [7], a newer cloud platform distinct from Primo. The Ex Libris solution was unavailable when we investigated Google Tag Manager in 2020, and at that time we used Primo, not Primo VE, as our discovery interface. For full disclosure, Oregon State University library migrated its discovery interface to Primo VE in the summer of 2022. However, many libraries worldwide still use Primo, and our work on Google Tag Manager can help them track customized links like Unpaywall in their discovery interfaces.

Tracking Unpaywall Links with Google Tag Manager

How Google Tag Manager works

Like us, many readers may at first be confused about what Google Tag Manager actually is. We will explain how it works by answering two questions: what is a tag, and what is the difference between Google Tag Manager and Google Analytics?

According to Google, a tag is a code snippet deployed to measure website user activity [8]. These tags, or tracking codes, were usually created and deployed by developers before tools like Google Tag Manager were available. With Google Tag Manager, people can create, test, and deploy a tag without programming skills. Setting up a tag requires creating triggers that tell the manager when, where, and how to fire the tag. Google Tag Manager provides two types of click triggers: All Elements and Just Links. The All Elements trigger can track clicks on any element on a page, such as links, images, and buttons. The Just Links trigger tracks clicks on HTML links that use the <a> element.
Google Tag Manager and Google Analytics are two different tools, but they are meant to be used together. Google Tag Manager can add Google Analytics tracking code (i.e., a tag) to a website, but it cannot create reports. Instead, it sends the website's activity data to Google Analytics for analysis and reporting.

Adding Google Tag Manager to Primo

Your first step in adding Google Tag Manager to Primo is creating an account. Go to the Google Tag Manager website to create an account or log in with your Google account. Then create and name a container with "Web" as the target platform; you will be given an ID in the format "GTM-XXXXXXX" after the container has been created. Take note of the container ID because you will need it in the next step. We suggest creating a container for every website, then defining tags for the activities you want to track in the newly created container.

Next, you will add a snippet to Primo that allows Google Tag Manager to track web activity. The technical details of managing and customizing the Primo UI package are beyond the scope of this article; however, Primo administrators should have the knowledge and privileges to add the sample JavaScript snippet below to Primo. Make sure you use your own container ID in the snippet.

/* Google Tag Manager */
const gtmId = 'GTM-XXXXXXX'

function addGTM(doc) {
  const newScript = doc.createElement('script')
  const scriptText = `(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','${gtmId}');`
  newScript.innerHTML = scriptText
  doc.head.append(newScript)

  const noScript = doc.createElement('noscript')
  const noScriptText = `<iframe src="//www.googletagmanager.com/ns.html?id=${gtmId}"
height="0" width="0" style="display:none;visibility:hidden"></iframe>`
  noScript.innerHTML = noScriptText
  doc.body.insertBefore(noScript, doc.body.firstChild)
}

addGTM(document)

Then save and deploy the change so it takes effect in Primo. Congratulations! You have done all the required configuration on the Primo side; everything else happens in Google Tag Manager.

Creating a Tag and Trigger for Unpaywall Link Clicks

You will create a Google Tag Manager tag and trigger in the newly created container. For example, you can create a tag named "Unpaywall" with the type Google Analytics and associate it with an existing Google Analytics account. In our case, we associate the tag with the Google Analytics account of Primo.

Figure 4. Google Tag Manager tag with the type of Google Analytics.

Inside the new tag, you need to create a trigger that fires the tag when an Unpaywall link is clicked.

Figure 5. Google Tag Manager trigger for tracking Unpaywall link clicks.

For our purpose of tracking link click events, make sure you select "All Elements" as the trigger type. The next step is to attach a condition to the trigger. In our example, we create a condition like this: the tag is activated when links with the text "open access(via unpaywall)" are clicked. The link text used in the trigger is the label of the Unpaywall link that our customization code adds to the Primo UI. If you want to customize this method, you can define a trigger with different conditions appropriate to your needs.
Trigger Testing and Troubleshooting

You can test newly created tags and triggers using the Preview feature in Google Tag Manager.

Figure 6. Google Tag Manager preview for testing tags and triggers.

The Preview feature is available at the container level. A pop-up window appears after clicking the "Preview" button, where you can enter the URL of the Primo instance for testing. After the connection is established, you can go to the Primo website, click an Unpaywall link, and check whether the tag is triggered as expected in Google Tag Manager. The browser's developer console is your best tool for troubleshooting; for example, we used Chrome's console to confirm that the label text of the Unpaywall link matched the text entered in the trigger condition. After finalizing the configuration with the Preview feature, you must publish the changes by clicking the "Submit" button to the right of "Preview." You will be asked to create a version of your container and then publish it. Deployment is reasonably quick; in our case, click data appeared in Google Analytics a few minutes after we published the change in Google Tag Manager.

Generating a Usage Report with Google Analytics

You can use the many reporting functions in Google Analytics to review and analyze the data collected by Google Tag Manager. In our example, statistics on Unpaywall link usage appear under Behavior > Events, in the "click" event category.

Figure 7. Click event report in Google Analytics.

The results are promising: they show patrons are attracted to open access content, and there is a clear trend of more patrons using the Unpaywall links. For instance, the total number of clicks on the Unpaywall link was 53,361 during the 2021 calendar year. That number jumped to 60,534 for the first six months of 2022, up until OSU migrated its discovery interface to Primo VE.

Conclusion

In this article, we describe our work tracking customized Unpaywall links with Google Tag Manager in Primo. We outline the motivation for our project, introduce Google Tag Manager, and provide details on how to define tags for tracking Unpaywall link clicks. We have used the data collected by Google Tag Manager for decision making; for example, based on the usage statistics of Unpaywall links collected in Primo, we decided to continue providing open access and Unpaywall links in Primo VE. By integrating Google Tag Manager and Google Analytics, we can also gain more insight into patrons' activities, such as which open access databases are popular and which subjects patrons are most interested in. We hope the code and screenshots are helpful and that readers can refer to them in their own work.

Notes

[1] Unpaywall: https://unpaywall.org/
[2] 1Search: https://search.library.oregonstate.edu/
[3] Primo service platform: https://exlibrisgroup.com/products/primo-discovery-service/
[4] Google Tag Manager: https://tagmanager.google.com/

References

[1] Bulock, C. (2021). Finding open content in the library is surprisingly hard. Serials Review, 47(2), 68-70. doi: 10.1080/00987913.2021.1936416
[2] How to utilize the Unpaywall API for open access content and resources in discovery. (2021). Ex Libris Knowledge Center. https://knowledge.exlibrisgroup.com/alma/knowledge_articles/how_to_utilize_the_unpaywall_api_for_open_access_content_and_resources_in_discovery
[3] oadoi Link. (2022). Orbis Cascade Alliance. https://www.orbiscascade.org/programs/systems/pcsg/primo-ve-toolkit/oadoi-link/
[4] Veldhuisen, K. (2020).
Unpaywall in Alma, oadoi customization in Primo (and other open access). Retrieved from https://docs.google.com/document/d/1rbz7l4_ktra7psxfxatpyjev-og3sizw1qjrpxc5ebk/edit
[5] osulp/1search-ui-package. (2021). GitHub. https://github.com/osulp/1search-ui-package
[6] Primo VE 2021 release notes. (2022, May 6). Ex Libris Knowledge Center. https://knowledge.exlibrisgroup.com/primo/release_notes/002primo_ve/2021/010primo_ve_2021_release_notes
[7] Primo VE overview. (2022, September 18). Ex Libris Knowledge Center. https://knowledge.exlibrisgroup.com/primo/product_documentation/020primo_ve/primo_ve_(english)/010getting_started_with_primo_ve/005primo_ve_overview
[8] Overview. (2022). Google Developers. https://developers.google.com/tag-platform/devguides

About the Author

Hui Zhang (hui.zhang@oregonstate.edu) is the digital services librarian at Oregon State University.


DRYing Our Library's LibGuides-Based Webpage by Introducing Vue.js

At the Kingsborough Community College library, we recently decided to bring the library's website more in line with DRY principles (don't repeat yourself). We felt this could improve the site by producing more concise and maintainable code: DRYer code is easier to read, understand, and edit. We adopted the Vue.js framework to replace repetitive, hand-coded dropdown menus with programmatically generated markup. Using Vue allowed us to greatly simplify the HTML documents while also improving maintainability.

By Mark E. Eaton

Keeping It DRY

A common goal among programmers is to write code that is DRY, in other words, code where you don't repeat yourself. This is usually motivated by the insight that computers can often effectively automate repetitive tasks, making it unnecessary to repeat yourself in code. Taking advantage of the efficiencies of automation is widely regarded as a best practice among programmers.

However, HTML, when written by hand, is unfortunately not well suited to DRY practices. HTML is particularly declarative: all elements of the page are explicitly laid out by the programmer so as to fully describe its structure. The problem is that hand-written webpages are often not very DRY; even pages of relatively modest complexity can quickly grow into very long HTML documents. This can be problematic for a few reasons:

It can become difficult to conceptualize the structure of a whole page when it stretches out over hundreds of lines.

Even relatively trivial aspects of coding, such as indentation, can become difficult within the deeply nested HTML structures of a large page.

It is easy to introduce syntax errors or formatting problems into long HTML documents, because typos can be easily overlooked.
This is especially problematic in cases where there is no built-in linting or validation. [1]

At Our College

These challenges were familiar to us at Kingsborough Community College, a college of the City University of New York. Our homepage, built on LibGuides CMS, ran to over 500 lines, not including the or