ARTICLE 

Using Machine Learning and Natural Language 
Processing to Analyze Library Chat Reference Transcripts 
Yongming Wang 

 

INFORMATION TECHNOLOGY AND LIBRARIES | SEPTEMBER 2022  
https://doi.org/10.6017/ital.v41i3.14967 

Yongming Wang (wangyo@tcnj.edu) is Systems Librarian, The College of New Jersey. © 2022. 

ABSTRACT 

The use of artificial intelligence and machine learning has rapidly become a standard technology 
across all industries and businesses for gaining insight and predicting the future. In recent years, the 
library community has begun looking at ways to improve library services by applying AI and machine 
learning techniques to library data. Chat reference in libraries generates a large amount of data in 
the form of transcripts. This study uses machine learning and natural language processing methods 
to analyze one academic library’s chat transcripts over a period of eight years. The resulting machine learning model classifies chat questions as either reference or nonreference questions. The goal is for the model to predict the category of incoming questions so that they can be channeled to the appropriate library department or staff.

INTRODUCTION 

Since the beginning of this century, artificial intelligence (AI) and machine learning (ML) have 
been used in almost all industries and businesses to gain knowledge and insights and predict the 
future. The large amount of data available has helped to accelerate the application of AI and ML in 
stunning speed. To follow this technology trend, the library community has begun looking at ways 
to improve library services by applying AI and ML techniques to library data. 

Stanford University Library is one of the pioneers in the research and application of ML and AI in 
the library. The mission of its Library AI Initiatives states: “The Library AI initiative is a program 
to identify, design, and enact applications of artificial intelligence that will help us make our rich 
collections more easily discoverable, accessible, and analyzable.”1 In 2019, Stanford University Library hosted the second International Conference on AI for Libraries, Archives, and Museums, titled Fantastic Futures.2

Many academic libraries have implemented chat reference services as a way to support student 
learning and academic research on campus. Chat reference serves as an important channel to 
connect the library’s resources and services to the campus community.3 Over the years, libraries 
have accumulated a large amount of data in the form of chat transcripts. Analyzing the content of 
transcripts can help the library understand users’ information needs, deploy library human 
resources more efficiently, and improve the quality of the chat reference service.  

The College of New Jersey’s library is a midsize academic library that serves a campus of 7,000 students, most of them undergraduates. The library began using Springshare’s LibChat in 2014.
The chat service is freely accessible online from the library’s website, and anyone can initiate a 
chat by asking an initial question through the chat box. Approximately 8,000 chat transactions 
have been accumulated over the past eight years.  




This study aims to use machine learning and natural language processing (NLP) techniques to 
build a classification model to categorize all available questions into two categories: reference and 
nonreference. By doing so, we hope that the model can automatically classify future chat questions 
received into either the reference question category or the nonreference question category, and 
channel the question to the appropriate library department or staff.  

LITERATURE REVIEW 

Traditionally, the analysis of chat transcripts has used qualitative or simple quantitative methods 
(e.g., chat frequency, duration). To better understand chat service quality and patrons’ information 
needs, librarians must manually review and read through chat transcripts, which requires a lot of 
time and effort.4 In recent years, however, the library field has started to witness the application of 
AI and ML techniques to analyze library data, including chat transcripts, in order to quickly and 
efficiently gain more insight into user information needs and information seeking patterns.  

Megan Ozeran and Piper Martin used topic modeling, a ML method, to analyze library chat 
reference conversations. The purpose of their project was to identify the most popular topics 
asked by library patrons in order to improve the chat reference service and to train the library 
staff.5 

The Brigham Young University library implemented a machine learning–based tool to perform 
various text analysis on transcripts of chat reference to gauge patron satisfaction levels and to 
classify patrons’ questions into several categories.6 Jeremy Walker and Jason Coleman used ML 
and NLP techniques to build models that predict the relative difficulty of incoming chat reference 
questions. They tested hundreds of models on a large sample of chat transcripts, aiming to help library professionals and management improve chat reference services.7 Another ML topic modeling project was carried out by HyunSeung Koh and Mark Fienup. Their study applied probabilistic latent semantic analysis (pLSA) to four years of library chat data, producing topics and subjects that were more accurate and interpretable than those from human qualitative evaluation.8

Another interesting ML project on chat reference data was conducted by Ellie Kohler, who used a machine learning model to analyze chat transcripts for sentiment and topic extraction.9

In addition to library chat data, ML has been also used to analyze other library data, including 
library digital collections and library tweet data. Jeremiah Flannery applied NLP summarization 
techniques on a special library digital collection of Catholic pamphlets. This project tried to 
automatically generate a summary for each digitized pamphlet using the BERT extractive summarization technique and the Gensim Python package.10 Sultan M. Al-Daihani and Alan Abrahams conducted a text mining analysis of academic libraries’ tweets using PamTAT, a tool developed by the Pamplin College of Business at Virginia Polytechnic Institute and State University. PamTAT is a Microsoft Excel–based interface to the Python NLTK package. The purpose of their
analysis was to try to identify the most common topics or subject keywords of the tweets by 10 
large academic libraries. In addition, they also ran Harvard General Inquirer for semantic and 
sentiment analysis of the tweets.11 

  




Other applications of ML techniques in the academic library include analyzing library operations 
such as acquisitions. In 2019, Kevin W. Walker and Zhehan Jiang from the University of Alabama used a machine learning method called adaptive boosting (AdaBoost) to predict demand-driven acquisition (DDA).12 Carlos G. Figuerola, Francisco Javier Garcia Marco, and Maria Pinto used topic modeling, specifically latent Dirichlet allocation (LDA), to identify the main topics and categories of 92,705 publications in the domain of library and information science from 1978 to 2014.13

PAIR (Projects in Artificial Intelligence Registry) is a repository and online global directory of AI 
projects in higher education. It is maintained by the University of Oklahoma Libraries. The aim of 
PAIR is to foster cross-institutional collaboration and to support grant activity in the field of 
artificial intelligence and machine learning in higher education.14 

Public libraries have started to seriously look at the application and impact of AI in the library. 
Frisco Public Library in Texas has developed a series of applications and programs to help train 
library staff in AI. They also developed artificial intelligence maker kits, including Google AIY Voice 
Kit, for circulation. They even provide introductory Python lessons to the public.15 

BACKGROUND OF NLP AND ML 

Natural language processing is a multidisciplinary field that draws on linguistics, computer science, and machine learning. Using computer algorithms, NLP builds machine learning models that are applied to large amounts of data in order to make predictions or decisions. The data in NLP is natural language data, that is, plain, unstructured text in any language.

NLP and ML have many applications in business and in daily life. With the growth of the internet, textual data has accumulated at a tremendous rate, from social media networks to customer online chat services. Major applications of NLP include sentiment analysis of social media data, topic modeling in the digital humanities, text classification, speech recognition, and search box autocorrection and autocompletion. The use cases are countless.

In general, there are two types of ML: supervised learning and unsupervised learning. In supervised learning, the dataset fed to the model is labeled in advance so that the model learns to classify data or predict outcomes accurately, whereas unsupervised learning learns patterns from unlabeled, untagged data.

Regardless of type, every ML/NLP project involves a series of general steps, often called the ML/NLP pipeline.

1. Data collection, which involves obtaining the raw textual data and usually means 
downloading data from some remote server or service.  

2. Data preprocessing, which is necessary for any project, large or small, because raw textual data is unstructured and not ready to be fed to the model. Data preprocessing usually includes removing punctuation, lowercasing all letters, tokenization, removing stop words, and stemming or lemmatization.

3. Feature engineering, which is optional but often very useful.

4. Text vectorization, which is the final step before feeding the data to the model. The purpose is to transform the text into numeric values.




5. Model building, evaluation, and optimization, which involves multiple cycles until the 
optimal or desired results are achieved.  

6. Implementation, in which the finished model is deployed in the real world.

METHODOLOGY 

For this ML/NLP project, the raw data came from the chat transcripts repository downloaded 
from Springshare’s server. From 2014 to 2021, a total of 8,000 chat reference transactions were 
logged. These transactions formed the raw dataset for model building and testing in this project.

Because the data is textual, Python was chosen for this project. The two major Python packages used in the project are NLTK and scikit-learn. NLTK (Natural Language Toolkit) is a suite of libraries and programs for natural language processing of English text; it supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Scikit-learn is a Python module built on NumPy, SciPy, and Matplotlib. Featuring various classification, regression, and clustering algorithms, including support-vector machines, random forests, gradient boosting, k-means, and DBSCAN, scikit-learn is a simple and efficient tool for predictive data analysis and one of the most popular Python modules for ML projects.

Data Collection 
Data collection includes both data gathering and data preparation. Data gathering is the process of 
downloading the 8,000 initial questions into an Excel file. Data preparation deals with initial data cleanup, such as removing blank rows. The most important task of data preparation is data labeling. Because this is a supervised-learning project, every question must be labeled by hand as either a reference question (label = Yes) or a nonreference question (label = No). The labeled questions are then fed to the ML model for training or testing. See table 1 for an example of data after the preparation step.

Table 1. Sample questions with Yes or No labels

Question sequential number | Label | Question
3979 | Yes | Working on an Alumni Reunion presentation. I need to know …
3980 | Yes | would a book with this call number: DS559.8.D7 G68 1991 …
3981 | No | Would a Rutgers student be able to take out a textbook from …
3982 | Yes | Would I be able to find mathematics textbooks by Pearson on …
3983 | No | Would I be able to log in to find an article if I am an alumni of …
3984 | Yes | Would it be possible to help me find a online essay?
3985 | No | Would like to renew: Huguenots [videorecording] / music by …
3986 | Yes | would like to request for a course description catalog from Fall …
3987 | No | Would someone be able to ask room 414 to quiet down please?
3988 | No | would someone be able to come up to floor 3 and tell people to …

 




Data Preprocessing 
Data preprocessing is the first programming step in the pipeline of this ML/NLP project. Data 
preprocessing transforms the raw data into a more digestible form so that the ML model can 
perform better and achieve the desired results. One of the purposes of data preprocessing is to 
remove insignificant and nonmeaningful words such as “a,” “the,” “and,” etc., as well as 
punctuation, from the textual data. Removing nonmeaningful and stop words from the corpus 
allows for a better result in the ML model, allowing it to deal only with significant and meaningful 
words.  

It is also necessary to convert all letters to lowercase. While we as humans know that lowercase and uppercase forms of a word have the same meaning, the computer treats them differently: “cat” and “Cat” are two different words to the computer. Tokenization splits each sentence into a list of individual words, typically at the spaces between them. The last step of data preprocessing is stemming or lemmatization, which reduces related word forms to a common base so that words with the same root are treated alike. For instance, run, running, and runner become “run”; library and libraries become “librari” (a stem); goose and geese become “goose” (a lemma).
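The preprocessing steps described above can be sketched as follows. This is a minimal illustration rather than the project’s actual script: the short stop word list is an assumption (the project would have used NLTK’s full English list), while the stemmer is NLTK’s Porter stemmer, which produces stems such as “librari” for “libraries.”

```python
import string

from nltk.stem import PorterStemmer

# Small illustrative stop word list; an assumption standing in for
# NLTK's full English stop word corpus.
STOP_WORDS = {"a", "an", "the", "and", "or", "to", "of", "is", "i", "be", "would"}

stemmer = PorterStemmer()

def preprocess(question):
    # 1. Lowercase so "Cat" and "cat" are treated as the same word.
    text = question.lower()
    # 2. Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Tokenize by splitting at the spaces between words.
    tokens = text.split()
    # 4. Remove stop words, then 5. stem each remaining token.
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Would it be possible to help me find an online essay?"))
```

Each labeled question in the dataset would pass through a function like this before vectorization.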

Feature engineering involves creating new features or transforming existing ones to help the model make better predictions. This step is optional but often very helpful if done right. In this project, a new feature called “question length” was created, based on the assumption that reference questions are, on average, longer than nonreference questions. If so, the model can use this new feature to make better decisions. Figure 1 is a histogram of the question length distribution, with reference questions represented in blue and nonreference questions in yellow.
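The assumption behind the feature is straightforward to check: compute each question’s length before any preprocessing and compare the averages per label. A sketch with hypothetical sample questions (not the study’s data):

```python
from collections import defaultdict

# Hypothetical labeled questions; "Yes" = reference, "No" = nonreference.
labeled = [
    ("Working on an Alumni Reunion presentation. I need sources on it", "Yes"),
    ("Would it be possible to help me find an online essay?", "Yes"),
    ("Would someone be able to ask room 414 to quiet down please?", "No"),
    ("Would like to renew: Huguenots [videorecording]", "No"),
]

# Question length is computed on the original text, before preprocessing,
# because stop word removal and stemming change the length.
lengths = defaultdict(list)
for text, label in labeled:
    lengths[label].append(len(text))

for label, values in sorted(lengths.items()):
    print(label, sum(values) / len(values))
```

If the average length for the Yes label exceeds that for the No label, the feature carries signal the model can exploit.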

Figure 2 shows a sample of the results after data preprocessing and feature engineering; from left to right, it lists the result of each step. The question length feature (Question_len column) immediately follows the original question because it is computed on the original text, before any other steps. The Question_lemma column holds the result after all preprocessing steps.




 

Figure 1. Histogram of question length distribution. 

 

 

Figure 2. Results from data preprocessing and feature engineering. 

  




Text Vectorization 
The purpose of text vectorization is to transform the text data into numeric data so that the ML algorithms can use it to build a model. The basic idea is to build an n-dimensional vector of numerical features that represents each object. The three most popular text vectorization methods are count vectorization, n-grams vectorization, and TF-IDF (term frequency–inverse document frequency) vectorization. Because TF-IDF weights terms, it is generally more accurate. Figure 3 shows the result of TF-IDF vectorization.

 

Figure 3. Result of TF-IDF vectorization. 

Model Building, Testing, and Evaluation 
The first step of model building is to divide the dataset into two sets, one for training the model and one for testing it. Normally 80% of the data is used for training and 20% for testing. After feeding the training data to the model, we feed the testing data to it as new data, and the model predicts the Yes or No label based on the patterns it learned from the training data. Because the testing data were labeled by humans, those labels are 100 percent accurate. By comparing the labels predicted by the model with the human-assigned labels in the testing data, we can see how the model performs and, if necessary, adjust the model’s parameters.

Scikit-learn contains several ML models. This project used two popular ones: random forest and gradient boosting. The random forest model builds many decision trees and computes them in parallel; the final decision is made by majority vote. Because the trees are computed in parallel, it is fast and efficient. The gradient boosting model builds one tree at a time, with each new tree correcting errors made by the previously trained trees; the model is then boosted (optimized) by reward or penalty. In theory, gradient boosting should yield better results than random forest, but it is slower and consumes more resources.
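The 80/20 split and the two classifiers can be sketched as below. The tiny dataset and its Yes/No labels are hypothetical placeholders for the 8,000 hand-labeled questions, not the study’s data.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Placeholder data: ten toy questions with hand-assigned Yes/No labels.
questions = [
    "find articles on climate policy", "help locating a book by call number",
    "need sources for my history paper", "how do i cite this journal article",
    "can you help me find statistics on housing",
    "renew my book please", "the printer on floor 2 is jammed",
    "please tell room 414 to quiet down", "what time does the library close",
    "i lost my id card at the desk",
]
labels = ["Yes"] * 5 + ["No"] * 5

X = TfidfVectorizer().fit_transform(questions)

# 80% of the labeled questions train the model; 20% are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42
)

for Model in (RandomForestClassifier, GradientBoostingClassifier):
    clf = Model(random_state=42)
    clf.fit(X_train, y_train)          # learn patterns from the labeled data
    predictions = clf.predict(X_test)  # predict Yes/No on the held-out data
    print(Model.__name__, list(predictions))
```

Comparing `predictions` against `y_test` is exactly the evaluation step the confusion matrix formalizes.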

A confusion matrix was used to evaluate the performance of the two models. Three metrics are derived from it: accuracy, precision, and recall. Accuracy equals true positives plus true negatives, divided by the total. Precision equals true positives divided by true positives plus false positives. Recall equals true positives divided by true positives plus false negatives.

Usually there is a tradeoff between precision and recall. Recall reflects the number of false negatives (reference questions that the model predicts as nonreference), while precision reflects the number of false positives (nonreference questions that the model predicts as reference). Which is more important to minimize depends on the actual situation. In our case, a false negative is more serious than a false positive because we did not want real reference questions to be misclassified as nonreference questions, whereas the reverse was acceptable. Therefore, we wanted as few false negatives as possible, which means the largest recall value possible.
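The three metrics follow directly from the four confusion matrix counts. A worked example with hypothetical counts (not the study’s figures):

```python
# Hypothetical confusion matrix counts for a test set of 161 questions.
# Here "positive" = reference question, "negative" = nonreference question.
tp = 90   # reference questions correctly predicted as reference
tn = 60   # nonreference questions correctly predicted as nonreference
fp = 8    # nonreference questions wrongly predicted as reference (acceptable)
fn = 3    # reference questions wrongly predicted as nonreference (costly)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)   # the metric to maximize, per the discussion above

print(round(accuracy, 3), round(precision, 3), round(recall, 3))  # → 0.932 0.918 0.968
```

Lowering `fn` even slightly pushes recall up faster than the same change to `fp` pushes precision, which is why the study tunes for recall.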

RESULTS AND ANALYSIS 

Table 2 lists the results from both models.

Table 2. Results of random forest model and gradient boosting model 

Model | Precision | Recall | Accuracy | Fit time | Predict time
Random forest | 0.914 | 0.964 | 0.912 | 2.489 s | 0.150 s
Gradient boosting | 0.904 | 0.948 | 0.894 | 97.786 s | 0.064 s

 

In general, any metric value above 0.9 (90%) is very good. Comparing the results, both models performed well, but the random forest model outperformed the gradient boosting model on all three metrics. In addition, the fit time of the random forest model was much shorter than that of the gradient boosting model. Although the random forest model’s predict time was slightly longer, the difference is insignificant.

Therefore, the random forest model was chosen for the final model for this project.  

CONCLUSION AND FUTURE WORK 

In this pilot study, we used the classification modeling of NLP and ML techniques to divide 
patrons’ chat questions into two categories: reference questions and nonreference questions. The 
purpose of the model is to predict the category of future questions received through chat so that 
library staff and professionals can provide faster, more efficient reference services. Two machine learning models were tested: random forest and gradient boosting. After comparing results from each model, it was concluded that the random forest model performed better.

What is the next step after the model is built? A potential use of this model is to implement it as a plugin or feature enhancement for the online chat application. The model can function as a filter, directing incoming questions to reference librarians when a question is predicted to be a reference question, or to library staff or graduate student assistants when it is predicted to be a nonreference question. This will be especially useful for libraries with busy online chat services.




Further work could extend the model to multiple categories. For example, a multicategory model could go beyond two classes to include categories for information seeking, citation help, printing help, noise complaints, interlibrary loan questions, spam, etc. The model could then route each question to the relevant department or library personnel accordingly.

ENDNOTES 
 

1 “Stanford University Library AI Initiative,” Stanford University Library, 
https://library.stanford.edu/projects/artificial-intelligence. 

2 “Fantastic Futures: 2nd International Conference on AI for Libraries, Archives, and Museums,” 
(2019), Stanford University Library, https://library.stanford.edu/projects/fantastic-futures. 

3 Christina M. Desai and Stephanie J. Graves, “Cyberspace or Face-to-Face: The Teachable Moment 
and Changing Reference Mediums,” Reference & User Services Quarterly 47, no. 3 (Spring 2008): 
242–55, https://www.jstor.org/stable/20864890. 

4 Sharon Q. Yang and Heather A. Dalal, “Delivering Virtual Reference Services on the Web: An 
Investigation into the Current Practice by Academic Libraries,” Journal of Academic 
Librarianship 41, no. 1 (November 2015): 68–86, 
https://doi.org/10.1016/j.acalib.2014.10.003. 

5 Megan Ozeran and Piper Martin, “Good Night, Good Day, Good Luck: Applying Topic Modeling to 
Chat Reference Transcripts,” Information Technology and Libraries 38, no. 2 (June 2019): 49–
57, https://doi.org/10.6017/ital.v38i2.10921. 

6 Christopher Brousseau, Justin Johnson, and Curtis Thacker, “Machine Learning Based Chat 
Analysis,” Code4Lib Journal, no. 50 (2021), https://journal.code4lib.org/articles/15660. 

7 Jeremy Walker and Jason Coleman, “Using Machine Learning to Predict Chat Difficulty,” College & 
Research Libraries 82, no. 5 (2021), https://doi.org/10.5860/crl.82.5.683. 

8 HyunSeung Koh and Mark Fienup, “Topic Modeling as a Tool for Analyzing Library Chat 
Transcripts,” Information Technology and Libraries 40, no. 3 (2021), 
https://doi.org/10.6017/ital.v40i3.13333. 

9 Ellie Kohler, “What Do Your Library Chats Say? How to Analyze Webchat Transcripts for 
Sentiment and Topic Extraction” (17th Annual Brick & Click Libraries Conference, Maryville, 
Missouri: Northwest Missouri State University, 2017). 

10 Jeremiah Flannery, “Using NLP to Generate MARC Summary Fields for Notre Dame’s Catholic 
Pamphlets,” International Journal of Librarianship 5, no.1 (2020): 20–35, 
https://doi.org/10.23974/ijol.2020.vol5.1.158. 

11 Sultan M. Al-Daihani and Alan Abrahams, “A Text Mining Analysis of Academic Libraries’ 
Tweets,” The Journal of Academic Librarianship 42, no. 2 (2016): 135–43, 
https://doi.org/10.1016/j.acalib.2015.12.014. 

 




 

12 Kevin W. Walker and Zhehan Jiang, “Application of Adaptive Boosting (AdaBoost) in Demand -
Driven Acquisition (DDA) Prediction: A Machine-Learning Approach,” The Journal of Academic 
Librarianship 45, no. 3 (2019): 203–12, https://doi.org/10.1016/j.acalib.2019.02.013. 

13 Carlos G. Figuerola, Francisco Javier Garcia Marco, and Maria Pinto, “Mapping the Evolution of 
Library and Information Science (1978–2014) Using Topic Modeling on LISA,” Scientometrics 
112 (2017): 1507–35, https://doi.org/10.1007/s11192-017-2432-9. 

14 “Projects in Artificial Intelligence Registry (PAIR): A Registry for AI Projects in Higher Ed,” 
University of Oklahoma Libraries, https://pair.libraries.ou.edu/. 

15 Thomas Finley, “The Democratization of Artificial Intelligence: One Library’s Approach,” 
Information Technology and Libraries 38, no. 1 (2019): 8–13, 
https://doi.org/10.6017/ital.v38i1.10974. 

