
Harvesting Information from a Library Data Warehouse

Siew-Phek T. Su and Ashwin Needamangala

Data warehousing technology has been defined by John Ladley as "a set of methods, techniques, and tools that are leveraged together and used to produce a vehicle that delivers data to end users on an integrated platform."1 This concept has been applied increasingly by industries worldwide to develop data warehouses for decision support and knowledge discovery. In the academic sector, several universities have developed data warehouses containing the universities' financial, payroll, personnel, budget, and student data.2 These data warehouses across all industries and academia have met with varying degrees of success. Data warehousing technology and its related issues have been widely discussed and published.3 Little has been done, however, on the application of this cutting-edge technology in the library environment using library data.

Motivation of Project
Daniel Boorstin, the former Librarian of Congress, mentions that "for most of Western history, interpretation has far outrun data."4 However, he points out "that modern tendency is quite the contrary, as we see data outrun meaning." His insights tie directly to many large organizations that have long been rich in data but poor in information and knowledge. Library managers increasingly find it important to obtain a comprehensive and integrated view of library operations and the services the library provides. This view is helpful for making decisions on current operations and for their improvement. Due to financial and human constraints on library support, library managers increasingly encounter the need to justify everything they do, for example, the library's operating budget. The most frustrating problem they face is knowing that the information needed is available somewhere in the ocean of data but that there is no easy way to obtain it. For example, it is not easy to ascertain whether the materials of a certain subject area, which consumed considerable financial resources for their acquisition and processing, are frequently used (i.e., have a high rate of circulation), seldom used, or not used at all, or whether they satisfy users' needs. As another example, an analysis of the methods of acquisition (firm order vs. approval plan) together with the circulation rate could be used as a factor in deciding the best method of acquiring certain types of material. Such information can play a pivotal role in performing collection development and library management more efficiently and effectively. Unfortunately, the data needed to make these types of decisions are often scattered in different files maintained by a large centralized system, such as NOTIS, that does not provide a general querying facility, or by different file/data management or application systems. This situation makes it very difficult and time-consuming to extract useful information. This is precisely where data warehousing technology comes in.

The goal of this research and development work is to apply data warehousing and data mining technologies in the development of a Library Decision Support System (LDSS) to aid the library management's decision making. The first phase of this work is to establish a data warehouse by importing selected data from separately maintained files presently used in the George A. Smathers Libraries of the University of Florida into a relational database system (Microsoft Access). Data stored in the existing files were extracted, cleansed, aggregated, and transformed into the relational representation suitable for processing by the relational database management system. A graphical user interface (GUI) was developed to allow decision makers to query the data warehouse's contents using either predefined queries or ad hoc queries. The second phase is to apply data mining techniques to the library data warehouse for knowledge discovery. This paper covers the first phase of this research and development work. Our goal is to develop a general methodology and inexpensive software tools that can be used by different functional units of a library to import data from different data sources and to tailor different warehouses to meet their local decision needs. To meet this objective, we do not have to use a very large centralized database management system to establish a single very large data warehouse to support different uses.

Local Environment
The University of Florida Libraries has a collection of more than two million titles, comprising over three million volumes. It shares a NOTIS-based integrated system with nine other State University System (SUS) libraries for acquiring, processing, circulating, and accessing its collection. All ten SUS libraries are under the consortium umbrella of the Florida Center for Library Automation (FCLA).

Siew-Phek T. Su (pheksu@mail.uflib.ufl.edu) is Associate Chair of the Central Bibliographic Services Section, Resource Services Department, University of Florida Libraries, and Ashwin Needamangala (nsashwin@grove.ufl.edu) is a graduate student at the Electrical and Computer Engineering Department, University of Florida.


Library Data Sources
The University of Florida Libraries' online database, LUIS, stores a wealth of data, such as bibliographic data (author, title, subject, publisher information), acquisitions data (price, order information, fund assignment), circulation data (charge-out and browse information, withdrawn and inventory information), and owning location data (where an item is shelved). These voluminous data are stored in separate files. The NOTIS system as used by the University of Florida does not provide a general querying facility for accessing data across different files. Extracting any information needed by a decision maker has to be done by writing an application program to access and manipulate these files. This is a tedious task, since many application programs would have to be written to meet the different information needs. The challenge of this project is to develop a general methodology and tools for extracting useful data and metadata from these disjointed files, and to bring them into a warehouse that is maintained by a database management system such as Microsoft Access. The selection of Access and PC hardware for this project is motivated by cost considerations. We envision that multiple special-purpose warehouses will be established on multiple PC systems to provide decision support to different library units.

The Library Decision Support System (LDSS) is developed with the capability of handling and analyzing an established data warehouse. To test our methodology and software system, we established a warehouse based on twenty thousand monograph titles acquired from our major monograph vendor. These titles were published by domestic U.S. publishers and have a high percentage of DLC/DLC records (titles cataloged by the Library of Congress). They were acquired by firm order and approval plan. The publication coverage is the calendar years 1996-1997. Analysis covers only the first item record (a future project will include all copy holdings). Although the size of the test data used is small, it is sufficient to test our general methodology and the functionality of our software system.

FCLA DB2 Tables and Key List

Most of the data from the twenty-thousand-title domain that go into the LDSS warehouse are obtained from the DB2 tables maintained by FCLA. FCLA developed and maintains the database of a system called Ad Hoc Report Request Over the Web (ARROW) to facilitate querying and generating reports on acquisitions activities. The data are stored in DB2 tables.5

For our research and development purpose, we needed DB2 tables for only the twenty thousand titles that we identified as our initial project domain. These titles all have an identifiable 035 field in the bibliographic records (zybp1996, zybcip1996, zybp1997, or zybpcip1997). We used the BatchBAM program developed by Gary Strawn of Northwestern University Library to extract and list the unique bibliographic record numbers in separate files for FCLA to pick up.6 Using the unique bibliographic record numbers, FCLA extracted the DB2 tables from the ARROW database and exported the data to text files. These text files then were transferred to our system using the file transfer protocol (FTP) and inserted as tables into the LDSS warehouse.
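The transfer-and-load step is mechanical enough to script. The following sketch shows, in Python rather than the Visual Basic used elsewhere in the project, one way such a load could be done. It assumes a tab-delimited export file, an existing UFPAY_WH table as in appendix A, and the Microsoft Access ODBC driver; the file name, database path, and column list are illustrative only.

# Sketch: load an FCLA DB2 text export into an Access warehouse table.
# Assumptions (not from the article): tab-delimited export, an existing
# UFPAY_WH table as in appendix A, and the Access ODBC driver installed.
import csv
import pyodbc

conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\ldss\warehouse.mdb"  # hypothetical warehouse path
)
cur = conn.cursor()

with open("ufpay_export.txt", newline="") as f:  # hypothetical file name
    for row in csv.reader(f, delimiter="\t"):
        # Insert a few illustrative columns; the real table has more.
        cur.execute(
            "INSERT INTO UFPAY_WH (Inv_key, Ord_num, Bib_key, Paid_amt) "
            "VALUES (?, ?, ?, ?)",
            row[0], row[1], row[2], float(row[3]),
        )

conn.commit()
conn.close()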

Bibliographic and Item Records Extraction 

FCLA collects and stores complete acquisitions data from the order records as DB2 tables. However, only brief bibliographic data and no item record data are available. Bibliographic and item record data are essential for inclusion in the LDSS warehouse in order to create a viable integrated system capable of performing cross-file analysis and querying for the relationships among different types of data. Because these required data do not exist in any computer-readable form, we designed a method to obtain them. Using the same NOTIS key lists to identify the targeted twenty thousand bibliographic and item records, we applied a screen scraping technique to scrape the data from the screen and saved them in a flat file. We then wrote a program in Microsoft Visual Basic to clean the scraped data and save them as text-delimited files suitable for importing into the LDSS warehouse.

Screen Scraping Concept 

Screen scraping is a process used to capture data from a host application. It is conventionally a three-part process:

• displaying the host screen or data to be scraped;
• finding the data to be captured;
• capturing the data to a PC or host file, or using it in another Windows application.

In other words, we can capture particular data on the screen by providing the corresponding screen coordinates to the screen scraping program. Numerous commercial applications for screen scraping are available on the market. However, we used an approach slightly different from the conventional one. Although we had to capture only certain fields from the NOTIS screen, there were other factors that we had to take into consideration:

• The location of the various fields with respect to the screen coordinates changes from record to record. This makes it impossible for us to lock a particular field to a corresponding screen coordinate.


• The data present on the screen are dynamic because we are working on a "live" database where data are frequently modified. For accurate query results, all the data, especially the item record data where the circulation transactions are housed, need to be captured within a specified time interval so that the data are uniform. This makes the time taken for capturing the data extremely important.

• Most of the fields present on the screen needed to be captured.

Taking the above factors into consideration, we decided to capture the entire screen instead of scraping only certain parts of it. This made the process both simpler and faster. The unnecessary fields were filtered out during the cleanup process.

System Architecture
The architecture of the LDSS system is shown in figure 1 and is followed by a discussion of its components' functions.

NOTIS 

NOTIS (Northwestern Online Totally Integrated System) was developed at the Northwestern University Library and introduced in 1970. Since its inception, NOTIS has undergone many versions. The University of Florida Libraries is one of the earliest users of NOTIS, and FCLA has made many local modifications to the NOTIS system since the UF Libraries started using it. As a result, UF NOTIS differs from the rest of the NOTIS world in many respects. NOTIS can be broken down into four subsystems:

• acquisitions
• cataloging
• circulation
• online public access catalog (OPAC)

At the University of Florida Libraries, the NOTIS system runs on an IBM 370 mainframe computer under the OS/390 operating system.

Host Explorer 

Host Explorer is a software program that provides a TCP/IP link to the mainframe computer. It is a terminal emulation program supporting IBM mainframe, AS/400, and VAX hosts. Host Explorer delivers an enhanced user environment for all Windows NT platforms, Windows 95, and Windows 3.x desktops. Exact TN3270E, TN5250, VT420/320/220/101/100/52, WYSE 50/60, and ANSI-BBS display support is extended to leverage the wealth of the Windows desktop. It also supports all TCP/IP-based TN3270 and TN3270E gateways.

Figure 1. LDSS Architecture and Its Components [NOTIS and the FCLA DB2 tables feed the LDSS, whose components are Host Explorer, data cleansing and extraction, the warehouse, and the graphical user interface]

The Host Explorer program is used as the terminal emulation program in LDSS. It also provides VBA-compatible BASIC scripting tools for complete desktop macro development. Users can run these macros directly or attach them to keyboard keys, toolbar buttons, and screen hotspots for additional productivity.

The function of Host Explorer in the LDSS is very simple. It has to "visit" all screens in the NOTIS system corresponding to each NOTIS number present in the BatchBAM file and capture all the data on the screens. To do this, we wrote a macro that read the NOTIS numbers one at a time from the BatchBAM file and fed each number into the command string of Host Explorer. The macro essentially performed the following functions:

• read the NOTIS numbers from the BatchBAM file;
• inserted each NOTIS number into the command string of Host Explorer;
• toggled the Screen Capture option in Host Explorer so that data are scraped from the screen only at necessary times;
• saved all the scraped data into a flat file.

After the macro has been executed, all the data scraped from the NOTIS screens reside in a flat file. The data in this file have to be cleansed in order to make them suitable for insertion into the Library Warehouse. A Visual Basic program was written to perform this function; the details of this program are given in the next section.
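The macro itself is written in Host Explorer's BASIC scripting language; the following Python fragment only mirrors its control flow. The fetch_screen function is a hypothetical stand-in for the emulator interaction (inserting the NOTIS number into the command string, waiting for the record screen, and reading the screen buffer).

# Sketch of the capture loop: for each NOTIS number in the BatchBAM list,
# display the corresponding record screen and append the raw screen text
# to a single flat file. fetch_screen() is a hypothetical stand-in for
# the Host Explorer interaction.
def fetch_screen(notis_number: str) -> str:
    raise NotImplementedError("terminal emulator interaction goes here")

with open("batchbam.txt") as keys, open("scraped.txt", "w") as out:
    for line in keys:
        notis_number = line.strip()
        if notis_number:
            out.write(fetch_screen(notis_number))
            out.write("\n")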

Data Cleansing and Extraction
This component of the LDSS is written in the Visual Basic programming language. Its main function is to cleanse the data that have been scraped from the NOTIS screens. The Visual Basic code saves the cleansed data in a text-delimited format that is recognized by Microsoft Access. This file is then imported into the Library Warehouse maintained by Microsoft Access. The detailed working of the code that performs the cleansing operation is discussed below.

The NOTIS screen that comes up for each NOTIS number has several parts that are critical to the working of the code. They are:

• the NOTIS number present in the top right of the screen (in this case, AKR9234);

• the field numbers that have to be extracted, for example, 010:: and 035::;

• the delimiters. The "|" symbol is used as the delimiter throughout this code. For example, in the 260 field of a bibliographic record, "|a" delimits the place of publication, "|b" the name of the publisher, and "|c" the date of publication.

We shall now go step by step through the cleansing process. Initially we have the flat file containing all the data that have been scraped from the NOTIS screens.

• The entire list of NOTIS numbers from the BatchBAM file is read into an array called Bam_Number$.

• The file containing the scraped data is read into a single string called BibRecord$.

• This string is then parsed using the NOTIS numbers from the Bam_Number$ array, yielding a string, Single_Record$, that contains a single NOTIS record.

• The program loops until all the records have been read.

• Each record string is broken down into several smaller strings based on the field numbers. Each of these smaller strings contains the data pertaining to the corresponding field number.

• A considerable amount of the data present on the NOTIS screen is unnecessary from the point of view of our project. We need only certain fields from the NOTIS screen, and even from these fields we need the data only from certain delimiters. Therefore, each of the smaller strings is scanned for a set of delimiters predefined for each individual field, and the data present in the other delimiters are discarded.

• The data collected from the various fields and their corresponding delimiters are assigned to corresponding variables. Some variables contain data from more than one delimiter concatenated together, for the following reason. Certain fields are present in the database only for informational purposes and will never be used as a criterion field in any query. Since these fields will never be queried upon, they do not need to be cleansed as rigorously as the other fields, and we can afford to leave their data as concatenated strings. For example, the Catalog_source field, which has data from "|a" and "|c", is of the form "|a DLC |c DLC", while the Lang_code field, which has data from "|a" and "|h", is of the form "|a eng |h rus". The latter we split into two fields: Lang_code_1 containing "eng" and Lang_code_2 containing "rus".

• The data collected from the various fields are saved in a flat file in the text-delimited format, which Microsoft Access recognizes.

A screen dump of the text-delimited file, which is the end result of the cleansing operation, is shown in figure 2. The flat file we now have can be imported into the Library Warehouse.
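To make the parsing logic concrete (the production code is in Visual Basic), here is a minimal Python sketch. Only the "|" subfield convention, the field-number markers such as 010:: and 035::, and the treatment of fields such as Catalog_source and Lang_code come from the article; the exact field tags kept, the record-separator pattern, and the output column order are simplified assumptions.

# Sketch of the cleansing pass: split the scraped flat file into records
# by NOTIS number, keep only selected "|" subfields of selected fields,
# and write a text-delimited (CSV) file that Access can import.
import csv
import re

# Subfields to keep per field tag -- an illustrative mapping only.
KEEP = {"040": ["a", "c"], "041": ["a", "h"]}

def subfields(field_text):
    """Split '|a eng |h rus' into {'a': 'eng', 'h': 'rus'}."""
    parts = re.split(r"\|(\w)", field_text)
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}

def cleanse(record):
    row = {}
    for line in record.splitlines():
        m = re.match(r"(\d{3})::\s*(.*)", line)  # e.g. "041:: |a eng |h rus"
        if m and m.group(1) in KEEP:
            subs = subfields(m.group(2))
            for code in KEEP[m.group(1)]:
                row[m.group(1) + code] = subs.get(code, "Null")
    return [row.get(k, "Null") for k in ("040a", "040c", "041a", "041h")]

with open("scraped.txt") as f, open("cleansed.txt", "w", newline="") as out:
    writer = csv.writer(out)  # text-delimited output for Access
    # Records are assumed to start with a NOTIS number line, e.g. "AKR9234".
    for rec in re.split(r"(?=^[A-Z]{3}\d{4}\s*$)", f.read(), flags=re.M):
        if rec.strip():
            writer.writerow(cleanse(rec))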

Graphical User Interface
To ease the tasks of the user (i.e., the decision maker) in creating the library warehouse and in querying and analyzing its contents, a graphical user interface tool has been developed. Through the GUI, the user can enact the following processes or operations through a main menu:

• connection to NOTIS 
• screen scraping 
• data cleansing and extracting 
• importing data 
• viewing collected data 
• querying 
• report generating 

The first option opens Host Explorer and provides a connection to NOTIS. It provides a shortcut for closing or minimizing LDSS and opening Host Explorer. The Screen Scraping option activates the data scraping process. The Data Cleansing and Extracting option filters out the unnecessary data fields and saves the cleansed data in a text-delimited format. The Importing Data option imports the text-delimited data into the warehouse. The Viewing Collected Data option allows the user to view the contents of a selected relational table stored in the warehouse.


"RECORD HUMBER","System Control Humber","Catalogin Source","Language Codes 1","Language Code~ 
"AKR9234", "YBP1996 0507--CLARR done", "a DLC I c DLC ", "1 : I a eng "," I h rus", "e-ur-ru", "306/. 0~ 
"RKS6472", "YBP1996 0507--CLRRR done"," a DLC I c DLC ", "1 : I a eng "," I h rus", "Hull", "891. 73/ 44 
"AKS6493", "YBP1996 0507--CLARR done"," a DLC I c DLC ","Hull", "Hull", "Hull"," 001. 4/225/ 028563 I ~f 
"AJX7554", "YBP1996 05 08--CLARR done"," a Uk I c Uk ","Hull", "Hull", "e-uk---", "362. 1 / 068 12 2 O",' 
"AKB3478", "YBP1996 05 08--CLARR done"," a DLC c DLC ","Hull", "Hull", "e-fr---", "843/. 7 12 2 O", "t " 
"AKC6442","YBP19960508--CLARR done","a DLC c DLC ","1 : la eng ","lh ger","e-fr---","194 12 
"AKE9837", "YBP1996 0508--CLARR done"," a DLC c DLC ","Hull", "Hull", "e-gr---", "883/. 01 12 20",' 
"AKK9486", "YBP1996 0508--CLARR done", "a DLC c DLC ","Hull", "Hull", "e-uk---", "822/. 052309 12 ~% 
l'AKL2258", "YBP1996 05 08--CLARR done"," a DLC c DLC ","Hull", "Hull", "e-xr---", "929. 4/2/ 08992401 1• 
"AKM2455", "YBP1996 05 08--CLARR done"," a DLC c DLC ","Hull", "Hull", "e-gx---", "943. 086 12 2 O",' 
"AKM4649", "YBP1996 0508--CLARR done"," a DLC c DLC ","Hull", "Hull", "Hull", "863/ .64 I 2 20", "Hu] ' 
"AKH0246","YBP19960508--CLARR done","a DLC c DLC ","Hull","Hull","n-us--- la e-uk-en","700/. 
"AKH181 O", "YBP1996 05 08--CLARR done"," a DLC c DLC ","Hull" ,"Hull", "e-uk---", "305. 6/2 042/ 0903.: 
"AKH3749","YBP19960508--CLARR done","a DLC c DLC ","Hull","Hull","f-ke--- la f-so - --","327.{ 
"AKQ727 4", "YBP1996 05 08--CLARR done"," a DLC c DLC ","Hull", "Hull", "Hull", "355. 4/2 12 2 O", "Hu] 
"AKQ9180", "Y.BP1996 0508--CLARR done", "a DLC c DLC ","Hull", "Hull", "n-us---", "23 0/. 93/ 09 12 2,f 
"AKR 0424", "YBP1996 05 08--CLARR done"," a DLC c DLC ","Hull", "Hull", "n-us-mi", "331 . 88/1292/ 097' 
"RKR1411", "YBP1996 05 08--CLARR done"," a CL I c CL ","Hull", "Hull", "n-us---", "3 05. 896/ 073 12 2 O' 
"AKR1846", "YBP1996 05 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "e-uk-ni", "Hull", "Hull", "x, 
"AKR2169", "YBP1996Jt5 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "n-us-sc", "323. 1/196073/ 091 
"AKR2245" ,"YBP19960508--C .LARR d.one" ," a DLC I c DLC ","Hull", "Hull", "Hull", "306 .4/6 I 2 20", "Hu1 
"AKR2255", "YBP1996 05 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "Hull", "3 03. 48/2 12 2 O", "2r 
"AKR226 O", "YBP1996 0508--CLARR done"," a DLC I c DLC ","Hull", "Hull", "n-us- - -", "3 03. 48/2 12 2 O", 
"AKR2281", "YBP1996 05 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "t------ I a r------", "333. , · 
"AKR2287", "YBP1996 05 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "Hull", "57 4. 5/262 12 2 O", "t 
"RKR2357", "YBP1996 05 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "e------", "361 . 6/1 / 094 12 l 
"AKR2358", "YBP1996 0508--CLARR done"," a DLC I c DLC ","Hull", "Hull" ,"Hull", "333. 7/2/01 12 20" ,' ¥' 
"AKR2371", "YBP1996 05 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "e------", "3 07. 72/ 094 12 211 
"AKR2386", "YBP1996 05 08--CLARR done", "DLC I c DLCI", "Hull" ,/'Hull", "e-uk---", "Hull", "Hull", "xu, 
"RKR25 03", "YBP1996 05 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "Hull", "575. 1 / 09 12 2 O", "HL 

'i-r---· -----
-----·-----

Figure 2. A Text-Delimited File 

The Querying option activates LDSS's querying facility, which provides wizards to guide the formulation of different types of queries, as discussed later in this article. The last option, Report Generating, is for the user to specify the report to be generated.

Data Mining Tool
A very important component of LDSS is the data mining tool for discovering association rules that specify the interrelationships of data stored in the warehouse. Many data mining tools are now available in the commercial world. For our project, we are investigating the use of a neural-network-based data mining tool developed by Li-Min Fu of the University of Florida.7 The tool allows the discovery of association rules based on a set of training data provided to the tool. This part of our research and development work is still in progress. The existing GUI and report generation facilities will be expanded to include the use of this mining tool.

Library Warehouse
FCLA exports the data existing in the DB2 tables into text files. As a first step toward creating the database, these text files are transferred using FTP and form separate relational tables in the Library Warehouse. The data that are scraped from the bibliographic and item record screens result in the formation of two more tables.

Characteristics 

Data in the warehouse are snapshots of the original data files. Only a subset of the data contents in these files is extracted for querying and analysis, since not all the data are useful for a particular decision-making situation. Data are filtered as they pass from the operational environment to the data warehouse environment. This filtering process is particularly necessary when a PC system, which has limited secondary storage and main memory space, is used. Once extracted and stored in the warehouse, data are not updateable; they form a read-only database. However, different snapshots of the original files can be imported into the warehouse for querying and analysis, and the results of the analyses of different snapshots can then be compared.

Structure 

Data warehouses have a distinct structure. There are 
summarization and detail structures that demarcate a 
data warehouse. The structure of the Library Data 
Warehouse is shown in figure 3. 

The different components of the Library Data 
Warehouse as shown in figure 3 are: 

• NOTIS and DB2 Tables. Bibliographic and circulation data are obtained from NOTIS through the screen scraping process and imported into the warehouse. FCLA maintains acquisitions data in the form of DB2 tables; these are also imported into the warehouse after conversion to a suitable format.

• Warehouse. The warehouse consists of several relational tables connected by means of relationships. The universal relation approach could have been used instead, implementing the warehouse as a single table on the argument that all the collected data fall under the same domain. But let us examine why this approach would not have been suitable. The data collected for import into the warehouse were bibliographic data, circulation data, order data, and pay data. If all these data were incorporated into one single table with many attributes, the table would not be of any exceptional use, since each set of attributes has its own unique meaning when grouped together as a bibliographic table, circulation table, and so on. Grouping the circulation data and the pay data together in a single table, for example, would not make sense; the pay data and the circulation data are instead related through the Bib_key. Hence, our use of the conventional approach of several tables connected by means of relationships is more appropriate.

Figure 3. Structure of the Library Data Warehouse [the user accesses bibliographic, circulation, and pay data views derived from the warehouse tables (Ufbib, Ufpay, Ufinv, Ufcirc, Uford), which are populated from NOTIS via screen scraping and from the FCLA DB2 tables via import]

• Views. A view in SQL terminology is a single table that is derived from other tables. These other tables could be base tables or previously defined views. A view does not necessarily exist in physical form; it is considered a virtual table, in contrast to base tables, whose tuples are actually stored in the database. In the context of the LDSS, views can be implemented by means of the Ad Hoc Query Wizard. The user can define a query/view using the Wizard and save it for future use. The user can then define a query on this query/view (a sketch of such a view follows this list).

• Summarization. Implementing views is part of the summarization process. Summarization provides the user with views that make it easier to query the data of interest.
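As an illustration, a circulation data view like the one in figure 3 could be expressed as a saved query over the appendix A tables, joined on the Bib_key. The SQL below is a sketch executed here through Python and ODBC; in the actual system such views are defined and saved through the Ad Hoc Query Wizard rather than written by hand, and the database path is hypothetical.

# Sketch: a "circulation data view" -- titles joined with their
# circulation counts via Bib_key, using table and column names
# from appendix A.
import pyodbc

conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\ldss\warehouse.mdb"  # hypothetical path
)
CIRC_VIEW_SQL = """
    SELECT b.Bib_key, b.Subject_1, c.Charges, c.Browses, c.Last_Use
    FROM UFBIB_WH AS b INNER JOIN UFCIRC_WH AS c
        ON b.Bib_key = c.Bib_key
"""
for row in conn.cursor().execute(CIRC_VIEW_SQL):
    print(row.Bib_key, row.Subject_1, row.Charges)
conn.close()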

As explained above, the specific warehouse we established consists of five tables. A table name ending in "_WH" indicates that the table contains current detailed data of the warehouse, that is, the most recent snapshot of data taken from the NOTIS system. The summarized views are derived from the current detailed data. Since the current detailed data are the basic data of the application, only the current detailed data tables are shown in appendix A.

Decision Support by Querying the Warehouse

The warehouse contains a set of integrated relational tables whose contents are linked by the common primary key, the Bib_key (Biblio_key). The data stored across these tables can be traversed by matching the key values associated with their tuples or records. Decision makers can issue all sorts of SQL-type queries to retrieve useful information from the warehouse. Two general types of queries can be distinguished: predefined queries and ad hoc queries. The former type refers to queries that are frequently used by decision makers for accessing information from different snapshots of data imported into the warehouse. The latter type refers to queries that are exploratory in nature: a decision maker suspects that there is some relationship between different types of data and issues a query to verify the existence of such a relationship. Alternatively, data mining tools can be applied to analyze the data contents of the warehouse and discover rules of their relationships (or associations).

Predefined Queries 

Below are some sample queries posed in English. Their corresponding SQL queries can be processed using LDSS.

1. Number and percentage of approval titles circulated and noncirculated.
2. Number and percentage of firm order titles circulated and noncirculated.
3. Amount of financial resources spent on acquiring noncirculated titles.
4. Number and percentage of DLC/DLC cataloging records in circulated and noncirculated titles.
5. Number and percentage of "shared" cataloging records in circulated and noncirculated titles.
6. Numbers of original and "shared" cataloging records of noncirculated titles.
7. Identify the broad subject areas of circulated and noncirculated titles.
8. Identify titles that have been circulated "n" number of times and by subjects.
9. Number of circulated titles without the 505 field.

Each of the above English queries can be realized by a number of SQL queries. We shall use the first two English queries and their corresponding SQL queries to explain how the data warehouse contents and the querying facility of Microsoft Access can be used to support decision making. The results of the SQL queries also are given. The first English query can be divided into two parts (see figure 4), each realized by a number of SQL queries as shown below.

Sample Query Outputs

Query 1: Number and percentage of approval titles circulated and noncirculated

Result: Total approval titles   1172
        Circulated               980   83.76%
        Noncirculated            192   16.24%

Similar to the above SQL queries, we can translate the second English query into a number of SQL queries; the result is given below.

Query 2: Number and percentage of firm order titles circulated and noncirculated

Result: Total firm order titles   1829
        Circulated                1302   71.18%
        Noncirculated              527   28.82%

Report Generation

The results of the two predefined English queries can be presented to users in the form of a report.

Total titles        3001
Approval            1172   39%
  Circulated         980   83.76%
  Noncirculated      192   16.24%
Firm Order          1829   61%
  Circulated        1302   71.18%
  Noncirculated      527   28.82%

From the above report, we can ascertain that, though 39 percent of the titles were purchased through the approval plan and 61 percent through firm orders, the approval titles have a higher rate of circulation (83.76 percent) than the firm order titles (71.18 percent). It is important to note that the result of the above queries is taken from only one snapshot of the circulation data. Analysis of several snapshots is needed in order to compare the results and arrive at reliable information.

We now present a report on the financial resources spent on acquiring and processing noncirculated titles. To generate this report, we need the output of queries four and five listed earlier in this article. The corresponding outputs are shown below.

Query 4: Number and percentage of DLC/DLC cataloging records in circulated and noncirculated titles.


Result: Total DLC/DLC records   2852
        Circulated              2179   76.40%
        Noncirculated            673   23.60%

Query 5: Number and percentage of "shared" cataloging records in circulated and noncirculated titles.

Result: Total "shared" records   149
        Circulated               100   67.11%
        Noncirculated             49   32.89%

To come up with the financial figures, we need to consider several factors that contribute to the amount of financial resources spent. For the sake of simplicity, we consider only the following factors:

1. the cost of cataloging each item with a DLC/DLC record
2. the cost of cataloging each item with a shared record
3. the average price of noncirculated books
4. the average pages of noncirculated books
5. the value of shelf space per centimeter

Because the values of the above factors differ from institution to institution and might change with more efficient workflows and better equipment, users are required to fill in the values for factors 1, 2, and 5; LDSS can compute factors 3 and 4. The financial report, taking into consideration the values of the above factors, could be as shown below.

Processing cost of each DLC title = $10.00
673 × $10.00 = $6,730.00
Processing cost of each shared title = $20.00
49 × $20.00 = $980.00
Average price paid per noncirculated item = $48.00
722 × $48.00 = $34,656.00
Average size of book = 288 pages = 3 cm
Average cost of 1 cm of shelf space = $0.10
722 × $0.30 = $216.60
Grand total = $42,582.60

Approval Titles Circulated

SQL query to retrieve the distinct bibliographic keys of all the approval titles:

SELECT DISTINCT BibScreen.Bib_key
FROM BibScreen RIGHT JOIN pay1 ON BibScreen.Bib_key = pay1.BIB_NUM
WHERE (((pay1.FUND_KEY) Like "*07*"));

SQL query to count the number of approval titles that have been circulated:

SELECT Count(Appr_Title.Bib_key) AS CountOfBib_key
FROM (BibScreen INNER JOIN Appr_Title ON BibScreen.Bib_key = Appr_Title.Bib_key)
    INNER JOIN ItemScreen ON BibScreen.Bib_key = ItemScreen.Biblio_key
WHERE (((ItemScreen.CHARGES)>0))
ORDER BY Count(Appr_Title.Bib_key);

SQL query to calculate the percentage:

SELECT Cnt_Appr_Title_Circ.CountOfBib_key,
    Int(([Cnt_Appr_Title_Circ]![CountOfBib_key])*100/Count([BibScreen]![Bib_key])) AS Percent_apprcirc
FROM BibScreen, Cnt_Appr_Title_Circ
GROUP BY Cnt_Appr_Title_Circ.CountOfBib_key;

Approval Titles Noncirculated

SQL query to count the number of approval titles that have not been circulated:

SELECT DISTINCT Count(Appr_Title.Bib_key) AS CountOfBib_key
FROM (Appr_Title INNER JOIN BibScreen ON Appr_Title.Bib_key = BibScreen.Bib_key)
    INNER JOIN ItemScreen ON BibScreen.Bib_key = ItemScreen.Biblio_key
WHERE (((ItemScreen.CHARGES)=0));

SQL query to calculate the percentage:

SELECT Cnt_Appr_Title_Noncirc.CountOfBib_key,
    Int(([Cnt_Appr_Title_Noncirc]![CountOfBib_key])*100/Count([BibScreen]![Bib_key])) AS Percent_appr_noncirc
FROM BibScreen, Cnt_Appr_Title_Noncirc
GROUP BY Cnt_Appr_Title_Noncirc.CountOfBib_key;

Figure 4. Example of an English Query Divided into Two Parts

Again it is important to point out that several snapshots of the circulation data have to be taken to track and compare the different analyses before deriving reliable information.
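To make the report's arithmetic easy to reproduce, the short Python sketch below recomputes the grand total from the five factors; the only derived quantity is the per-title shelf-space cost, the $0.10-per-centimeter factor times the 3 cm average book thickness.

# Recompute the financial report from the factors the user supplies
# (factors 1, 2, and 5) and the figures LDSS derives (factors 3 and 4).
dlc_noncirc, shared_noncirc = 673, 49         # from queries 4 and 5
noncirc = dlc_noncirc + shared_noncirc        # 722 noncirculated titles

dlc_cost, shared_cost = 10.00, 20.00          # cataloging cost per title
avg_price = 48.00                             # avg price, noncirculated
shelf_cm_cost, avg_cm = 0.10, 3               # $/cm and avg book thickness

total = (dlc_noncirc * dlc_cost               # $6,730.00
         + shared_noncirc * shared_cost       # $980.00
         + noncirc * avg_price                # $34,656.00
         + noncirc * shelf_cm_cost * avg_cm)  # $216.60
print(f"Grand total = ${total:,.2f}")         # $42,582.60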

Ad Hoc Queries

Alternatively, if the user wishes to issue a query that has not been predefined, the Ad Hoc Query Wizard can be used. The following example illustrates its use. Assume the sample query is: How many circulated titles in the English subject area cost more than $35?

We now take you on a walk-through of the Ad Hoc Query Wizard, starting from the first step until the output is obtained.

Figure 4 depicts Step 1 of the Ad Hoc Query Wizard. The sample query mentioned above requires the following fields:

• Biblio_key, for a count of all the titles that satisfy the given condition;
• Charges, to specify the criterion of "circulated title";
• Fund_Key, to specify all titles under the "English" subject area;
• Paid_Amt, to specify all titles that cost more than $35.

Step 2 of the Ad Hoc Query Wizard (figure 5) allows the user to specify criteria and thereby narrow the search domain. Step 3 (figure 6) allows the user to specify any mathematical operations or aggregation functions to be performed. Step 4 (figure 7) displays the user-defined query in SQL form and allows the user to save the query for future reuse. The output of the query is shown in figure 8: the number of circulated titles in the English subject area that cost more than $35. Alternatively, the user might wish to obtain a listing of these 33 titles; figure 9 shows the listing.
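For comparison with the wizard-generated SQL of figure 7, the walk-through query corresponds roughly to the following statement, shown here executed through Python. The table and column names follow the article's earlier SQL examples; the fund-key pattern used to pick out the English subject area is an illustrative assumption, and through ODBC the ANSI "%" wildcard replaces the "*" wildcard the Access interface uses.

# Sketch: "How many circulated titles in the English subject area
# cost more than $35?" -- counting distinct titles meeting all criteria.
import pyodbc

conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\ldss\warehouse.mdb"  # hypothetical path
)
SQL = """
    SELECT COUNT(*) FROM (
        SELECT DISTINCT i.Biblio_key
        FROM ItemScreen AS i INNER JOIN pay1 AS p
            ON i.Biblio_key = p.BIB_NUM
        WHERE i.CHARGES > 0 AND p.PAID_AMT > 35
          AND p.FUND_KEY LIKE '%ENG%'
    ) AS t
"""  # '%ENG%' is a hypothetical English fund-key pattern
count = conn.cursor().execute(SQL).fetchval()
print(count, "circulated English titles cost more than $35")
conn.close()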

Figure 4. Step 1: Ad Hoc Query Wizard

Figure 5. Step 2: Ad Hoc Query Wizard [the criteria panel lets the user pick an operator for a chosen field, e.g., Like, Less than, Greater than, Equal to, or Not, together with a comparison value such as Charges > 0]

Figure 6. Step 3: Ad Hoc Query Wizard

Figure 7. Step 4: Ad Hoc Query Wizard

Figure 8. Query Output

Figure 9. Listing of Query Output

Conclusion

In this article, we presented the design and development of a library decision support system based on data warehousing and data mining concepts and techniques. We described the functions of the components of LDSS. The screen scraping and data cleansing and extraction


processes were described in detail. The process of importing data stored in LUIS as separate data files into the library data warehouse was also described. The data contents of the warehouse can provide a very rich information source to aid the library management in decision making. Using the implemented system, a decision maker can use the GUI to establish the warehouse and to activate the querying facility provided by Microsoft Access to explore the warehouse contents. Many types of queries can be formulated and issued against the database. Experimental results indicate that the system is effective and can provide pertinent information for aiding the library management in making decisions. We have fully tested the implemented system using a small sample database. Our ongoing work includes the expansion of the database size and the inclusion of a data mining component for association rule discovery. Extensions of the existing GUI and report generation facilities to accommodate data mining needs are expected.

Acknowledgments
We would like to thank Professor Stanley Su for his support and advice on the technical aspects of this project. We would also like to thank Donna Alsbury for providing us with the DB2 data, and Daniel Cromwell for loading the DB2 files and, along with Nancy Williams and Tim Hartigan, for helpful comments and valuable discussions on this project.

References and Notes 

1. John Ladley , "Operational Data Stores: Building an 
Effective Strategy, " Data Warehouse: Practical Advice from the 
Experts (Englewood Cliffs, N.J.: Prentice Hall , 1997). 

2. Information on Har vard University's ADAPT proj ect. 
Accessed March 8, 2000, www.adapt.harvard .edu/; Information 
on the Arizona State University Data Administration and 
Institutional Analysis warehou se. Accessed March 8, 2000, 
www .asu .edu / Data_Admin / WH-1.html; Information on the 
University of Minnesota CLARITY project. Accessed March 8, 

2000,www.clarity.umn .edu/; Information on the UC San Diego 
DARWIN project. Accessed March 8, 2000, www.act .ucsd .edu/ 
dw I darwin.html; Information on University of Wisconsin-
Madison InfoAccess . Accessed March 8, 2000, http :/ / wiscinfo. 
doit.wisc .edu/infoac cess /; Information on the Univer sity of 
Nebraska Data Warehouse-nulook. Accessed March 8, 2000, 
www .nulook.uneb.edu /. 

3. Ramon Barquin and Herbert Edelstein, eds ., Building, 
Using, and Managing the Data Warehouse (Englewood Cliffs, N .J.: 
Prentice Hall , 1997); Ramon Barquin and Herbert Edelstein, 
eds ., Planning and Designing the Data Warehouse (Upper Saddle 
River, N.J .: Prentice Hall, 1996); Joyce Bischoff and Ted 
Alexander, Data Warehouse: Practical Advice from the Experts 
(Englewood Cliffs, N.J.: Prentice Hall , 1997); Jeff Byard and 
Donovan Schneider, "The Ins and Outs (and Everything in 
Between) of Data War ehousing ," ACM SIGMOD 1996 Tutorial 
Notes, May 1996. Accessed March 8, 2000, www .redbrick.com / 
product s/ white / pdf/sigmod96.pdf ; Surajit Chaudhuri and 
Umesh Dayal, "An Overview of Data Warehousing and OLAP 
Technolog ," ACM SIGMOD Record 26(1), March 1997. Accessed 
March 8, 2000, www.acm.org/sigmod / record/issue s/ 9703/ 
chaudhuri .ps ; B. Devlin , Data Warehouse: From Architecture to 
Implementation (Reading, Mass.: Addison-Wesle y, 1997); U. 
Fayyad and others, eds ., Advances in Knowledge Discovery and 
Data Mining (Cambridge, Mass.: The MIT Pr., 1996); Joachim 
Hammer, "Data War ehousing Overview, Terminology, and 
Research Issues." Accessed March 8, 2000, www.cise.ufl .edu/ 
-jhammer / classes / wh-seminar / Overview / index .htm ; W. H. 
Inmon, Building the Data Warehouse (New York, N.Y.: John Wiley, 
1996); Ralph Kimball , "Dangerous Preconceptions." Accessed 
March 8, 2000, www .dbmsmag.com/9608d05.html ; Ralph 
Kimball , The Data Warehouse Toolkit (New York, N.Y.: John Wiley, 
1996); Ralph Kimball, "Mastering Data Extraction," in DBMS 
Magazine, June 1996. (Provides an overview of the process of 
extracting , cleaning, and loading data .) Accessed March 8, 2000, 
www .dbmsmag.com / 9606d05 .html ; Alberto Mendelzon , 
"Bibliography on Data Warehousing and OLAP." Accessed 
March 8, 2000, www.cs.toronto.edu/-mendel/dwbib.html. 

4. Daniel J. Boorstin, "The Age of Negative Discovery," 
Cleopatra's Nose: Essays on the Unexpected (New York: Random 
Hous e, 1994). 

5. Information on the ARROW system . Accessed March 8, 
2000,www . fcla.edu /s ystem/intro_arrow.html. 

6. Gary Strawn, "BatchBAMing." Accessed March 8, 2000, 
http:/ /web .uflib.ufl .edu/rs/rsd/batchbam .html. 

7. Li-Min Fu, "OOMRUL: Leaming the Domain Rules ." 
Accessed March 8, 2000, www .cise.ufl .edu / -fu / domrul.html. 


Appendix A Warehouse Data Tables 

UFCIRC_WH

Attribute     Domain
Bib_key       Text(50)
Status        Text(20)
Enum/Chron    Text(20)
MidSpine      Text(20)
Temp_Locatn   Text(20)
Pieces        Number
Charges       Number
Last_Use      Date/Time
Browses       Number
Value         Text(20)
Invnt_Date    Date/Time
Created       Date/Time

UFORD_WH

Attribute       Domain
Id              AutoNumber
Ord_num         Text(20)
Ord_Div         Number
Process_Unit    Text(20)
Bib_num         Text(20)
Order_date      Date/Time
Mod_Date        Date/Time
Vendor_Code     Text(20)
VndAdr_Order    Text(20)
VndAdr_Claim    Text(20)
VndAdr_Return   Text(20)
Vend_Title_Num  Text(20)
Ord_Unit        Text(20)
Rcv_Unit        Text(20)
Ord_Scope       Text(20)
Pur_Ord_prod    Text(20)
Action_Int      Number
LibSpec1        Text(20)
LibSpec2        Text(20)
Vend_Note       Text(20)
Ord_Note        Text(20)
Source          Text(20)
Ref             Text(20)
CopyCtl_Num     Number
Medium          Text(20)
Piece_Cnt       Number
Div_Note        Text(20)
Acr_Stat        Text(20)
Rel_Stat        Text(20)
Lst_Date        Date/Time
Action_Date     Text(20)
LibSpec3        Text(20)
LibSpec4        Text(20)
Encumb_Units    Number
Currency        Text(20)
Est_Price       Number
Encumb_outs     Number
Fund_key        Text(20)
Fiscal_Year     Text(20)
Copies          Number
Xpay_Method     Text(20)
Vol_Isu_Date    Text(20)
Title_Author    Text(20)
DB2_TimeStamp   Date/Time

UFPAY_WH

Attribute      Domain
Inv_key        Text(20)
Ord_num        Text(20)
Ord_div        Number
Process_Unit   Text(20)
Bib_key        Text(20)
Ord_Seq_Num    Number
Inv_Seq_Num    Number
Status         Text(20)
Create_Date    Date/Time
Lst_update     Date/Time
Currency       Text(20)
Paid_amt       Number
USD_amt        Number
Fund_Key       Text(20)
Exp_class      Text(20)
Fiscal_year    Text(20)
Copies         Number
Type_pay       Text(10)
Text           Text(20)
DB2_TimeStamp  Date/Time

UFINV_WH

Attribute           Domain
Inv_Key             Text(20)
Create_Date         Date/Time
Mod_Date            Date/Time
Approv_Stat         Text(20)
Vend_Adr_Code       Text(20)
Vend_Code           Text(20)
Action_Date         Text(20)
Vend_Inv_Date       Date/Time
Approval_Date       Date/Time
Approver_Id         Text(20)
Vend_Inv_Num        Text(20)
Inv_Tot             Number
Calc_Tot_Pymts      Number
Calc_Net_Tot_Pymts  Number
Currency            Text(20)
Discount_Percent    Number
Vouch_Note          Text(20)
Official_Vend       Text(20)
Process_Unit        Text(20)
Internal_Note       Text(20)
DB2_TimeStamp       Text(20)

UFBIB_WH

Attribute           Domain
Bib_key             Text(20)
System_Control_Num  Text(50)
Catalog_Source      Text(20)
Lang_Code_1         Text(20)
Lang_Code_2         Text(20)
Geo_Code            Text(20)
Dewey_Num           Text(20)
Edition             Text(20)
Pagination          Text(20)
Size                Text(20)
Series_440          Text(20)
Series_490          Text(20)
Content             Text(20)
Subject_1           Text(20)
Subject_2           Text(20)
Subject_3           Text(20)
Authors_1           Text(20)
Authors_2           Text(20)
Authors_3           Text(20)
Series              Text(20)
