
Harvesting Information from a Library Data Warehouse

Siew-Phek T. Su and Ashwin Needamangala

Data warehousing technology has been defined by John Ladley as "a set of methods, techniques, and tools that are leveraged together and used to produce a vehicle that delivers data to end users on an integrated platform."1 This concept has been applied increasingly by industries worldwide to develop data warehouses for decision support and knowledge discovery. In the academic sector, several universities have developed data warehouses containing the universities' financial, payroll, personnel, budget, and student data.2 These data warehouses across all industries and academia have met with varying degrees of success. Data warehousing technology and its related issues have been widely discussed and published.3 Little has been done, however, on the application of this cutting-edge technology in the library environment using library data.

Motivation of Project
Daniel Boorstin, the former Librarian of Congress, mentions that "for most of Western history, interpretation has far outrun data."4 However, he points out "that modern tendency is quite the contrary, as we see data outrun meaning." His insights tie directly to many large organizations that have long been rich in data but poor in information and knowledge. Library managers increasingly find it important to obtain a comprehensive and integrated view of library operations and the services the library provides. This view is helpful for making decisions on current operations and for their improvement. Due to financial and human constraints on library support, library managers increasingly encounter the need to justify everything they do, for example, the library's operating budget. The most frustrating problem they face is knowing that the information needed is available somewhere in the ocean of data but that there is no easy way to obtain it. For example, it is not easy to ascertain whether the materials of a certain subject area, which consumed considerable financial resources for their acquisition and processing, are frequently used (i.e., have a high rate of circulation), seldom used, or not used at all, or whether they satisfy users' needs. As another example, an analysis of the methods of acquisition (firm order vs. approval plan) together with the circulation rate could be used as a factor in deciding the best method of acquiring certain types of material. Such information can play a pivotal role in performing collection development and library management more efficiently and effectively. Unfortunately, the data needed to make these types of decisions are often scattered in different files maintained by a large centralized system, such as NOTIS, that does not provide a general querying facility, or by different file/data management or application systems. This situation makes it very difficult and time-consuming to extract useful information. This is precisely where data warehousing technology comes in.

The goal of this research and development work is to apply data warehousing and data mining technologies in the development of a Library Decision Support System (LDSS) to aid the library management's decision making. The first phase of this work is to establish a data warehouse by importing selected data from separately maintained files presently used in the George A. Smathers Libraries of the University of Florida into a relational database system (Microsoft Access). Data stored in the existing files were extracted, cleansed, aggregated, and transformed into the relational representation suitable for processing by the relational database management system. A graphical user interface (GUI) was developed to allow decision makers to query the data warehouse's contents using either predefined queries or ad hoc queries. The second phase is to apply data mining techniques to the library data warehouse for knowledge discovery. This paper covers the first phase of this research and development work. Our goal is to develop a general methodology and inexpensive software tools that can be used by different functional units of a library to import data from different data sources and to tailor different warehouses to meet their local decision needs. To meet this objective, we do not have to use a very large centralized database management system to establish a single very large data warehouse to support different uses.

Local Environment
The University of Florida Libraries has a collection of more than two million titles, comprising over three million volumes. It shares a NOTIS-based integrated system with nine other State University System (SUS) libraries for acquiring, processing, circulating, and accessing its collection. All ten SUS libraries are under the consortium umbrella of the Florida Center for Library Automation (FCLA).

Siew-Phek T. Su (pheksu@mail.uflib.ufl.edu) is Associate Chair of the Central Bibliographic Services Section, Resource Services Department, University of Florida Libraries, and Ashwin Needamangala (nsashwin@grove.ufl.edu) is a graduate student at the Electrical and Computer Engineering Department, University of Florida.


Library Data Sources
The University of Florida Libraries' online database, LUIS, stores a wealth of data, such as bibliographic data (author, title, subject, publisher information), acquisitions data (price, order information, fund assignment), circulation data (charge-out and browse information, withdrawn and inventory information), and owning location data (where an item is shelved). These voluminous data are stored in separate files. The NOTIS system as used by the University of Florida does not provide a general querying facility for accessing data across different files. Extracting any information needed by a decision maker has to be done by writing an application program to access and manipulate these files. This is a tedious task, since many application programs would have to be written to meet the different information needs. The challenge of this project is to develop a general methodology and tools for extracting useful data and metadata from these disjointed files, and to bring them into a warehouse that is maintained by a database management system such as Microsoft Access. The selection of Access and PC hardware for this project is motivated by cost considerations. We envision that multiple special-purpose warehouses will be established on multiple PC systems to provide decision support to different library units.

The Library Decision Support System (LDSS) is developed with the capability of handling and analyzing an established data warehouse. To test our methodology and software system, we established a warehouse based on twenty thousand monograph titles acquired from our major monograph vendor. These titles were published by domestic U.S. publishers and have a high percentage of DLC/DLC records (titles cataloged by the Library of Congress). They were acquired by firm order and approval plan. The publication coverage is the calendar years 1996-1997. Analysis covers only the first item record (a future project will include all copy holdings). Although the size of the test data used is small, it is sufficient to test our general methodology and the functionality of our software system.

FCLA DB2 Tables and Key List

Most of the data from the twenty-thousand-title domain that go into the LDSS warehouse are obtained from the DB2 tables maintained by FCLA. FCLA developed and maintains the database of a system called Ad Hoc Report Request Over the Web (ARROW) to facilitate querying and generating reports on acquisitions activities. The data are stored in DB2 tables.5

For our research and development purpose, we needed DB2 tables for only the twenty thousand titles that we identified as our initial project domain. These titles all have an identifiable 035 field in the bibliographic records (zybp1996, zybcip1996, zybp1997, or zybpcip1997). We used the BatchBAM program developed by Gary Strawn of Northwestern University Library to extract and list the unique bibliographic record numbers in separate files for FCLA to pick up.6 Using the unique bibliographic record numbers, FCLA extracted the DB2 tables from the ARROW database and exported the data to text files. These text files then were transferred to our system using the file transfer protocol (FTP) and inserted as tables into the LDSS warehouse.
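The transfer-and-load step is mechanical enough to script. The following sketch shows, in Python rather than the Visual Basic used elsewhere in the project, one way such a load could be done. It assumes a tab-delimited export file, an existing UFPAY_WH table as in appendix A, and the Microsoft Access ODBC driver; the file name, database path, and column list are illustrative only.

# Sketch: load an FCLA DB2 text export into an Access warehouse table.
# Assumptions (not from the article): tab-delimited export, an existing
# UFPAY_WH table as in appendix A, and the Access ODBC driver installed.
import csv
import pyodbc

conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\ldss\warehouse.mdb"  # hypothetical warehouse path
)
cur = conn.cursor()

with open("ufpay_export.txt", newline="") as f:  # hypothetical file name
    for row in csv.reader(f, delimiter="\t"):
        # Insert a few illustrative columns; the real table has more.
        cur.execute(
            "INSERT INTO UFPAY_WH (Inv_key, Ord_num, Bib_key, Paid_amt) "
            "VALUES (?, ?, ?, ?)",
            row[0], row[1], row[2], float(row[3]),
        )

conn.commit()
conn.close()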

Bibliographic and Item Records Extraction 

FCLA collects and stores complete acquisitions data from the order records as DB2 tables. However, only brief bibliographic data and no item record data are available. Bibliographic and item record data are essential for inclusion in the LDSS warehouse in order to create a viable integrated system capable of performing cross-file analysis and querying for the relationships among different types of data. Because these required data do not exist in any computer-readable form, we designed a method to obtain them. Using the same NOTIS key lists to identify the targeted twenty thousand bibliographic and item records, we applied a screen scraping technique to scrape the data from the screen and saved them in a flat file. We then wrote a program in Microsoft Visual Basic to clean the scraped data and save them as text-delimited files suitable for importing into the LDSS warehouse.

Screen Scraping Concept 

Screen scraping is a process used to capture data from a host application. It is conventionally a three-part process:

• displaying the host screen or data to be scraped;
• finding the data to be captured;
• capturing the data to a PC or host file, or using it in another Windows application.

In other words, we can capture particular data on the screen by providing the corresponding screen coordinates to the screen scraping program. Numerous commercial applications for screen scraping are available on the market. However, we used an approach slightly different from the conventional one. Although we had to capture only certain fields from the NOTIS screen, there were other factors that we had to take into consideration:

• The location of the various fields with respect to the screen coordinates changes from record to record. This makes it impossible for us to lock a particular field to a corresponding screen coordinate.


• The data present on the screen are dynamic because we are working on a "live" database where data are frequently modified. For accurate query results, all the data, especially the item record data where the circulation transactions are housed, need to be captured within a specified time interval so that the data are uniform. This makes the time taken for capturing the data extremely important.

• Most of the fields present on the screen needed to be captured.

Taking the above factors into consideration, we decided to capture the entire screen instead of scraping only certain parts of it. This made the process both simpler and faster. The unnecessary fields were filtered out during the cleanup process.

System Architecture
The architecture of the LDSS system is shown in figure 1 and is followed by a discussion of its components' functions.

NOTIS 

NOTIS (Northwestern Online Totally Integrated System) was developed at the Northwestern University Library and introduced in 1970. Since its inception, NOTIS has undergone many versions. The University of Florida Libraries is one of the earliest users of NOTIS, and FCLA has made many local modifications to the NOTIS system since the UF Libraries started using it. As a result, UF NOTIS differs from the rest of the NOTIS world in many respects. NOTIS can be broken down into four subsystems:

• acquisitions
• cataloging
• circulation
• online public access catalog (OPAC)

At the University of Florida Libraries, the NOTIS system runs on an IBM 370 mainframe computer under the OS/390 operating system.

Host Explorer 

Host Explorer is a software program that provides a TCP/IP link to the mainframe computer. It is a terminal emulation program supporting IBM mainframe, AS/400, and VAX hosts. Host Explorer delivers an enhanced user environment for all Windows NT platforms, Windows 95, and Windows 3.x desktops. Exact TN3270E, TN5250, VT420/320/220/101/100/52, WYSE 50/60, and ANSI-BBS display support is extended to leverage the wealth of the Windows desktop. It also supports all TCP/IP-based TN3270 and TN3270E gateways.

Figure 1. LDSS Architecture and Its Components [NOTIS and the FCLA DB2 tables feed the LDSS, whose components are Host Explorer, data cleansing and extraction, the warehouse, and the graphical user interface]

The Host Explorer program is used as the terminal emulation program in LDSS. It also provides VBA-compatible BASIC scripting tools for complete desktop macro development. Users can run these macros directly or attach them to keyboard keys, toolbar buttons, and screen hotspots for additional productivity.

The function of Host Explorer in the LDSS is very simple. It has to "visit" all screens in the NOTIS system corresponding to each NOTIS number present in the BatchBAM file and capture all the data on the screens. To do this, we wrote a macro that read the NOTIS numbers one at a time from the BatchBAM file and fed each number into the command string of Host Explorer. The macro essentially performed the following functions:

• read the NOTIS numbers from the BatchBAM file;
• inserted each NOTIS number into the command string of Host Explorer;
• toggled the Screen Capture option in Host Explorer so that data are scraped from the screen only at necessary times;
• saved all the scraped data into a flat file.

After the macro has been executed, all the data scraped from the NOTIS screens reside in a flat file. The data in this file have to be cleansed in order to make them suitable for insertion into the Library Warehouse. A Visual Basic program was written to perform this function; the details of this program are given in the next section.
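The macro itself is written in Host Explorer's BASIC scripting language; the following Python fragment only mirrors its control flow. The fetch_screen function is a hypothetical stand-in for the emulator interaction (inserting the NOTIS number into the command string, waiting for the record screen, and reading the screen buffer).

# Sketch of the capture loop: for each NOTIS number in the BatchBAM list,
# display the corresponding record screen and append the raw screen text
# to a single flat file. fetch_screen() is a hypothetical stand-in for
# the Host Explorer interaction.
def fetch_screen(notis_number: str) -> str:
    raise NotImplementedError("terminal emulator interaction goes here")

with open("batchbam.txt") as keys, open("scraped.txt", "w") as out:
    for line in keys:
        notis_number = line.strip()
        if notis_number:
            out.write(fetch_screen(notis_number))
            out.write("\n")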

Data Cleansing and Extraction
This component of the LDSS is written in the Visual Basic programming language. Its main function is to cleanse the data that have been scraped from the NOTIS screens. The Visual Basic code saves the cleansed data in a text-delimited format that is recognized by Microsoft Access. This file is then imported into the Library Warehouse maintained by Microsoft Access. The detailed working of the code that performs the cleansing operation is discussed below.

The NOTIS screen that comes up for each NOTIS number has several parts that are critical to the working of the code. They are:

• the NOTIS number present in the top right of the screen (in this case, AKR9234);

• the field numbers that have to be extracted, for example, 010:: and 035::;

• the delimiters. The "|" symbol is used as the delimiter throughout this code. For example, in the 260 field of a bibliographic record, "|a" delimits the place of publication, "|b" the name of the publisher, and "|c" the date of publication.

We shall now go step by step through the cleansing process. Initially we have the flat file containing all the data that have been scraped from the NOTIS screens.

• The entire list of NOTIS numbers from the BatchBAM file is read into an array called Bam_Number$.

• The file containing the scraped data is read into a single string called BibRecord$.

• This string is then parsed using the NOTIS numbers from the Bam_Number$ array, yielding a string, Single_Record$, that contains a single NOTIS record.

• The program loops until all the records have been read.

• Each record string is broken down into several smaller strings based on the field numbers. Each of these smaller strings contains the data pertaining to the corresponding field number.

• A considerable amount of the data present on the NOTIS screen is unnecessary from the point of view of our project. We need only certain fields from the NOTIS screen, and even from these fields we need the data only from certain delimiters. Therefore, each of the smaller strings is scanned for a set of delimiters predefined for each individual field, and the data present in the other delimiters are discarded.

• The data collected from the various fields and their corresponding delimiters are assigned to corresponding variables. Some variables contain data from more than one delimiter concatenated together, for the following reason. Certain fields are present in the database only for informational purposes and will never be used as a criterion field in any query. Since these fields will never be queried upon, they do not need to be cleansed as rigorously as the other fields, and we can afford to leave their data as concatenated strings. For example, the Catalog_source field, which has data from "|a" and "|c", is of the form "|a DLC |c DLC", while the Lang_code field, which has data from "|a" and "|h", is of the form "|a eng |h rus". The latter we split into two fields: Lang_code_1 containing "eng" and Lang_code_2 containing "rus".

• The data collected from the various fields are saved in a flat file in the text-delimited format, which Microsoft Access recognizes.

A screen dump of the text-delimited file, which is the end result of the cleansing operation, is shown in figure 2. The flat file we now have can be imported into the Library Warehouse.
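To make the parsing logic concrete (the production code is in Visual Basic), here is a minimal Python sketch. Only the "|" subfield convention, the field-number markers such as 010:: and 035::, and the treatment of fields such as Catalog_source and Lang_code come from the article; the exact field tags kept, the record-separator pattern, and the output column order are simplified assumptions.

# Sketch of the cleansing pass: split the scraped flat file into records
# by NOTIS number, keep only selected "|" subfields of selected fields,
# and write a text-delimited (CSV) file that Access can import.
import csv
import re

# Subfields to keep per field tag -- an illustrative mapping only.
KEEP = {"040": ["a", "c"], "041": ["a", "h"]}

def subfields(field_text):
    """Split '|a eng |h rus' into {'a': 'eng', 'h': 'rus'}."""
    parts = re.split(r"\|(\w)", field_text)
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}

def cleanse(record):
    row = {}
    for line in record.splitlines():
        m = re.match(r"(\d{3})::\s*(.*)", line)  # e.g. "041:: |a eng |h rus"
        if m and m.group(1) in KEEP:
            subs = subfields(m.group(2))
            for code in KEEP[m.group(1)]:
                row[m.group(1) + code] = subs.get(code, "Null")
    return [row.get(k, "Null") for k in ("040a", "040c", "041a", "041h")]

with open("scraped.txt") as f, open("cleansed.txt", "w", newline="") as out:
    writer = csv.writer(out)  # text-delimited output for Access
    # Records are assumed to start with a NOTIS number line, e.g. "AKR9234".
    for rec in re.split(r"(?=^[A-Z]{3}\d{4}\s*$)", f.read(), flags=re.M):
        if rec.strip():
            writer.writerow(cleanse(rec))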

Graphical User Interface
To ease the tasks of the user (i.e., the decision maker) in creating the library warehouse and in querying and analyzing its contents, a graphical user interface tool has been developed. Through the GUI, the user can enact the following processes or operations through a main menu:

• connection to NOTIS 
• screen scraping 
• data cleansing and extracting 
• importing data 
• viewing collected data 
• querying 
• report generating 

The first option opens Host Explorer and provides a connection to NOTIS. It provides a shortcut for closing or minimizing LDSS and opening Host Explorer. The Screen Scraping option activates the data scraping process. The Data Cleansing and Extracting option filters out the unnecessary data fields and saves the cleansed data in a text-delimited format. The Importing Data option imports the text-delimited data into the warehouse. The Viewing Collected Data option allows the user to view the contents of a selected relational table stored in the warehouse.


"RECORD HUMBER","System Control Humber","Catalogin Source","Language Codes 1","Language Code~ 
"AKR9234", "YBP1996 0507--CLARR done", "a DLC I c DLC ", "1 : I a eng "," I h rus", "e-ur-ru", "306/. 0~ 
"RKS6472", "YBP1996 0507--CLRRR done"," a DLC I c DLC ", "1 : I a eng "," I h rus", "Hull", "891. 73/ 44 
"AKS6493", "YBP1996 0507--CLARR done"," a DLC I c DLC ","Hull", "Hull", "Hull"," 001. 4/225/ 028563 I ~f 
"AJX7554", "YBP1996 05 08--CLARR done"," a Uk I c Uk ","Hull", "Hull", "e-uk---", "362. 1 / 068 12 2 O",' 
"AKB3478", "YBP1996 05 08--CLARR done"," a DLC c DLC ","Hull", "Hull", "e-fr---", "843/. 7 12 2 O", "t " 
"AKC6442","YBP19960508--CLARR done","a DLC c DLC ","1 : la eng ","lh ger","e-fr---","194 12 
"AKE9837", "YBP1996 0508--CLARR done"," a DLC c DLC ","Hull", "Hull", "e-gr---", "883/. 01 12 20",' 
"AKK9486", "YBP1996 0508--CLARR done", "a DLC c DLC ","Hull", "Hull", "e-uk---", "822/. 052309 12 ~% 
l'AKL2258", "YBP1996 05 08--CLARR done"," a DLC c DLC ","Hull", "Hull", "e-xr---", "929. 4/2/ 08992401 1• 
"AKM2455", "YBP1996 05 08--CLARR done"," a DLC c DLC ","Hull", "Hull", "e-gx---", "943. 086 12 2 O",' 
"AKM4649", "YBP1996 0508--CLARR done"," a DLC c DLC ","Hull", "Hull", "Hull", "863/ .64 I 2 20", "Hu] ' 
"AKH0246","YBP19960508--CLARR done","a DLC c DLC ","Hull","Hull","n-us--- la e-uk-en","700/. 
"AKH181 O", "YBP1996 05 08--CLARR done"," a DLC c DLC ","Hull" ,"Hull", "e-uk---", "305. 6/2 042/ 0903.: 
"AKH3749","YBP19960508--CLARR done","a DLC c DLC ","Hull","Hull","f-ke--- la f-so - --","327.{ 
"AKQ727 4", "YBP1996 05 08--CLARR done"," a DLC c DLC ","Hull", "Hull", "Hull", "355. 4/2 12 2 O", "Hu] 
"AKQ9180", "Y.BP1996 0508--CLARR done", "a DLC c DLC ","Hull", "Hull", "n-us---", "23 0/. 93/ 09 12 2,f 
"AKR 0424", "YBP1996 05 08--CLARR done"," a DLC c DLC ","Hull", "Hull", "n-us-mi", "331 . 88/1292/ 097' 
"RKR1411", "YBP1996 05 08--CLARR done"," a CL I c CL ","Hull", "Hull", "n-us---", "3 05. 896/ 073 12 2 O' 
"AKR1846", "YBP1996 05 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "e-uk-ni", "Hull", "Hull", "x, 
"AKR2169", "YBP1996Jt5 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "n-us-sc", "323. 1/196073/ 091 
"AKR2245" ,"YBP19960508--C .LARR d.one" ," a DLC I c DLC ","Hull", "Hull", "Hull", "306 .4/6 I 2 20", "Hu1 
"AKR2255", "YBP1996 05 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "Hull", "3 03. 48/2 12 2 O", "2r 
"AKR226 O", "YBP1996 0508--CLARR done"," a DLC I c DLC ","Hull", "Hull", "n-us- - -", "3 03. 48/2 12 2 O", 
"AKR2281", "YBP1996 05 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "t------ I a r------", "333. , · 
"AKR2287", "YBP1996 05 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "Hull", "57 4. 5/262 12 2 O", "t 
"RKR2357", "YBP1996 05 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "e------", "361 . 6/1 / 094 12 l 
"AKR2358", "YBP1996 0508--CLARR done"," a DLC I c DLC ","Hull", "Hull" ,"Hull", "333. 7/2/01 12 20" ,' ¥' 
"AKR2371", "YBP1996 05 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "e------", "3 07. 72/ 094 12 211 
"AKR2386", "YBP1996 05 08--CLARR done", "DLC I c DLCI", "Hull" ,/'Hull", "e-uk---", "Hull", "Hull", "xu, 
"RKR25 03", "YBP1996 05 08--CLARR done"," a DLC I c DLC ","Hull", "Hull", "Hull", "575. 1 / 09 12 2 O", "HL 

'i-r---· -----
-----·-----

Figure 2. A Text-Delimited File 

The Querying option activates LDSS's querying facility, which provides wizards to guide the formulation of different types of queries, as discussed later in this article. The last option, Report Generating, is for the user to specify the report to be generated.

Data Mining Tool
A very important component of LDSS is the data mining tool for discovering association rules that specify the interrelationships of data stored in the warehouse. Many data mining tools are now available in the commercial world. For our project, we are investigating the use of a neural-network-based data mining tool developed by Li-Min Fu of the University of Florida.7 The tool allows the discovery of association rules based on a set of training data provided to the tool. This part of our research and development work is still in progress. The existing GUI and report generation facilities will be expanded to include the use of this mining tool.

Library Warehouse
FCLA exports the data existing in the DB2 tables into text files. As a first step toward creating the database, these text files are transferred using FTP and form separate relational tables in the Library Warehouse. The data that are scraped from the bibliographic and item record screens result in the formation of two more tables.

Characteristics 

Data in the warehouse are snapshots of the original data files. Only a subset of the data contents in these files is extracted for querying and analysis, since not all the data are useful for a particular decision-making situation. Data are filtered as they pass from the operational environment to the data warehouse environment. This filtering process is particularly necessary when a PC system, which has limited secondary storage and main memory space, is used. Once extracted and stored in the warehouse, data are not updateable; they form a read-only database. However, different snapshots of the original files can be imported into the warehouse for querying and analysis, and the results of the analyses of different snapshots can then be compared.

Structure 

Data warehouses have a distinct structure. There are 
summarization and detail structures that demarcate a 
data warehouse. The structure of the Library Data 
Warehouse is shown in figure 3. 

The different components of the Library Data 
Warehouse as shown in figure 3 are: 

• NOTIS and DB2 Tables. Bibliographic and circulation data are obtained from NOTIS through the screen scraping process and imported into the warehouse. FCLA maintains acquisitions data in the form of DB2 tables; these are also imported into the warehouse after conversion to a suitable format.

• Warehouse. The warehouse consists of several relational tables connected by means of relationships. The universal relation approach could have been used instead, implementing the warehouse as a single table on the argument that all the collected data fall under the same domain. But let us examine why this approach would not have been suitable. The data collected for import into the warehouse were bibliographic data, circulation data, order data, and pay data. If all these data were incorporated into one single table with many attributes, the table would not be of any exceptional use, since each set of attributes has its own unique meaning when grouped together as a bibliographic table, circulation table, and so on. Grouping the circulation data and the pay data together in a single table, for example, would not make sense; the pay data and the circulation data are instead related through the Bib_key. Hence, our use of the conventional approach of several tables connected by means of relationships is more appropriate.

Figure 3. Structure of the Library Data Warehouse [the user accesses bibliographic, circulation, and pay data views derived from the warehouse tables (Ufbib, Ufpay, Ufinv, Ufcirc, Uford), which are populated from NOTIS via screen scraping and from the FCLA DB2 tables via import]

• Views. A view in SQL terminology is a single table that is derived from other tables. These other tables could be base tables or previously defined views. A view does not necessarily exist in physical form; it is considered a virtual table, in contrast to base tables, whose tuples are actually stored in the database. In the context of the LDSS, views can be implemented by means of the Ad Hoc Query Wizard. The user can define a query/view using the Wizard and save it for future use. The user can then define a query on this query/view (a sketch of such a view follows this list).

• Summarization. Implementing views is part of the summarization process. Summarization provides the user with views that make it easier to query the data of interest.
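As an illustration, a circulation data view like the one in figure 3 could be expressed as a saved query over the appendix A tables, joined on the Bib_key. The SQL below is a sketch executed here through Python and ODBC; in the actual system such views are defined and saved through the Ad Hoc Query Wizard rather than written by hand, and the database path is hypothetical.

# Sketch: a "circulation data view" -- titles joined with their
# circulation counts via Bib_key, using table and column names
# from appendix A.
import pyodbc

conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\ldss\warehouse.mdb"  # hypothetical path
)
CIRC_VIEW_SQL = """
    SELECT b.Bib_key, b.Subject_1, c.Charges, c.Browses, c.Last_Use
    FROM UFBIB_WH AS b INNER JOIN UFCIRC_WH AS c
        ON b.Bib_key = c.Bib_key
"""
for row in conn.cursor().execute(CIRC_VIEW_SQL):
    print(row.Bib_key, row.Subject_1, row.Charges)
conn.close()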

As explained above, the specific warehouse we established consists of five tables. A table name ending in "_WH" indicates that the table contains current detailed data of the warehouse, that is, the most recent snapshot of data taken from the NOTIS system. The summarized views are derived from the current detailed data. Since the current detailed data are the basic data of the application, only the current detailed data tables are shown in appendix A.

Decision Support by Querying the Warehouse

The warehouse contains a set of integrated relational tables whose contents are linked by the common primary key, the Bib_key (Biblio_key). The data stored across these tables can be traversed by matching the key values associated with their tuples or records. Decision makers can issue all sorts of SQL-type queries to retrieve useful information from the warehouse. Two general types of queries can be distinguished: predefined queries and ad hoc queries. The former type refers to queries that are frequently used by decision makers for accessing information from different snapshots of data imported into the warehouse. The latter type refers to queries that are exploratory in nature: a decision maker suspects that there is some relationship between different types of data and issues a query to verify the existence of such a relationship. Alternatively, data mining tools can be applied to analyze the data contents of the warehouse and discover rules of their relationships (or associations).

Predefined Queries 

Below are some sample queries posed in English. Their corresponding SQL queries can be processed using LDSS.

1. Number and percentage of approval titles circulated and noncirculated.
2. Number and percentage of firm order titles circulated and noncirculated.
3. Amount of financial resources spent on acquiring noncirculated titles.
4. Number and percentage of DLC/DLC cataloging records in circulated and noncirculated titles.
5. Number and percentage of "shared" cataloging records in circulated and noncirculated titles.
6. Numbers of original and "shared" cataloging records of noncirculated titles.
7. Identify the broad subject areas of circulated and noncirculated titles.
8. Identify titles that have been circulated "n" number of times and by subjects.
9. Number of circulated titles without the 505 field.

Each of the above English queries can be realized by a number of SQL queries. We shall use the first two English queries and their corresponding SQL queries to explain how the data warehouse contents and the querying facility of Microsoft Access can be used to support decision making. The results of the SQL queries also are given. The first English query can be divided into two parts (see figure 4), each realized by a number of SQL queries as shown below.

Sample Query Outputs

Query 1: Number and percentage of approval titles circulated and noncirculated

Result: Total approval titles   1172
        Circulated               980   83.76%
        Noncirculated            192   16.24%

Similar to the above SQL queries, we can translate the second English query into a number of SQL queries; the result is given below.

Query 2: Number and percentage of firm order titles circulated and noncirculated

Result: Total firm order titles   1829
        Circulated                1302   71.18%
        Noncirculated              527   28.82%

Report Generation

The results of the two predefined English queries can be presented to users in the form of a report.

Total titles        3001
Approval            1172   39%
  Circulated         980   83.76%
  Noncirculated      192   16.24%
Firm Order          1829   61%
  Circulated        1302   71.18%
  Noncirculated      527   28.82%

From the above report, we can ascertain that, though 39 percent of the titles were purchased through the approval plan and 61 percent through firm orders, the approval titles have a higher rate of circulation (83.76 percent) than the firm order titles (71.18 percent). It is important to note that the result of the above queries is taken from only one snapshot of the circulation data. Analysis of several snapshots is needed in order to compare the results and arrive at reliable information.

We now present a report on the financial resources spent on acquiring and processing noncirculated titles. To generate this report, we need the output of queries four and five listed earlier in this article. The corresponding outputs are shown below.

Query 4: Number and percentage of DLC/DLC cataloging records in circulated and noncirculated titles.


Result: Total DLC/DLC records   2852
        Circulated              2179   76.40%
        Noncirculated            673   23.60%

Query 5: Number and percentage of "shared" cataloging records in circulated and noncirculated titles.

Result: Total "shared" records   149
        Circulated               100   67.11%
        Noncirculated             49   32.89%

To come up with the financial figures, we need to consider several factors that contribute to the amount of financial resources spent. For the sake of simplicity, we consider only the following factors:

1. the cost of cataloging each item with a DLC/DLC record
2. the cost of cataloging each item with a shared record
3. the average price of noncirculated books
4. the average pages of noncirculated books
5. the value of shelf space per centimeter

Because the values of the above factors differ from institution to institution and might change with more efficient workflows and better equipment, users are required to fill in the values for factors 1, 2, and 5; LDSS can compute factors 3 and 4. The financial report, taking into consideration the values of the above factors, could be as shown below.

Processing cost of each DLC title = $10.00
673 × $10.00 = $6,730.00
Processing cost of each shared title = $20.00
49 × $20.00 = $980.00
Average price paid per noncirculated item = $48.00
722 × $48.00 = $34,656.00
Average size of book = 288 pages = 3 cm
Average cost of 1 cm of shelf space = $0.10
722 × $0.30 = $216.60
Grand total = $42,582.60

Approval Titles Circulated

SQL query to retrieve the distinct bibliographic keys of all the approval titles:

SELECT DISTINCT BibScreen.Bib_key
FROM BibScreen RIGHT JOIN pay1 ON BibScreen.Bib_key = pay1.BIB_NUM
WHERE (((pay1.FUND_KEY) Like "*07*"));

SQL query to count the number of approval titles that have been circulated:

SELECT Count(Appr_Title.Bib_key) AS CountOfBib_key
FROM (BibScreen INNER JOIN Appr_Title ON BibScreen.Bib_key = Appr_Title.Bib_key)
    INNER JOIN ItemScreen ON BibScreen.Bib_key = ItemScreen.Biblio_key
WHERE (((ItemScreen.CHARGES)>0))
ORDER BY Count(Appr_Title.Bib_key);

SQL query to calculate the percentage:

SELECT Cnt_Appr_Title_Circ.CountOfBib_key,
    Int(([Cnt_Appr_Title_Circ]![CountOfBib_key])*100/Count([BibScreen]![Bib_key])) AS Percent_apprcirc
FROM BibScreen, Cnt_Appr_Title_Circ
GROUP BY Cnt_Appr_Title_Circ.CountOfBib_key;

Approval Titles Noncirculated

SQL query to count the number of approval titles that have not been circulated:

SELECT DISTINCT Count(Appr_Title.Bib_key) AS CountOfBib_key
FROM (Appr_Title INNER JOIN BibScreen ON Appr_Title.Bib_key = BibScreen.Bib_key)
    INNER JOIN ItemScreen ON BibScreen.Bib_key = ItemScreen.Biblio_key
WHERE (((ItemScreen.CHARGES)=0));

SQL query to calculate the percentage:

SELECT Cnt_Appr_Title_Noncirc.CountOfBib_key,
    Int(([Cnt_Appr_Title_Noncirc]![CountOfBib_key])*100/Count([BibScreen]![Bib_key])) AS Percent_appr_noncirc
FROM BibScreen, Cnt_Appr_Title_Noncirc
GROUP BY Cnt_Appr_Title_Noncirc.CountOfBib_key;

Figure 4. Example of an English Query Divided into Two Parts

Again it is important to point out that several snapshots of the circulation data have to be taken to track and compare the different analyses before deriving reliable information.
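To make the report's arithmetic easy to reproduce, the short Python sketch below recomputes the grand total from the five factors; the only derived quantity is the per-title shelf-space cost, the $0.10-per-centimeter factor times the 3 cm average book thickness.

# Recompute the financial report from the factors the user supplies
# (factors 1, 2, and 5) and the figures LDSS derives (factors 3 and 4).
dlc_noncirc, shared_noncirc = 673, 49         # from queries 4 and 5
noncirc = dlc_noncirc + shared_noncirc        # 722 noncirculated titles

dlc_cost, shared_cost = 10.00, 20.00          # cataloging cost per title
avg_price = 48.00                             # avg price, noncirculated
shelf_cm_cost, avg_cm = 0.10, 3               # $/cm and avg book thickness

total = (dlc_noncirc * dlc_cost               # $6,730.00
         + shared_noncirc * shared_cost       # $980.00
         + noncirc * avg_price                # $34,656.00
         + noncirc * shelf_cm_cost * avg_cm)  # $216.60
print(f"Grand total = ${total:,.2f}")         # $42,582.60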

Ad Hoc Queries

Alternatively, if the user wishes to issue a query that has not been predefined, the Ad Hoc Query Wizard can be used. The following example illustrates its use. Assume the sample query is: How many circulated titles in the English subject area cost more than $35?

We now take you on a walk-through of the Ad Hoc Query Wizard, starting from the first step until the output is obtained.

Figure 4 depicts Step 1 of the Ad Hoc Query Wizard. The sample query mentioned above requires the following fields:

• Biblio_key, for a count of all the titles that satisfy the given condition;
• Charges, to specify the criterion of "circulated title";
• Fund_Key, to specify all titles under the "English" subject area;
• Paid_Amt, to specify all titles that cost more than $35.

Step 2 of the Ad Hoc Query Wizard (figure 5) allows the user to specify criteria and thereby narrow the search domain. Step 3 (figure 6) allows the user to specify any mathematical operations or aggregation functions to be performed. Step 4 (figure 7) displays the user-defined query in SQL form and allows the user to save the query for future reuse. The output of the query is shown in figure 8: the number of circulated titles in the English subject area that cost more than $35. Alternatively, the user might wish to obtain a listing of these 33 titles; figure 9 shows the listing.
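For comparison with the wizard-generated SQL of figure 7, the walk-through query corresponds roughly to the following statement, shown here executed through Python. The table and column names follow the article's earlier SQL examples; the fund-key pattern used to pick out the English subject area is an illustrative assumption, and through ODBC the ANSI "%" wildcard replaces the "*" wildcard the Access interface uses.

# Sketch: "How many circulated titles in the English subject area
# cost more than $35?" -- counting distinct titles meeting all criteria.
import pyodbc

conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\ldss\warehouse.mdb"  # hypothetical path
)
SQL = """
    SELECT COUNT(*) FROM (
        SELECT DISTINCT i.Biblio_key
        FROM ItemScreen AS i INNER JOIN pay1 AS p
            ON i.Biblio_key = p.BIB_NUM
        WHERE i.CHARGES > 0 AND p.PAID_AMT > 35
          AND p.FUND_KEY LIKE '%ENG%'
    ) AS t
"""  # '%ENG%' is a hypothetical English fund-key pattern
count = conn.cursor().execute(SQL).fetchval()
print(count, "circulated English titles cost more than $35")
conn.close()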

Figure 4. Step 1: Ad Hoc Query Wizard

Figure 5. Step 2: Ad Hoc Query Wizard [the criteria panel lets the user pick an operator for a chosen field, e.g., Like, Less than, Greater than, Equal to, or Not, together with a comparison value such as Charges > 0]

Figure 6. Step 3: Ad Hoc Query Wizard

Figure 7. Step 4: Ad Hoc Query Wizard

Figure 8. Query Output

Figure 9. Listing of Query Output

Conclusion

In this article, we presented the design and development of a library decision support system based on data warehousing and data mining concepts and techniques. We described the functions of the components of LDSS. The screen scraping and data cleansing and extraction


processes were described in detail. The process of importing data stored in LUIS as separate data files into the library data warehouse was also described. The data contents of the warehouse can provide a very rich information source to aid the library management in decision making. Using the implemented system, a decision maker can use the GUI to establish the warehouse and to activate the querying facility provided by Microsoft Access to explore the warehouse contents. Many types of queries can be formulated and issued against the database. Experimental results indicate that the system is effective and can provide pertinent information for aiding the library management in making decisions. We have fully tested the implemented system using a small sample database. Our ongoing work includes the expansion of the database size and the inclusion of a data mining component for association rule discovery. Extensions of the existing GUI and report generation facilities to accommodate data mining needs are expected.

Acknowledgments
We would like to thank Professor Stanley Su for his support and advice on the technical aspects of this project. We would also like to thank Donna Alsbury for providing us with the DB2 data, and Daniel Cromwell for loading the DB2 files and, along with Nancy Williams and Tim Hartigan, for helpful comments and valuable discussions on this project.

References and Notes 

1. John Ladley , "Operational Data Stores: Building an 
Effective Strategy, " Data Warehouse: Practical Advice from the 
Experts (Englewood Cliffs, N.J.: Prentice Hall , 1997). 

2. Information on Har vard University's ADAPT proj ect. 
Accessed March 8, 2000, www.adapt.harvard .edu/; Information 
on the Arizona State University Data Administration and 
Institutional Analysis warehou se. Accessed March 8, 2000, 
www .asu .edu / Data_Admin / WH-1.html; Information on the 
University of Minnesota CLARITY project. Accessed March 8, 

2000,www.clarity.umn .edu/; Information on the UC San Diego 
DARWIN project. Accessed March 8, 2000, www.act .ucsd .edu/ 
dw I darwin.html; Information on University of Wisconsin-
Madison InfoAccess . Accessed March 8, 2000, http :/ / wiscinfo. 
doit.wisc .edu/infoac cess /; Information on the Univer sity of 
Nebraska Data Warehouse-nulook. Accessed March 8, 2000, 
www .nulook.uneb.edu /. 

3. Ramon Barquin and Herbert Edelstein, eds ., Building, 
Using, and Managing the Data Warehouse (Englewood Cliffs, N .J.: 
Prentice Hall , 1997); Ramon Barquin and Herbert Edelstein, 
eds ., Planning and Designing the Data Warehouse (Upper Saddle 
River, N.J .: Prentice Hall, 1996); Joyce Bischoff and Ted 
Alexander, Data Warehouse: Practical Advice from the Experts 
(Englewood Cliffs, N.J.: Prentice Hall , 1997); Jeff Byard and 
Donovan Schneider, "The Ins and Outs (and Everything in 
Between) of Data War ehousing ," ACM SIGMOD 1996 Tutorial 
Notes, May 1996. Accessed March 8, 2000, www .redbrick.com / 
product s/ white / pdf/sigmod96.pdf ; Surajit Chaudhuri and 
Umesh Dayal, "An Overview of Data Warehousing and OLAP 
Technolog ," ACM SIGMOD Record 26(1), March 1997. Accessed 
March 8, 2000, www.acm.org/sigmod / record/issue s/ 9703/ 
chaudhuri .ps ; B. Devlin , Data Warehouse: From Architecture to 
Implementation (Reading, Mass.: Addison-Wesle y, 1997); U. 
Fayyad and others, eds ., Advances in Knowledge Discovery and 
Data Mining (Cambridge, Mass.: The MIT Pr., 1996); Joachim 
Hammer, "Data War ehousing Overview, Terminology, and 
Research Issues." Accessed March 8, 2000, www.cise.ufl .edu/ 
-jhammer / classes / wh-seminar / Overview / index .htm ; W. H. 
Inmon, Building the Data Warehouse (New York, N.Y.: John Wiley, 
1996); Ralph Kimball , "Dangerous Preconceptions." Accessed 
March 8, 2000, www .dbmsmag.com/9608d05.html ; Ralph 
Kimball , The Data Warehouse Toolkit (New York, N.Y.: John Wiley, 
1996); Ralph Kimball, "Mastering Data Extraction," in DBMS 
Magazine, June 1996. (Provides an overview of the process of 
extracting , cleaning, and loading data .) Accessed March 8, 2000, 
www .dbmsmag.com / 9606d05 .html ; Alberto Mendelzon , 
"Bibliography on Data Warehousing and OLAP." Accessed 
March 8, 2000, www.cs.toronto.edu/-mendel/dwbib.html. 

4. Daniel J. Boorstin, "The Age of Negative Discovery," 
Cleopatra's Nose: Essays on the Unexpected (New York: Random 
Hous e, 1994). 

5. Information on the ARROW system . Accessed March 8, 
2000,www . fcla.edu /s ystem/intro_arrow.html. 

6. Gary Strawn, "BatchBAMing." Accessed March 8, 2000, 
http:/ /web .uflib.ufl .edu/rs/rsd/batchbam .html. 

7. Li-Min Fu, "OOMRUL: Leaming the Domain Rules ." 
Accessed March 8, 2000, www .cise.ufl .edu / -fu / domrul.html. 


Appendix A Warehouse Data Tables 

UFCIRC_WH

Attribute     Domain
Bib_key       Text(50)
Status        Text(20)
Enum/Chron    Text(20)
MidSpine      Text(20)
Temp_Locatn   Text(20)
Pieces        Number
Charges       Number
Last_Use      Date/Time
Browses       Number
Value         Text(20)
Invnt_Date    Date/Time
Created       Date/Time

UFORD_WH

Attribute       Domain
Id              AutoNumber
Ord_num         Text(20)
Ord_Div         Number
Process_Unit    Text(20)
Bib_num         Text(20)
Order_date      Date/Time
Mod_Date        Date/Time
Vendor_Code     Text(20)
VndAdr_Order    Text(20)
VndAdr_Claim    Text(20)
VndAdr_Return   Text(20)
Vend_Title_Num  Text(20)
Ord_Unit        Text(20)
Rcv_Unit        Text(20)
Ord_Scope       Text(20)
Pur_Ord_prod    Text(20)
Action_Int      Number
LibSpec1        Text(20)
LibSpec2        Text(20)
Vend_Note       Text(20)
Ord_Note        Text(20)
Source          Text(20)
Ref             Text(20)
CopyCtl_Num     Number
Medium          Text(20)
Piece_Cnt       Number
Div_Note        Text(20)
Acr_Stat        Text(20)
Rel_Stat        Text(20)
Lst_Date        Date/Time
Action_Date     Text(20)
LibSpec3        Text(20)
LibSpec4        Text(20)
Encumb_Units    Number
Currency        Text(20)
Est_Price       Number
Encumb_outs     Number
Fund_key        Text(20)
Fiscal_Year     Text(20)
Copies          Number
Xpay_Method     Text(20)
Vol_Isu_Date    Text(20)
Title_Author    Text(20)
DB2_TimeStamp   Date/Time

UFPAY_WH

Attribute      Domain
Inv_key        Text(20)
Ord_num        Text(20)
Ord_div        Number
Process_Unit   Text(20)
Bib_key        Text(20)
Ord_Seq_Num    Number
Inv_Seq_Num    Number
Status         Text(20)
Create_Date    Date/Time
Lst_update     Date/Time
Currency       Text(20)
Paid_amt       Number
USD_amt        Number
Fund_Key       Text(20)
Exp_class      Text(20)
Fiscal_year    Text(20)
Copies         Number
Type_pay       Text(10)
Text           Text(20)
DB2_TimeStamp  Date/Time

UFINV_WH

Attribute           Domain
Inv_Key             Text(20)
Create_Date         Date/Time
Mod_Date            Date/Time
Approv_Stat         Text(20)
Vend_Adr_Code       Text(20)
Vend_Code           Text(20)
Action_Date         Text(20)
Vend_Inv_Date       Date/Time
Approval_Date       Date/Time
Approver_Id         Text(20)
Vend_Inv_Num        Text(20)
Inv_Tot             Number
Calc_Tot_Pymts      Number
Calc_Net_Tot_Pymts  Number
Currency            Text(20)
Discount_Percent    Number
Vouch_Note          Text(20)
Official_Vend       Text(20)
Process_Unit        Text(20)
Internal_Note       Text(20)
DB2_TimeStamp       Text(20)

UFBIB_WH

Attribute           Domain
Bib_key             Text(20)
System_Control_Num  Text(50)
Catalog_Source      Text(20)
Lang_Code_1         Text(20)
Lang_Code_2         Text(20)
Geo_Code            Text(20)
Dewey_Num           Text(20)
Edition             Text(20)
Pagination          Text(20)
Size                Text(20)
Series_440          Text(20)
Series_490          Text(20)
Content             Text(20)
Subject_1           Text(20)
Subject_2           Text(20)
Subject_3           Text(20)
Authors_1           Text(20)
Authors_2           Text(20)
Authors_3           Text(20)
Series              Text(20)
