AUTOMATIC RETRIEVAL OF BIOGRAPHICAL 
REFERENCE BOOKS 

Cherie B. WEIL: Institute for Computer Research, Committee on 
Information Science, University of Chicago, Chicago, Illinois 


A description of one of the first projected attempts to automate a reference
service, that of advising which biographical reference book to use. 
Two hundred and thirty-four biographical books were categorized as to 
type of subjects included and contents of the uniform entries they con-
tain. A computer program which selects up to five books most likely to 
contain answers to biographical questions is described and its test results 
presented. An evaluation of the system and a discussion of ways to extend 
the scheme to other forms of reference work are given. 

Ideally the reference librarian is the "middleman between the reader 
and the right book" (1), and this is what the program here described is 
intended to be. In the past there has been very little interest shown in 
automating this service, probably because it is neither urgent nor practical 
in current reference departments. Many developments in automating other 
areas of libraries have indirectly benefitted reference librarians, and the 
literature primarily emphasizes this aspect. For instance, where circulation 
systems have been automated, the location of a particular volume can be 
quickly ascertained and librarians need not waste time searching. Auto-
mation of the ordering phase provides them with information on the proc-
essing stage of a new volume. If the contents of the catalog have been 
put in machine readable form, special bibliographies can be rapidly pro-
duced in response to a particular request or as a regular service of selective 
dissemination. The development of KWIC (Key Word In Context) indexes,
which are compiled and printed by computer, has enabled publishers to
provide indexes to their books much faster. Computers have also
been programmed to make concordances and citation indexes (2). The
combination of paper-tape typewriters, computer and a photocomposer
has introduced automation into compiling Index Medicus (3).

Journal of Library Automation Vol. 1/4 December, 1968

Changes in reference services themselves, however, may make automa-
tion of question-answering practical. One trend is toward larger reference 
collections to be shared by several libraries; some areas have already set 
up regional reference services. There are also cooperative reference plans 
whereby several strong libraries agree to specialize in certain fields and 
cooperate in answering questions referred by the others (4). These trends 
will mean two things to reference librarians: greater concentration of re-
sources, allowing more specialized books and mechanization; and screen-
ing of questions at the local level, letting reference centers concentrate 
on more complex questions that utilize their specialized books. Thus it 
seems likely that special reference centers may look increasingly toward 
mechanizing their services, and retrieval schemes of the type presented 
here will be important to consider. 

BASIC ASSUMPTIONS 
The categorizing system was based on two nearly universal generalizations
about biographical reference books: 1) They are consistently confined
to biographies of persons who have something in common: for ex-
ample, being alive or dead; or having the same nationality, sex, occupa-
tion, religion, race, memberships; or possessing some combination of those 
attributes. These common characteristics in the people covered by a given 
book are herein called "exclusive categories." 2) The books generally 
maintain uniform entries for each subject; that is, they give the same data 
for each biography. These facts are referred to herein as "specifics" or 
"specific categories." 

Certain assumptions were made about reference work: 1) All biographi-
cal reference books fit into the scheme and can be categorized. 2) The 
more limited a book's scope, the more likely it is to contain the person 
a user wants to find. In other words, if a user is interested in a Dutch 
economist, he is more likely to find information in a book limited to Dutch 
economists than in a general biographical dictionary. The user, however, 
does not want to miss any source that might be useful. Therefore a gen-
eral biographical dictionary should be given to him as a last resort, after 
books on Dutch economists, Dutchmen of all occupations, and economists 
of all nationalities. 3) Certain requirements, the specifics, have no substi-
tutes. For example, a book lists addresses or it does not, and if a user 
wants an address, books without them are useless. 

There is merit in suggesting to a user which book to use as opposed to 
giving him the direct answer to his question. Probably the best argument 
for this assumption is that the volume of names that would have to be
compiled and stored for a direct inquiry system is staggering, only a small
number would ever be looked up, and it is impossible to predict which 
ones would be searched for. 

There are advantages to mechanizing this particular task of a reference 
librarian: good reference librarians should be freed to perform work less 
easily mechanized; there are not enough reference librarians who have 
perfect recall of their collections even to knowing which exclusive cate-
gories all the books fit into; and no librarian could have complete recall 
as to the specifics contained in each biographical reference book in the 
collection. 

THE COMPUTER PROGRAM 
The program was written in the COMIT language (5), a non-numerical
programming language developed for research in mechanical translation,
information retrieval and artificial intelligence. It is a high-level problem-
oriented language for symbol manipulation, especially designed for deal-
ing with strings of characters. The program could probably be converted 
to other list-processing languages (6) for operation at other installations. 

The program was run at the University of Chicago Computation Center 
on an IBM 7094 having the COMIT system on a disk. Questions were 
submitted and run in large batches.

THE DATA 
All biographical reference books in English, with alphabetical ordering 
of subjects, which are in the reference room of the University of Chicago's 
Harper Library were included in the data and no other books were in-
cluded. Since one assumption was that all biographical reference books 
could be categorized by the scheme, it seemed more useful to prove the 
system could handle any biographical reference tool than to compile a 
balanced list of biographical books. There was no difficulty in categorizing 
the books. 

All books are categorized in the following way. First an arbitrary ab-
breviation for the book is chosen to be its entry in the file; it is referred 
to as a "constituent." Each book is then described by determining the 
values of nine subscripts each constituent carries, the subscripts being 
SEX, LIVING, NAT (nationality), OCCUP (occupation), MIN (minorities),
DATE, INDEX, SPEC1 and SPEC2 (specifics).

Values of the first five subscripts (the exclusive categories) are first determined.
That is, is the book limited to one sex? Are all the subjects living
or dead? Do they all have a certain occupation? Does the book include 
only certain nationalities? Or is there another restriction; e.g., to alumni 
of a college, members of the nobility or a religious group? The exclusive 
categories for a book are determined and coded from a table of abbreviations.
SEX, for example, allows three values: restricted to males (M), restricted
to females (F), or no restriction (Z). Also a value X must occur
with M or F, indicating there is a restriction. Therefore SEX can have
the following combinations: SEX Z, SEX F X, or SEX M X; the values 
M X and F X are both the opposite of Z. 

Next the book's date is determined by asking "At what date did the
values on LIVING (yes or no) apply?" Or, if the subjects are not restricted
to living or dead (LIVING Z), "When was the book up to date?"

Next any indexes to the biographies are noted. All the biographical 
books list subjects in alphabetical order by surname. Lists of subjects in 
any other order are considered indexes even if the subjects are actually 
listed in some other order in the main body and the list that is alphabetic 
by surname is an index. 

Finally, specific categories (SPEC1 and SPEC2) are coded for such
facts as birthdate, birthplace, college attended, degrees held, hobbies,
illustrations, social clubs, and marital status.

When all categorizing is finished, a data item is punched in this form:
DICTPHILBIO / INDEX FIELD X, LIVING N X, OCCUP Z, SEX Z,
NAT PHILIP ASIAN X, SPEC1 DC DS FL BP L CL CM DG E I Z,
DATE 50S X, SPEC2 P PL R MS PD Z, MIN Z +. This represents the
Dictionary of Philippine Biography, a book limited to dead Filipinos and 
giving for each entry: dates, career, descendants, field, birthplace, long 
articles, class in college, degrees, education, picture, parents, publications, 
references, marital status and physical description. The book has a special 
index to find subjects by their field of work. 
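The layout of such a data item can be illustrated with a short Python sketch. This is not the author's COMIT code; the function name parse_item is hypothetical, and the sample string writes SEX as Z (no restriction), following the coding table described above.

```python
def parse_item(text):
    """Split 'NAME / SUB v1 v2, SUB v1, ... +' into a name and a
    dictionary mapping each subscript to its set of values."""
    text = text.strip()
    if text.endswith("+"):            # '+' closes every item
        text = text[:-1]
    name, _, rest = text.partition("/")
    subscripts = {}
    for field in rest.split(","):
        parts = field.split()
        if parts:                     # first word is the subscript, the rest are values
            subscripts[parts[0]] = set(parts[1:])
    return name.strip(), subscripts


item = ("DICTPHILBIO / INDEX FIELD X, LIVING N X, OCCUP Z, SEX Z, "
        "NAT PHILIP ASIAN X, SPEC1 DC DS FL BP L CL CM DG E I Z, "
        "DATE 50S X, SPEC2 P PL R MS PD Z, MIN Z +")
name, subs = parse_item(item)
print(name, sorted(subs["LIVING"]), sorted(subs["NAT"]))
```

Holding the values of each subscript as a set makes the later tests ("does the book carry this value, or Z?") simple membership checks.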

One specific value, that for a long article, requires special mention. 
Though most biographical reference books provide the same facts about 
all the subjects in list form, a few provide different facts about different 
subjects in a narrative form. Such books carry the value SPEC1 L, and the
other specifics these books are listed as providing are not always given 
for every subject. For example, a book with a list format may provide 
the birthplace for every subject when it can be ascertained, but in a book 
using the narrative form, where often different authors write the articles, 
birthplace is not necessarily given. Books in narrative form are used less 
for quick reference; therefore the program provides a note, when a long 
article is requested, that the card catalog may provide more long articles 
on the subject. 

Ease of file maintenance is one advantage of this system. As data is 
analyzed in the first place, if a new value for a category is required, such 
as an occupation which is not in the list, the new value is simply added 
under OCCUP for that particular book and in the list of abbreviations 
for future use. It is a little more complicated to make an existing value 
more specific. For example, to differentiate BOTANIST, CHEM, PHYS-
ICS and ASTRON and still maintain SCIENTIST as a general category 
embracing them all, another short program is required to retrieve the data 
to be reclassified. 




CODING THE QUESTION 
A biographical question can be quickly coded. The nine required sub-
scripts are the same as those for the data books, but only one value for 
each subscript is necessary. For example, "What are the publications of a
living Dutch economist? A current book is desired." is coded as Q / SEX Z
(or M), LIVING Y, NAT DUTCH, OCCUP ECON, MIN Z, INDEX Z,
DATE 60S, SPEC1 Z, SPEC2 PL +.

OPERATION OF THE PROGRAM 

Briefly, the program reads in data and then the first question. It weeds 
out data items that can never be suitable, discarding all but those items 
that have the same values as the question has on the subscripts INDEX, 
SPEC1 and SPEC2. It then weeds out data items that do not have either
the same values as the question, or the value Z, on the subscripts OCCUP,
NAT, MIN, SEX and LIVING. After each weeding the program checks
to determine that there are data items left; if all the books have been 
weeded out, there are no answers. There is also a provision to allow the 
user to designate certain titles to be ignored on a particular question in 
case he has already checked them, for example. 
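The two weeding passes might be sketched in Python as follows. This is an illustration under stated assumptions, not the original COMIT routine: the book names and the helper make_book are invented, and a question value of Z is read as "no requirement" on that subscript, which is one interpretation of the description above.

```python
ALL = ("SEX", "LIVING", "NAT", "OCCUP", "MIN", "DATE", "INDEX", "SPEC1", "SPEC2")
REQUIRED = ("INDEX", "SPEC1", "SPEC2")                 # no substitutes allowed
EXCLUSIVE = ("OCCUP", "NAT", "MIN", "SEX", "LIVING")   # Z on the book is acceptable


def make_book(**limits):
    """Hypothetical helper: every subscript is Z unless values are given."""
    subs = {s: {"Z"} for s in ALL}
    for s, values in limits.items():
        subs[s] = set(values.split())
    return subs


def weed(books, question, ignore=()):
    kept = {}
    for name, subs in books.items():
        if name in ignore:                             # titles the user already checked
            continue
        # First weeding: the book must offer every index and specific requested.
        if any(question[s] != "Z" and question[s] not in subs[s] for s in REQUIRED):
            continue
        # Second weeding: the book must cover the subject or be unrestricted (Z).
        if any(question[s] != "Z" and question[s] not in subs[s] and "Z" not in subs[s]
               for s in EXCLUSIVE):
            continue
        kept[name] = subs
    return kept


books = {
    "DUTCHECON": make_book(NAT="DUTCH X", OCCUP="ECON X", LIVING="Y X", SPEC2="PL Z"),
    "GENERALBIO": make_book(SPEC2="PL Z"),
    "FRENCHCHEM": make_book(NAT="FRENCH X", OCCUP="CHEM X", SPEC2="PL Z"),
    "NOPUBS": make_book(),
}
question = {"SEX": "Z", "LIVING": "Y", "NAT": "DUTCH", "OCCUP": "ECON", "MIN": "Z",
            "INDEX": "Z", "DATE": "60S", "SPEC1": "Z", "SPEC2": "PL"}
survivors = weed(books, question)
print(sorted(survivors))
```

Here the book limited to French chemists and the book that gives no publications are discarded, while the general dictionary survives as a possible last resort.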

All data items left after weeding are potential answers and could simply 
be printed out. However, subsequent searches over the remaining items 
serve the purpose of rearranging them into an order in which they are 
more likely to produce answers. It was decided that five answers are 
enough to judge the types of titles chosen yet few enough to avoid very 
long searches. A shorter list of answers would obviously be cheaper and 
a longer list more likely to produce a book containing the desired subject. 

Ordering proceeds as follows: first values of subscripts SEX, LIVING, 
MIN, OCCUP, NAT and DATE on the question as originally stated are 
matched to those of books in the data. The computer is at this stage 
searching for books that are limited in just those categories in which the 
question is limited. For example, the question Q / SEX Z, LIVING Y,
MIN Z, NAT DUTCH, OCCUP ECON, INDEX Z, DATE 60S, SPEC1
Z, SPEC2 PL + will match only those books published in the 1960's and
restricted to living Dutch economists which give publications for all the
subjects (or the majority), and the books cannot be restricted to a sex or
to any "minority" group. The books found may or may not have additional
values on the subscripts; that is, a book may also contain French economists.
Such books found on the first search are most likely to contain
the subject the questioner is looking for.

If there are fewer than five books found which are a perfect match with 
the question, the program begins to alter the question. To make the least 
significant possible change in the question, the program changes the 
value of the subscript judged to be the limiting factor on the fewest 
books in the data, namely sex. If SEX has a Z as its value (because the 
questioner did not know the sex or did not prefer a book limited to one
sex) it is changed to X so that a book limited to one sex will not be overlooked.
If SEX does not have a Z value (which means it has either M X
or F X), it is changed to Z. This means the questioner preferred books 
limited to one sex but presumably his second choice is books not limited 
to any sex. Clearly if the question has SEX F X it can never be changed 
to SEX M X or SEX X, since SEX X will find books in the data classified 
SEX M X. Anything other than Z changes to Z, and Z only changes to X. 
After this change is made, another search is conducted and the answers 
counted. Until there are five books or the data is exhausted, the original 
question is altered and the cycle continued. Alterations proceed by chang-
ing the values of one subscript at a time in the following order: SEX, 
LIVING, MIN, NAT and OCCUP. Then they are changed two at a time, 
three at a time, four at a time, and finally all five are changed, so there 
are thirty-one possible changes. 
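The count of thirty-one follows from taking every non-empty subset of the five subscripts (2^5 - 1 = 31). A brief Python sketch of the schedule and of the Z/X rule is given here; the function names are hypothetical, and the order of subsets within each size is an assumption, since the text fixes only the order of the single changes.

```python
from itertools import combinations

ORDER = ("SEX", "LIVING", "MIN", "NAT", "OCCUP")


def alteration_schedule():
    """Yield the subscripts to alter: one at a time, then two, ... then all five."""
    for size in range(1, len(ORDER) + 1):
        yield from combinations(ORDER, size)


def alter(question, subscripts):
    """Copy the question, changing Z to X and anything other than Z to Z."""
    altered = dict(question)
    for s in subscripts:
        altered[s] = "X" if question[s] == "Z" else "Z"
    return altered


schedule = list(alteration_schedule())
question = {"SEX": "Z", "LIVING": "Y", "MIN": "Z", "NAT": "DUTCH", "OCCUP": "ECON"}
print(len(schedule))
print(alter(question, ("SEX", "NAT")))
```

Each alteration is applied to the question as originally stated, not to the previously altered version, so a value such as SEX F X can become Z but never M X.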

If at the end of the thirty-second search there are still not five answers 
and there are more data items, the date restriction on the question is 
checked. If DATE has a value other than Z, it is changed to Z, which 
matches all the data items, and the computer prints a note if this is done; 
the program will then select any book regardless of date. Control returns 
to search and begins the cycle again, continuing until five answers are 
found or the data is exhausted. 
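The whole cycle, including the final relaxation of DATE, can be sketched in Python as follows. This is a simplified illustration and not the original program: weeding on INDEX and the specifics is omitted, the book names and the helper make_book are invented, and the matching rule (a question value must appear among the book's values) is an assumption that happens to reproduce the behavior described for Z, X and specific values.

```python
from itertools import combinations

ORDER = ("SEX", "LIVING", "MIN", "NAT", "OCCUP")      # order of alteration


def make_book(date, **limits):
    """Hypothetical helper: unrestricted (Z) unless a limit such as NAT='DUTCH' is given."""
    book = {s: {"Z"} for s in ORDER}
    for s, value in limits.items():
        book[s] = {value, "X"}                        # X marks a restriction
    book["DATE"] = {date, "X"}
    return book


def matches(book, q):
    """Z finds unrestricted books, X any restricted book, a value only books carrying it."""
    if any(q[s] not in book[s] for s in ORDER):
        return False
    return q["DATE"] == "Z" or q["DATE"] in book["DATE"]


def search(books, question, wanted=5):
    answers = []
    q = dict(question)
    while True:
        for size in range(len(ORDER) + 1):            # size 0 is the question as stated
            for subset in combinations(ORDER, size):
                trial = dict(q)
                for s in subset:
                    trial[s] = "X" if q[s] == "Z" else "Z"
                for name, book in books.items():
                    if name not in answers and matches(book, trial):
                        answers.append(name)
                        if len(answers) == wanted:
                            return answers
        if q["DATE"] == "Z":
            return answers                            # nothing left to relax
        q["DATE"] = "Z"                               # the original program prints a note here


books = {
    "DUTCH50": make_book("50S", NAT="DUTCH", LIVING="Y"),
    "GENERAL60": make_book("60S"),
    "ECON60": make_book("60S", OCCUP="ECON", LIVING="Y"),
    "DUTCHECON60": make_book("60S", NAT="DUTCH", OCCUP="ECON", LIVING="Y"),
}
question = {"SEX": "Z", "LIVING": "Y", "MIN": "Z", "NAT": "DUTCH",
            "OCCUP": "ECON", "DATE": "60S"}
ranked = search(books, question)
print(ranked)
```

In this small example the book on Dutch economists is found on the first search, the book on economists of all nationalities after NAT is altered, the general dictionary after three subscripts are altered, and the 1950's book only after the date restriction is dropped.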

After searching is finished, the writing routine commences. One at a 
time the computer takes each answer, writes out its code for possible 
further reference, and then writes out the complete author, title, copy-
right date and Library of Congress call number, all of which the computer 
finds in a list within the program. 

RESULTS 
To obtain some measure of the program's accuracy, fourteen textbook
questions, probably more challenging than the average patron would ask,
were submitted to the computer and to a professional librarian who was
especially familiar with biographical reference books. (See Figure 1 for
sample questions and results.)

The librarian spent a total of an hour and a half, and found answers to 
eleven out of fourteen questions. On the three she could not answer she 
felt she had exhausted the resources. In one of the eleven she answered 
("How many Americans won the Nobel Prize in medicine between 1930 
and 1950?") she found the answer in a source not specifically biographical 
(World Almanac) and therefore not in the computer's data. 

No problems occurred in forming the questions for submission to the 
computer. The program found some reasonable sources in all cases. It 
found books containing the answer in ten out of fourteen cases, the four 
answers not found being those three the librarian missed and the one 
requiring an almanac. In all but one case there were more possibilities 
than the five books given per answer. Some questions were rerun ignoring
the first five answers, and five more titles were retrieved; even then there
were more possibilities.

Question: In one source find a list of at least twenty references to
biographical information about Dmitri Mendeleef (or Mendelev), Russian
chemist (1834-1907).

As submitted to computer: Q / SEX M, LIVING N, OCCUP CHEM, NAT RUSSIAN,
MIN Z, SPEC1 Z, SPEC2 R, INDEX Z, DATE Z +

Librarian's results:
  B  Phillips, Dictionary of Biographical Reference ... 0 references
  A  Encyclopedia Britannica ... 6 references
  A  Encyclopedia Americana ... 1 reference
  A  Biography Index (1949-64 volumes) ... 14 references
  time: 10 minutes

Computer's results:
  A  Index to Scientists ... 27 references
  A  Biography Index
  C  Drake, Dictionary of American Biography
     (sounds wrong but it is international.)
  B  Phillips, Dictionary of Biographical Reference
  A  Encyclopedia Britannica

Question: What academic degrees have been earned by Professor Reuben L.
Hill, Director of Family Study at the University of Minnesota?

As submitted to computer:
  (1) Q / SEX M, LIVING Y, OCCUP EDUC, NAT AMER, MIN Z, SPEC1 DG, SPEC2 Z,
      INDEX Z, DATE Z +
  (2) IGNORE + AMECONASSN +
      IGNORE + AMERSCIENCE +
      IGNORE + AMPOLISCI +
      IGNORE + DAMERSCHOL +
      IGNORE + LEADEDUC +
      Q / SEX M, LIVING Y, OCCUP EDUC, NAT AMER, MIN Z, SPEC1 DG, SPEC2 Z,
      INDEX Z, DATE Z

Librarian's results:
  B  Leaders in Education
  A  Who's Who in America - Answer: BS, PhM, PhD
  time: 3 minutes

Computer's results:
  (1) D  Handbook of the American Economic Association
      D  American Men of Science
      D  Biographical Directory of the American Political Science Association
      D  Directory of American Scholars
      B  Leaders in Education
  (2) B  Who's Who in American Education
      C  Outstanding Young Men of America
      A  Who's Who in America
      B  Who's Who in various areas
      B  National Cyclopedia of American Biography

Question: Where might I find information about a New England ancestor
named Jacob Billings who was born around 1753?

As submitted to the computer: Q / SEX M, LIVING N, OCCUP Z, NAT AMER,
MIN FF, INDEX Z, DATE Z, SPEC1 Z, SPEC2 Z +

Librarian's results:
  D  Handbook of Genealogy - about genealogists not families
  A  Compendium of American Genealogy
  time: 8 minutes

Computer's results:
  A  Compendium of American Genealogy
  C  Dictionary of American Biography
  C  Who Was Who in America
  C  Lamb's Biographical Dictionary of the U. S.
  C  Concise Dictionary of American Biography

A = It has the answer or at least part of it
B = Good choice but it does not have answer
C = Reasonable choice but there are better ones
D = Poor choice

Fig. 1. Sample Reference Questions.

In some cases the program did better than the librarian because she 
wasted time looking in sources that did not give the specifics sought. For 
instance, when the question asked for the pronunciation of the surname 
of Paul and Lucian Hillemaker, French composers, she looked in diction-
aries that do not give pronunciation. The computer found the only four 
possible sources immediately. 

In other cases the program came up with rather far-fetched answers a 
human would have skipped. A question asking for biographies of Franz 
Rakoczy, an Hungarian hero, retrieved in its second five sources three 
Jewish encyclopedias and a book on composers! These were not wrong 
and, in cases where occupation or minority group affiliations were un-
known, these might be good sources. 

As an answer to the Nobel-prize-winner question the computer retrieved 
sources on American doctors, Nobel winners and scientists, which are the 
best choices from the data and would have the answers buried in them. 
However, what is really required is an index to award winners, and there 
were none in the data. 

The test revealed the necessity for allowing questions to have dummy 
values; that is, ones not used in the data. For instance there are no books 
limited to botanists, so OCCUP BOTANIST is not allowed in a question, 
though OCCUP SCIENTIST is, and CHEM and PHYSICS are included 
as more specific values under SCIENTIST. Asking for OCCUP SCIEN-
TIST when searching for a botanist avoids getting books devoted to non-
scientific occupations but also gets books devoted to chemists and physi-
cists. Since one would want these books if he did not know the scientist 
was a botanist, that should not be changed. If he asks for OCCUP BOT-
ANIST he wants books devoted to botanists first, then scientists in general. 

A short-term solution is to have dummy values to stand for all these 
other values. For example OCCUP OTHER-SCIENTIST could include 
all scientific occupations except those specifically listed, and it would re-
trieve books limited to all scientists but not to specific scientific occupa-
tions mentioned in the data. A long-term solution is to use a computer 
language allowing tree-structured data. Presently this problem does no 
more than cause extraneous retrievals which the person using the list can 
easily skip. 
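The tree-structured arrangement suggested as a long-term solution might look like the following Python sketch. It is only an illustration: the PARENT table, the function names and the book codes are hypothetical, and the rule that a general value also retrieves books on its more specific values follows the discussion of OCCUP SCIENTIST above.

```python
# Each occupation points to its more general category (None at the top).
PARENT = {
    "BOTANIST": "SCIENTIST",
    "CHEM": "SCIENTIST",
    "PHYSICS": "SCIENTIST",
    "ASTRON": "SCIENTIST",
    "SCIENTIST": None,
}


def fallback_chain(value):
    """Return the value followed by its successively more general categories."""
    chain = []
    while value is not None:
        chain.append(value)
        value = PARENT.get(value)
    return chain


def books_for(occupation, books):
    """Books on the occupation (or anything under it) first, then more general books."""
    ordered = [name for name, occ in books.items()
               if occupation in fallback_chain(occ)]
    for level in fallback_chain(occupation)[1:]:
        ordered += [name for name, occ in books.items() if occ == level]
    return ordered


books = {"CHEMWHO": "CHEM", "SCIMEN": "SCIENTIST", "BOTDICT": "BOTANIST"}
print(books_for("BOTANIST", books))
print(books_for("SCIENTIST", books))
```

A request for BOTANIST then skips the book devoted to chemists, while a request for SCIENTIST still retrieves all three, which is the behavior the dummy values only approximate.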

DISCUSSION 

Some advantages of the scheme can be anticipated. From the library's point 
of view its virtues are that it is simple and inexpensive. Original imple-
mentation would not require a major block of time to be spent in human 
indexing or abstracting. Operating costs would be low because it does not 
require such a large store of information in memory that several tapes 
must be searched, and because updating the file is simple. When a new
book is added, an experienced person could categorize it in five minutes,
punch a new data card and, if required, add to the list of values in the 
table of abbreviations. 

The system could provide useful information to other departments. It 
could keep tallies for the acquisitions department of how often a book is 
given as an answer, indicating whether new editions of it or similar books 
would be good buys. 

From the user's point of view the system avoids a major pitfall of some 
retrieval schemes which retrieve on the basis of ambiguous terms or asso-
ciation chains; that is, missing relevant items. If the user resubmits the 
same question ignoring already retrieved books each time, he will eventu-
ally have a comprehensive list of possible sources in the data that have 
the index and specifics he requires. A user also wants his information as 
brief as possible, listed in order of importance and with no extraneous 
answers (7); this requirement could be met as the program stands by 
having a human simply cross out any unnecessary titles. Users like to 
know the reliability of the information (7); this detail could be provided 
along with the titles. 

Users also want speed and convenience. As it stands, this system could 
be made available to users of the University of Chicago Library tomorrow 
with no more equipment than is presently in the Computation Center. 
Time delay in the present implementation could be remedied by using 
an on-line system. 

Users often prefer to be given facts themselves and not just citations 
(7). A program that gives biographical facts directly has no connection 
with this scheme or classification system, but the output of this program 
could be used as a tool by a librarian to find the answer for a patron. 

BIBLIOGRAPHIES 

The most obvious area to which the retrieval scheme could be extended 
is that of bibliographies. Like biographies, they are limited in their scopes 
to certain exclusive categories, and they contain the same specific facts 
for each entry. Logical exclusive categories could be: NATIONALITY, 
FORM (with such values as drama, poetry, fiction, maps, etc. ), SUBJECT 
(probably the most frequently used criterion on which to select books 
for a bibliography), and DATE. Since there is no LIVING with which to 
connect DATE, DATE here should probably have not just the most re-
cent relevant date but as many values as necessary. For instance DATE 
40S 50S 60S would apply to an index that began publication in the 1940's 
and is current. Then a request for any of those dates would find it. 

Possible SPECIFICS include number of pages, the cost, or a facsimile 
of the title page. ARRANGEMENT would be needed, being different 
from INDEX in that bibliographies, unlike biographies, cannot be assumed 
to have the same order (alphabetic by subject's name) plus indexes in 
other orders. ARRANGEMENT would list as values all the ways the contents
of the bibliography could be approached: by subject, author, title,
chronology or a combination of these. 

DICTIONARIES 
Dictionaries also lend themselves well to this type of scheme; one exclusive
category, SUBJECT, might even be adequate for dictionaries. Dictionaries'
special subjects could be broken down into FIELD (such as
chemistry or business) and TYPE (such as slang or geography), if neces-
sary. LANGUAGE would be a specific category, since there are no sub-
stitutes for the language required. Other possible SPECIFICS are pronun-
ciation, definition, etymology and illustration. 

ATLASES 
Atlases are also suited to the scheme. Exclusive categories that seem
appropriate are AREA covered, special SUBJECT atlases, and the size of
the SCALE. SCALE should probably act as DATE does in the biographi-
cal program; that is, if a particular scale is requested, that would be 
searched for first and, if no answer is found, a note would be given and 
another search made for any scale. SPECIFICS for atlases could include 
items like topography, rainfall, winds, cities, highways and major products. 

Factual books (those that give the highest mountain, the first four-minute
mile, the January 10th price of U.S. Steel, etc.) do not lend themselves
to the scheme. Because these books are not uniform as to entries
and subject coverage, the list of possible specifics and exclusive categories 
would be extremely long and the number of searches consequently pro-
hibitive. Also, since such books are far fewer in number than biographical 
or bibliographical works, the proper one is easier to find by browsing. 

CONCLUSION 
A scheme for categorizing biographical reference books by their exclusive
and specific categories makes it possible to automatically retrieve
titles of those which would best answer reference questions. When tested 
it was found acceptable, with minor refinements, and it is easily adaptable 
to other reference book forms. Such a system seems a logical direction in 
which to go when automation of actual reference functions is undertaken. 

ACKNOWLEDGMENT 
The project under discussion was undertaken in partial fulfillment of
requirements for the M. A. degree at the University of Chicago's Graduate
Library School. The computer program employed is detailed in the
author's thesis (8). The work was partially completed under the auspices
of AEC Contract No. AT(11-1)614.




REFERENCES 
1. University of Illinois Library School: The Library as a Community
Information Center. Papers presented at an Institute conducted by
the University of Illinois Library School September 29-October 2, 1957
(Champaign, Illinois: University of Illinois Library School, 1959), p. 2.

2. Shera, Jesse: "Automation and the Reference Librarian," RQ, III, 6 
(July 1964), 3-4. 

3. Austin, Charles J.: Medlars 1963-1967 (Bethesda, National Institutes 
of Health, 1968). 

4. Haas, Warren J.: "Statewide and Regional Reference Service," Library
Trends, XII, 3 (January 1964), 407-10.

5. Yngve, Victor: COMIT Programmers' Reference Manual (Cambridge,
Mass.: M. I. T. Press, 1962).

6. Hsu, R. W.: Characteristics of Four List-Processing Languages (U. S. 
Department of Commerce, National Bureau of Standards, Sept. 1963). 

7. Goodwin, Harry B.: "Some Thoughts on Improved Technical Information
Service," Readings in Information Retrieval (New York, Scarecrow
Press, 1964), p. 43.

8. Weil, Cherie B.: Classification and Automatic Retrieval of Biographi-
cal Reference Books (Chicago: University of Chicago Graduate Li-
brary School, 1967).