Microsoft Word - ART4081


Evidence Based Library and Information Practice 2008, 3:4 

 
18 

 
   Evidence Based Library and Information Practice  
 

Article 

 
Measuring the Extent of the Synonym Problem in Full-Text Searching 

 
Jeffrey Beall 

Metadata Librarian 

University of Colorado Denver 

Denver, Colorado, United States of America 

E-mail: jeffrey.beall@ucdenver.edu 

 
Karen Kafadar 

Rudy Professor of Statistics 

College of Arts and Sciences, Indiana University 

Bloomington, Indiana, United States of America 

E-mail: kkafadar@indiana.edu 

 
Received: 03 September 2008    Accepted: 23 October 2008 

 
© 2008 Beall and Kafadar. This is an Open Access article distributed under the terms of the Creative 
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, 

distribution, and reproduction in any medium, provided the original work is properly cited. 

 
Abstract 

  
Objective – This article measures the extent of the synonym problem in full-text 

searching. The synonym problem occurs when a search misses documents 

because the search was based on a synonym and not on a more familiar term.  

 
Methods – We considered a sample of 90 single word synonym pairs and 

searched for each word in the pair, both singly and jointly, in the Yahoo! database. 

We determined the number of web sites that were missed when only one but not 

the other term was included in the search field.  

 
Results – Depending upon how common the usage is of the synonym, the 

percentage of missed web sites can vary from almost 0% to almost 100%. When 

the search uses a very uncommon synonym ("diaconate"), a very high percentage 

of web pages can be missed (95%), versus the search using the more common term 

(only 9% are missed when searching web pages for the term "deacons"). If both 


Evidence Based Library and Information Practice 2008, 3:4 

 
19 

 
terms in a word pair were nearly equal in usage ("cooks" and "chefs"), then a 

search on one term but not the other missed almost half the relevant web pages.  

 
Conclusion – Our results indicate great value for search engines to incorporate 

automatic synonym searching not only for user-specified terms but also for high 

usage synonyms. Moreover, the results demonstrate the value of information 

retrieval systems that use controlled vocabularies and cross references to generate 

search results.  

 
Introduction and Context of the Study 

 
Full-text searching generates results by 

matching a word or words in a search query 

with words in a database. The synonym 

problem in full-text searching occurs when a 

searcher looks for information on a topic 

and enters a search using a single term to 

represent the topic but does not also enter 

any synonyms for that topic. For example, a 

search for information on dentures with 

only the word "dentures" as a search term 

could miss documents that refer to this 

concept by its synonym "false teeth", 

because the two terms have no words in 

common. For most full-text searching, 

“value-added” features such as controlled 

vocabularies and cross references are not 

present. These features serve to retrieve and 

co-locate documents on a given topic in 

search results regardless of the terms used 

in the full text of searched documents.  

 
This article seeks to measure the extent of 

the synonym problem in full-text searching. 

More precisely, this study looks at single 

word pairs of synonyms, and for each term 

measures the proportion of documents that 

are missed when one term is searched, and 

the proportion of  documents that contain 

only the synonym. This study is limited to 

traditional full-text search engines, that is, 

search engines that match words in a search 

query with words in full-text documents 

and return results.  

 
Full-text Search Engines and Synonyms 

 
With the advent of the Internet, full-text 

searching has proliferated, and with it the 

desire to retrieve as much information about 

a topic as possible. A problem that arises 

with such searches is the potential for the 

search to return only a subset of the web 

sites with relevant information because the 

search concept can be referenced by more 

than one term. The concept can be described 

by simple nouns ("false teeth" and 

"dentures"), or by broader terms, such as 

"botany" and "plant science", or "aurora 

borealis" and "northern lights". A search in 

most search engines on the term "botany" 

(or "aurora borealis") may well miss web 

pages that refer to the discipline only as 

"plant science" (or "northern lights").  

 
A few authors have commented on this 

effect. For example, in The Oxford Guide to 

Library Research Mann writes: 

 
When all is said and done, keyword 

searching necessarily entails the 

problem of the unpredictability of 

the many variant ways the same 

subject can be expressed, within a 

single language (“capital 

punishment,” “death penalty”) and 

across multiple languages (“peine 

de mort,” “pena capitale”). And no 

software algorithm will solve this 

problem when it is confined to 

dealing with only the actual words 

that it can retrieve from within the 


Evidence Based Library and Information Practice 2008, 3:4 

 
20 

 
given documents (or citations or 

abstracts) themselves. (102) 

 
Beall refers to this problem as the “synonym 

problem” and states, “In full-text searching, 

synonyms hinder effective information 

retrieval when a searcher enters a term in 

the search box and the system only returns 

results that match the term and does not 

return results that refer to the concept only 

by one of its synonyms” (“Weaknesses” 

439). For example, some use the term 

“botany” and others use “plant science” to 

describe the same concept. A search in most 

search engines on the term “botany” would 

probably miss web pages that refer only to 

the discipline as plant science (Beall “Death” 

6). 

 
Fugmann uses the term “paraphrase 

lexicalization” to describe the disconnect 

between a user's search terms and the terms 

used in relevant documents. He exemplifies 

the synonym problem by giving an example 

of a searcher looking for information on 

insecticides and missing documents that 

refer to them as pesticides.  He states, “…an 

inquirer expects all documents to be 

retrieved in which the concept of the search 

request is dealt with and in fact independent 

of how it happens to have been expressed 

by an author” (223). 

 
Dagan et al. describe the synonym problem 

from an information science perspective. 

Their study “investigates conceptually and 

empirically the novel sense matching task, 

which requires [one] to recognize whether 

the senses of two synonymous words match 

in context” (449). They describe this 

phenomenon  as “lexical substitution”. Their 

study does not measure the synonym 

problem but attempts to lay the 

groundwork for an algorithmic solution to 

it.  

 
While only a few authors have noted the 

synonym problem, even fewer have 

attempted to measure it. The challenges to 

measuring the extent of the synonym 

problem include defining an appropriate 

measure, and designing a study to quantify 

it. To our knowledge, no previous study has 

been conducted to measure the synonym 

problem. This article attempts to fill that 

void. 

 
On its web page, Google describes a 

“synonym search”, but it provides very little 

information about this type of search. On 

one of its help pages Google states, “If you 

want to search not only for your search term 

but also for its synonyms, place the tilde 

sign ("~") immediately in front of your 

search term” (“Web Search Help Center”). 

We suspect that rather few Google users are 

aware of this feature, and even fewer take 

advantage of it. Google offers no further 

explanation of this feature. Slightly more 

information is provided in the patent 

application granted to Google in 2002 and 

issued in 2005 for a process that essentially 

functions as an algorithmic synonym search, 

rather than a deterministic synonym search 

(by matching synonyms from a pre-

constructed list). According to the patent's 

abstract: 

 
Methods and apparatus determine 

equivalent descriptions for an 

information need. In one 

implementation, if adjacent entries 

in a query log contain common 

terms, the uncommon terms are 

identified as a candidate pair. The 

candidate pairs are assigned a score 

based on their frequency of 

occurrence, and pairs having a score 

exceeding a defined threshold are 

determined to be synonyms. (Dean 

et al. 2005) 

  
We assume that the phrase “equivalent 

descriptions” here means "synonyms", but it 

is unclear whether Google has implemented 

the process described in this patent into its 


Evidence Based Library and Information Practice 2008, 3:4 

 
21 

 
current search algorithms. For proprietary 

reasons, search engine companies release 

very little information about the algorithms 

they employ to generate results. Bade says 

“… the exact nature of the formulae used 

remains largely unknown to the public since 

these are valuable intellectual property for 

their owners” (831). 

 
At least one library online catalog product 

offers a synonym search feature. The 

Innovative Interfaces, Inc. online catalog 

allows libraries to program in synonyms. 

Once a synonym pair has been programmed 

into the system, a keyword search on either 

of the two words in the pair returns results 

as if both search terms had been entered. 

This feature is not used so much for 

synonyms as it is for variant spellings, such 

as British and American variants like 

“labor” and “labour”.  

 
Methods 

 
Our original plan was to generate a random 

sample of synonym groups and then to 

search them in both Google and Google 

Book Search. As our source for synonyms, 

we planned to use printed thesauri from the 

reference section in the Auraria Library on 

the campus of the University of Colorado 

Denver. After collecting the data, we 

planned to do a statistical analysis to answer 

our research question.  

 
Difficulties with Synonyms 

 
We soon realized that exact synonyms are 

rare, and words listed as synonyms in 

thesauri are close in meaning but frequently 

are not true synonyms. One example of a 

false pair of synonyms is the pair 

“waterfall” and “cascade”. While close in 

meaning, there is a significant semantic 

difference between these two terms. We 

sought to study synonym groups that were 

as semantically identical as possible. We 

suspected that the use of "non-exact 

synonyms" such as "waterfall" and "cascade" 

would result in even more missed web 

pages, and hence an even more severe 

problem than what we ultimately observed. 

 
Difficulties with Google 

 
Before we began to collect data we 

performed numerous test searches, which 

immediately revealed two significant 

problems for conducting this research with 

Google. The first problem was that the 

Google search software does not allow 

nested Boolean searching. That is, if a term 

contains more than one word, Google will 

not allow a searcher to apply the Boolean 

operator “not” to the phrase. This was a 

significant problem for us, because our 

study objective required us to search for one 

term but not the other. As an example, for 

the synonym pair “leprosy” and “Hansen’s 

disease”, ideally we would perform the 

following search in Google: 

 
Leprosy -“Hansen’s disease” 

 
The minus sign within Google activates the 

Boolean operator “not” in the search, and 

the quotation marks indicate a term to be 

searched as a phrase. Unfortunately, the 

Google search engine lacks the functionality 

to correctly perform this type of search. Our 

test searches showed that when we tried to 

use nested Boolean terms, the phrases we 

attempted to exclude often appeared in the 

pages retrieved by the search. This would 

prevent us from accurately measuring the 

number of resources missed due to the 

synonym problem.  

 
To address the difficulties with synonyms 

from thesauri, we abandoned printed 

thesauri as a source for a random selection 

of synonyms and turned instead to 

controlled vocabularies. Controlled 

vocabularies also are frequently called 

thesauri; they list the preferred term for a 

concept followed by a list of the variant 


Evidence Based Library and Information Practice 2008, 3:4 

 
22 

 
terms or "cross references". The Library of 

Congress Subject Headings is an example of 

a controlled vocabulary, and as one of the 

most comprehensive we selected this 

controlled vocabulary as the source for our 

random selection of synonyms.   

 
At this point we encountered our second 

major problem with Google: an apparent 

inconsistency in the search results in the 

Google database. One of the valuable 

features of the Google database that benefits 

information retrieval research is that results 

of each search include the total number of 

web sites retrieved. However, as we were 

conducting our test searches, we found this 

count to be highly variable. In some cases, 

for example, the same search performed at 

two different times retrieved significantly 

different numbers of "web pages found”. 

We illustrate this problem with a more 

detailed description of our study design. 

 
For our study we required data on the 

number of web pages found from the 

following searches for each word pair, 

expressed in Figure 1 as a Venn diagram. 

 
The Venn diagram in Figure 1 expresses the 

following Boolean logic: 

 
A not B (Represented by only the area in the 

left circle that is shaded green) 

B not A (Represented by only the area in the 

right circle that is shaded blue ) 

A and B (Represented by the blue-green 

shaded area in the center) 

A or B (Represented by the entire shaded 

area of the diagram) 

 
Our study depends critically on the 

numbers of web pages found for each of 

these four searches. One would expect that 

the sum of the numbers of web pages found 

from the first three searches should equal 

the number of web pages found by the 

fourth search. However, in our test searches 

we observed discrepancies between these 

two results, sometimes as large as several 

million. Indeed, the four individual 

numbers from the four searches often varied 

substantially. 

 
We postulated several explanations for the 

wide discrepancy. First, it could arise if 

Google actually applies its patented 

"synonym search" feature described in its 

help pages. Second, if Google's cited 

"number of web pages found" is not a 

deterministic count, but rather is a statistical 

estimate based on the current version of the 

search algorithm being used, then one 

would expect variability in the estimate at 

different times. A third possible explanation 

arises from the fact that "every search in 

Google is part of an experiment" (Pregibon 

and Lambert), so searches of the same query 

at different times may result in different 

algorithms being applied. 

 
Finally, discrepancies could reflect actual 

changes in the number of web pages 

available due to new and deleted web pages 

over time. However, we suspect the 

numbers of new and deleted web pages on 

widely diverse topics would not vary much, 

  
Fig. 1. A Venn diagram that illustrates 

the data gathered for each word pair. 

 
Evidence Based Library and Information Practice 2008, 3:4 

 
23 

 
casting doubt on this fourth possibility as a 

plausible explanation for the extreme 

variability we observed in our searches, 

some of which involved rather obscure 

terms. Slight discrepancies would have been 

tolerable, but for our study such huge 

discrepancies rendered Google searches too 

variable for our purposes. For this reason, 

we turned to alternative search engines. 

 
Revised Methods 

 
Because of the inability of Yahoo! (as well as 

Google) to perform nested Boolean searches, 

we decided to limit this study to only single 

word synonym pairs. (We would like to 

repeat this study on more complicated 

synonym-phrase pairs when a nested 

Boolean search feature is implemented in 

one of the search engines.) We generated a 

random list of synonym word pairs from the 

library catalog at the Auraria Library 

(University of Colorado Denver). Using the 

search functionality in the "staff" mode of 

the library's catalog, we created a list of all 

topical subject authority records that 

contained at least one cross reference. 

Because the Auraria Library serves three 

institutions of higher education, including a 

comprehensive university, the scope of the 

headings in the library is unusually broad. 

Our generated list contained 39,511 records. 

We then used a program available through 

the R Project for Statistical Computing 

<http://www.r-project.org> to generate 100 

random numbers distributed uniformly 

across the range [1, 39511], which identified 

the indices of the 39,511 records selected for 

this study. 

 
As indicated above, we limited our study to 

single word pairs of synonyms, meaning 

that both the heading and the cross 

reference had to be single words. We 

imposed two further conditions on the word 

pairs for this study to avoid the potential for 

the 100 pairs to include geographic- or 

location-specific terms. The two conditions 

thus relate to the structure and composition 

of the Library of Congress Subject Headings 

(LCSH) thesaurus. First, we skipped records 

whose cross references were also cross 

references from another record. Second, we 

insisted on semantically exact word pairs. 

The LCSH does group semantically related 

concepts on a single record. For example, 

the LCSH heading for "mountains" has a 

"see reference" for the word "hills". While 

similar, these two concepts are semantically 

different, even though LCSH groups them 

together on a single subject authority record 

for convenience.  

 
To impose the two further conditions, we 

eliminated a pair and went to the next 

record in the list of 39,511 if the main 

heading / cross reference pair (a) involved 

more than one word for either the main 

heading or the cross reference; (b) contained 

a cross reference that was itself a cross 

reference; or (c) contained terms that were 

not semantically exact. In ten instances, the 

random numbers were so close together that 

no valid single word synonym pair 

appeared between the previous and the next 

randomly-selected record. Thus our final 

sample consisted of ninety pairs of single 

word synonyms. All searches were 

conducted by the first author (Beall). 

 
Results 

 
We applied our study plan to search 100 

(later revised to 90) synonym, single word 

pairs in the Yahoo! database. The word pairs 

and the data are presented in the Appendix. 

The searches were conducted in March and 

April, 2007. When we gathered the data, we 

realized that the searches, like full-text 

searching, would not be perfect. For 

example, one of our synonym pairs was 

biologicals / biologics. It is likely that one of 

the terms is the name of a company, or is a 

word in a foreign language and in many 

contexts is not a synonym of the other, a 

situation that would affect our data. But 


Evidence Based Library and Information Practice 2008, 3:4 

 
24 

 
there were far too many search results to 

examine to determine their context, and the 

type of searching we are studying, full-text 

searching, is also burdened by the same 

problem. We acknowledge this potential 

contamination by proper company names, 

but believe it to be quite small. 

 
The pertinent data from this study are the 

percentages of total references ("A and B") 

found by searching for "A only" (i.e., 

number of pages found in search for word A 

only, divided by the total number of pages 

found in a search for either ("A or B")) and 

likewise searching for “B only”. Usually, 

one of A or B is the more common word, so 

the percentage for one will be higher, often 

substantially higher, than the percentage for 

the other word. Figure 2 displays via 

boxplots the data for the more common of 

the words in the pair (Max(%A,%B)), the 

data for the less common of the words in the 

pair (Min(%A,%B)), and the difference in the 

two percentages (Diff(Max–Min)).  For 

convenience in this article, we will designate 

“A” as the more common word and “B” as 

the less common word (i.e., a search on “A” 

returned more web pages than a search on 

“B”). 

 
Figure 3 displays information similar to the 

third box in Figure 2, but on an item-by-item 

basis. For example, the highest percentage 

among these word pairs occurred for word 

pair #53: “Mitochondria” but not  

“Chondriosomes” found 99.992% of the 

2,000,189 web pages, while 

“Chondriosomes” but not “Mitochondria” 

found only 0.006% of the web pages. The 

designated line in Figure 3 connects these 

two proportions: 0.99992 (left side) and 

0.00006 (right side). From this display it is 

clear that if one succeeds in identifying the 

more common word the search will yield 

most of the references, but if one asks for the 

less common word the search will miss 

almost all the web pages. For about 10-20 of 

the word pairs the words will each find  

 
about half of the available pages. For 

example, in word pair #72 each of the two 

searches, on "Preparedness" only and on 

"Readiness" only, returns about half of the 

total number of web pages, but also will 

completely miss the other half. 

 
If one happens to select the more common 

of the words in the pair, one is often likely to 

capture most of the references (on average, 

about 88% of the references), but in 10 of the 

90 pairs a search for even the more common 

of the words in the pair returned less than 

55% of the available web pages. See Table 1 

for the list of these 10 word pairs.  

 
In a search, the proportion of missed web 

pages depends on whether one searched the 

more common or the less common word in 

the synonym pair. In these 10 word pairs  

 
Fig. 2. Boxplot showing results missed from the 

perspective of the more- and less common word. 

 
Evidence Based Library and Information Practice 2008, 3:4 

 
25 

 
even a search on the more common term 

returned less than 55% of the web pages 

found if both words were used in the search 

(i.e., 45% or more web pages were missed 

when using only one term in the word pair). 

 
How costly can the search be, in terms of 

missed web pages, if one were to search on 

the less common of the two words in the 

pair?  Figure 2 shows that, when the more 

common ("A") of the words is used in the 

search, often one captures 78% or more of 

the total web pages (the lower quartile of the 

percentages of web pages found using "A 

only" is 78%, as demonstrated by the lower 

edge on the left-most box in the boxplot). 

Conversely, if one were so unlucky as to 

have selected the less common word, one is 

likely to capture no more than 20% of the 

web pages (the upper quartile of the 

percentages of web pages found using "B 

only" is 20%, as shown by the upper edge of 

the middle box in the boxplot).  

 
Even when the more common word is 

searched, the left box in Figure 2 shows that 

25% of the searches returned only 50-78% of 

the available web pages. These results 

suggest that the "cost" of web searches for 

information about a topic can be rather high 
if one unfortunately enters the less common  

 
term, which may be frequent depending 

upon one's native language. (For example, 

Australians often use the term "jumper" for 

the American term "sweater", and the British 

use the term "biscuits" for the American 

term "cookies".) The third box in Figure 2 

shows that the difference in "percentage of 

web pages found" can be very large -- often 

as high as 50-95% (lower quartile and upper 

quartile) -- depending on which word was 

selected for the search. Figure 3 shows both 

percentages for each word pair (A, the more 

common, on the left; B, the less common, on 

the right), connected by a dashed line. Often 

one of the words in the word pair is much 

more common than the other word. But for 

about one-fourth of the words in our study, 

both percentages are near 50%: a search for 

one term or the other fails to capture half of 

the web pages, regardless of whether one 

selected the "more" or "less" common word. 

 
Even when the more common word is 

searched, the left box in Figure 2 shows that 

25% of the searches returned only 50-78% of 

the available web pages. These results 

suggest that the "cost" of web searches for 

information about a topic can be rather high 

if one unfortunately enters the less common 

term, which may be frequent depending 

upon one's native language. (For example,  

Table 1  

 A B A (not B) B (not A) A and B A or B 

Prop. 

max 

Prop 

min 

2 Afrocentrism Afrocentricity   57900   63800 1040 128000 49.8     45.2 

20 Cooks Chefs 18500000  23600000 2200000 44200000 53.4     41.9 

24 Discrimination Bias 44400000 35900000 2910000 83200000 53.4     43.1 

26 Egoism            Egocentricity 1150000   10600 980 3000000 38.3      3.5 

27 Electromagnetism Electromagnetics 953000   750000 46500 1760000 54.1     42.6 

50 Marmots   Marmota          325000   299000 8900 645000 50.4     46.4 

69 Picornaviruses Picornaviridae 35400    39200 2360 84200 46.6     42.0 

72 Preparedness Readiness        17200000  16500000 986000 39000000 44.1     42.3 

81 Salafiyah Salafiyya   17500  12100 66 38700 45.2     31.3 

93 Tinsmithing Tinwork          16200    13900 79 41700 38.8     33.3 

99 Waka    Tanka 1530000   1580000 10700   3110000 50.8     49.2 

 
Evidence Based Library and Information Practice 2008, 3:4 

 
26 

 
Australians often use the term "jumper" for 

the American term "sweater", and the British 

use the term "biscuits" for the American 

term "cookies".) The third box in Figure 2 

shows that the difference in "percentage of 

web pages found" can be very large -- often 

as high as 50-95% (lower quartile and upper 

quartile) -- depending on which word was 

selected for the search. Figure 3 shows both 

percentages for each word pair (A, the more 

common, on the left; B, the less common, on 

the right), connected by a dashed line. Often 

one of the words in the word pair is much 

more common than the other word. But for 

about one-fourth of the words in our study,  

 
both percentages are near 50%: a search for 

one term or the other fails to capture half of 

the web pages, regardless of whether one 

selected the "more" or "less" common word. 

 
Discussion 

 
While small in scope, this study 

demonstrates the severity of the synonym 

problem in web searching. Because of 

cultural or sociological differences in terms, 

the use of one term instead of its more 

common counterpart could result in highly 

incomplete web searches, raising only a 

fraction of the available web pages on this 

 
Fig. 3. Display of proportions of web pages found when searching on "More common word" (left 

side) versus "Less common word" (right side). Segments connect proportions. The greatest 

discrepancy occurs with word pair #53, "Mitochondria" (0.99992) versus "Chondriosomes" 

(0.00006). The least discrepancy occurs with word pairs #72 ("Preparedness", 0.496; "Readiness", 

0.476) and #99 ("Tanka", 0.507; "Waka", 0.490). 

 
Evidence Based Library and Information Practice 2008, 3:4 

 
27 

 
topic. For example, our study included the 

word pair “appraisers/assessors”; the 

former term is more common in some 

societies (73%), while the latter is more 

familiar in other contexts (but which 

captures only 26% of the web pages found 

by using both terms). For other word pairs, 

both words are used roughly equally often, 

but not in the same document, and hence a 

search on either word, but not the other, 

misses almost half the web pages found by 

searching on both (e.g., preparedness 49.6%: 

readiness 47.6%). Some search engines (such 

as Google Inc.) appear to offer synonym-

searching capability, and based on our 

study, such a feature would result in more 

complete searches. 

 
This study involves some important 

limitations which we need to acknowledge. 

First, the study is limited in its sample size. 

We selected only 100 pairs and our study 

design yielded data on only 90 word pairs, 

which is not a huge study but is definitely 

large enough to demonstrate the variability  

that can arise with the synonym problem. In 

addition, the uncertainties on the estimates 

of the reported percentages of missed web 

pages with each word in the search pair 

includes the uncertainties in the algorithms 

used by the search engine. One must keep in  

mind the extent that the search engine  

algorithms themselves are based on some 

sort of sampling strategy that returns 

estimates on "approximate number of web  

 
Fig. 4. The fewer the hits in a search, the more precise the estimate of number of web pages found. 

 
Evidence Based Library and Information Practice 2008, 3:4 

 
28 

 
pages found", which then affect our 

reported percentages. Clearly a larger study 

that involves replication is warranted to 

yield better estimates of the variability in the 

percentages reported here. 

 
In a few instances, the data returned were 

illogical, in that the number of web pages 

found from a search of (A or B) exceeded the 

sum of the numbers of web pages found 

from the three searches combined (A not B) 

+ (B not A) + (A and B). Two factors 

probably contributed to such events. First, 

the distributed system architecture can 

change quickly and repeatedly, resulting in  

different values at different times. Second, 

the search engine reports only an estimate, 

not a precise value, of the number of web 

pages found. These estimates are less precise 

for larger numbers of web pages found, as 

illustrated in Figure 4.  

 
Ideally, the difference between  

 
N1 = #web pages found from "(A or B)" and 

N2 = #web pages found from "(A not B) + (B 

not A) + (A and B)" 

 
should be zero. Figure 4 shows the 

logarithm (base 10) of the absolute value of 

this difference, log(|N1-N2|), versus 

log(N2). As the number of web pages found 

increases, the discrepancy between N1 and 

N2 grows, with large N2 (on the order of 

100 million web pages) resulting in 

discrepancies of over 1 million. For the most 

part, though, this figure shows that the 

discrepancy is usually less than 1%, but can 

sometimes be as large as 10%. As search 

engine algorithms improve, we expect fewer 

large discrepancies of this type. 

 
This study attempts to address the extent of 

the synonym problem by comparing the 

numbers of web pages found by only one of 

the two words in a synonym word-pair but 

not the other word. However, a user's main 

interest may be in capturing not the total 

number of web sites for a given concept but 

rather the number of most relevant web sites. 

The results from this type of study would 

indeed be interesting, but we see two 

immediate problems in attempting to 

conduct such a study. First, one would have 

to define what is meant by "most relevant". 

The easiest definition would be "top 25 web 

sites", but some of those "top 25" could be 

duplicates, irrelevant, non-authoritative, or 

paid by advertisers. Moreover, human 

subjectivity would be involved in assessing 

"relevance". At some future time, search 

engines may offer functionalities that would 

reduce the human effort in this time-

intensive, possibly subjective, laborious 

process, and we would consider such a 

study at that point. 

 
Another issue may be whether our study 

measured "semantic exactness" rather than 

"extent of the synonym problem". Our 

criteria for word-pair synonyms in this 

study included one criterion that was aimed 

at achieving a high degree of homogeneity 

in semantic exactness, but this criterion did 

involve some human judgment. The 

confounding of these two concepts, 

"semantic exactness" and "extent of 

synonymy", may be difficult to resolve with 

present technology. 

 
Conclusion 

 
The extent of the synonym problem in full-

text searching depends on whether one 

searches the more common of the 

synonyms. Overall, the measure of what’s 

missed is as high as 30% in a large (90%) 

fraction of common word-pairs. Information 

discovery systems need to take the synonym 

problem into account and develop solutions 

for it, both probabilistic and deterministic. 

This study should be repeated with a wider 

and more systematic variety of synonym 

pairs from defined subject areas; searches 

that include phrases instead of single words 

in the pairs; replication, to determine the 


Evidence Based Library and Information Practice 2008, 3:4 

 
29 

 
variability in the reported percentages; and 

more search engines. The methodology here 

could result in the establishment of a 

benchmark data set against which various 

search engines can evaluate their search 

algorithms in terms of their ability to 

minimize the synonym problem. 

Additionally, the data demonstrate the 

value of vocabulary control and cross 

references in providing more precise search 

results. 

 
Acknowledgements: The authors wish to 

thank the Associate Editor and two 

anonymous referees for their thoughtful 

comments on an earlier version of this 

manuscript. This work was prepared in part 

with support from Army Research Office, 

Grant #W911NF0510490, awarded to the 

University of Colorado Denver (Kafadar). 

 
Works Cited 

 
Bade, David. “Relevance Ranking is Not 

Relevance Ranking or, When the 

User is Not the User, the Search 

Results are Not the Search Results.” 

Online Information Review 31.6 

(2007): 831-44. 

  
Beall, Jeffrey. "The Weaknesses of Full-Text 

Searching." Journal of Academic 

Librarianship 34.5 (2008): 438-44.  

  
---. "The Death of Full-Text Searching." 

PNLA Quarterly 70.2 (2006): 5-6.  

 
Dagan, Ido, Oren Glickman, Alfio Gliozzo, 

Efrat Marmorshtein, and Carlo 

Strapparava. “Direct Word Sense 

Matching for Lexical Substitution.” 

COLING-ACL 2006: Proceedings of 

the 21st International Conference on 

Computational Linguistics and 44th 

Annual Meeting of the ACL, 

Sydney, Australia, 17-21 Jul. 2006: 

449-56. 11 Nov. 2008 

<http://eprints.pascal-

network.org/archive/00002759/01/P0

6-1057.pdf>. 

 
Dean, Jeffrey A., Georges Harik, Benedict 

Gomes, and Noam Shazeer. 

Methods and Apparatus for 

Determining Equivalent 

Descriptions for an Information 

Need. Google Inc., assignee. Patent 

6,941,293. 6 Sep. 2005.  

 
Fugmann, Robert. “The Complementarity of 

Natural and Controlled Languages 

in Indexing,” Subject Indexing: 

Principles and Practices in the 90's : 

Proceedings of the IFLA Satellite 

Meeting held in Lisbon, Portugal, 

17-18 August 1993, and sponsored 

by the IFLA Section on 

Classification and Indexing and the 

Instituto da Biblioteca Nacional e do 

LIVRO, Lisbon, Portugal. Eds. 

Robert P. Holley, et al. Munich: 

Saur, 1995. 215-30. 

 
Mann, Thomas. The Oxford Guide to 

Library Research. 3rd ed. Oxford: 

Oxford UP, 2005. 

 
Pregibon, Daryl, and Diane Lambert. 

"Understanding Online 

Advertisers." Joint Statistical 

Meeting, Denver, CO, USA, 5 

August 2008.  

 
“Web Search Help Center.” Google. 2008. 11 

Nov. 2008 

<http://www.google.com/help/refin

esearch.html>. 

 
Evidence Based Library and Information Practice 2008, 3:4 

 
30 

 
Appendix 

 
Table 2 

The data collected in this study. “A” is designated as the more common word in the 

synonym pair, “B” as the less common word. 

 
Number Terms max min A (not B) B (not A) A and B A or B 

1.  A. Adivasis 

B. Adibasis 0.9979 0.0016 109,000 177 50 110,000 

2.  A. Afrocentricity  

 
B. Afrocentrism  0.5198 0.4717 63,800 57,900 1,040 128,000 

3.  A. Aluminum 

B. Aluminium 0.6850 0.2853 63,400,000 26,400,000 2,750,000 92,500,000 

4.  A. Anomie  

B. Anomy  0.8556 0.1429 443,000 74,000 776 521,000 

5.  A. Appraisers 

B. Assessors 0.7258 0.2597 6,400,000 2,290,000 128,000 8,850,000 

6.  A. Arctiidae 

B. Lithosiidae 0.9971 0.0016 85,200 138 106 85,800 

7.  A. Arthropods  

B. Arthropoda  0.6061 0.2905 1,400,000 671,000 239,000 2,310,000 

8.  A. Berberis  

B. Barberries  0.9407 0.0565 368,000 22,100 1,090 393,000 

9.  A.  

B.        

10.  A. Biologics  

B. Biologicals  0.7582 0.2275 2,290,000 687,000 43,200 3,030,000 

11.  A. Bleaching 

B. Blanching 0.9281 0.0715 3,390,000 261,000 1,510 3,660,000 

12.  A. Buddhists 

B. Lamaists 0.9998 0.0001 3,070,000 386 276 3,090,000 

13.  A.  

B.        

14.  A. Bullying 

B. Bullyism 0.9999 MM 10,100,000 984 160 10,100,000 

15.  A. Cachexia 

B. Cachexy 0.9899 0.0069 245,000 1,720 784 253,000 

16.  A. Cannibalism 

B.Anthropophagy 0.9946 0.0044 2,440,000 10,800 2,330 2,470,000 

17.  A. Catalans 

B. Catalonians 0.9941 0.0056 2,100,000 11,900 643 2,130,000 

18.  A.  

B.        

19.  A. Chimneys 

B. Smokestacks 0.8340 0.1627 2,820,000 550,000 11,200 3,410,000 

20.  A. Chefs 

B. Cooks 0.5327 0.4176 23,600,000 18,500,000 2,200,000 44,200,000 

21.  A. Deacons  

B. Diaconate 0.9085 0.0578 2,720,000 173,000 101,000 3,000,000 

22.  A. Deburring 

B. Burring 0.8388 0.1557 513,000 95,200 3,410 618,000 

23.  A.  

B.        


Evidence Based Library and Information Practice 2008, 3:4 

 
31 

 
24.  A. Discrimination 

B. Bias 0.5336 0.4314 44,400,000 35,900,000 2,910,000 83,200,000 

25.  A. Dreams 

B. Dreaming 0.8367 0.1246 133,000,000 19,800,000 6,160,000 159,000,000 

26.  A. Egoism 

B. Egocentricity 0.9149 0.0843 1,100,000 93,000 978 190,000 

27.  A.Electromagnetism 

B. Electromagnetics 0.5447 0.4287 953,000 750,000 46,500 1,760,000 

28.  A. Embezzlement 

B. Defalcation 0.9738 0.0230 2,460,000 58,200 7,900 2,540,000 

29.  A. Errors 

B. Mistakes 0.7010 0.2570 162,000,000 59,400,000 9,690,000 231,000,000 

30.  A. Eurocentrism 

B. Eurocentricity 0.9806 0.0178 104,000 1,890 171 113,000 

31.  A. Eviction 

B. Dispossession 0.8953 0.1005 4,590,000 515,000 21,800 5,120,000 

32.  A. Extraversion 

B. Extroversion 0.6307 0.3372 432,000 231,000 22,000 684,000 

33.  A. Faience 

B. Fayence 0.7405 0.2580 861,000 300,000 1,790 1,170,000 

34.  A. Fasteners 

B. Fastenings 0.9559 0.0373 12,100,000 472,000 86,800 12,700,000 

35.  A. Fireworks 

B. Pyrotechnics 0.9453 0.0402 32,200,000 1,370,000 495,000 34,200,000 

36.  A. Forearm 

B. Antebrachium 0.9996 0.0003 4,510,000 1,310 578 4,500,000 

37.  A. Formaldehyde 

B. Formalin 0.7469 0.2196 2,500,000 735,000 112,000 3,350,000 

38.  A. Gelatin 

B. Gelatine 0.7867 0.2003 4,360,000 1,110,000 72,400 5,550,000 

39.  A. Greenhouses 

B. Hothouses 0.9841 0.0152 5,090,000 78,600 3,390 5,170,000 

40.  A. Gums 

B. Gingiva 0.9664 0.0263 4,780,000 130,000 36,200 4,940,000 

41.  A. Heme 

B. Hematin 0.9862 
0.0119

  975,000 11,800 1,800 995,000 

42.  A. Hydrogeology 

B. Geohydrology 0.9327 0.0525 917,000 51,600 14,600 978,000 

43.  A. Intellectuals 

B. Intelligentsia 0.8409 0.1303 7,810,000 1,210,000 268,000 8,710,000 

44.  A. Ischemia 

B. Ischaemia 0.8551 0.0967 2,060,000 233,000 116,000 2,380,000 

45.  A. Kayasthas 

B. Kayasths 0.8249 0.1613 1,790 350 30 2,420 

46.  A. Kimchi 

B. Kimchee 0.7670 0.2070 930,000 251,000 31,500 1,220,000 

47.  A. Lakes 

B. Lochs 0.9858 0.0130 73,400,000 731,000 90,200 73,100,000 

48.  A. Larrea 

B. Covillea 0.9985 0.0003 568,000 133 82 558,000 

49.  A. Libertinage  

B. Libertinism  0.8570 0.1421 468,000 77,600 460 563,000 

50.  A. Marmots 

B. Marmota 0.5135 0.4724 325,000 299,000 8,900 645,000 

51.  A. 0.9999 0.0001 80,700 5 1 80,700 


Evidence Based Library and Information Practice 2008, 3:4 

 
32 

 
Mechanoreceptors 

B. 

Mechanicoreceptors 

52.  A. Micropipettes 

B. Micropipets 0.9662 0.0321 72,500 2,410 124 81,300 

53.  A. Mitochondria 

B. Chondriosomes 0.9999 0.0001 2,000,000 144 45 2,000,000 

54.  A. Monazite 

B. Cryptolite 0.9979 0.0018 112,000 207 24 113,000 

55.  A. Mutuality  

B. Mutualism  0.7695 0.2295 798,000 238,000 1,070 1,050,000 

56.  A. Natriuresis 

B. Natruresis 0.9982 0.0012 59,600 71 35 60,000 

57.  A. Norsemen  

B. Northmen  0.7713 0.2206 465,000 133,000 4,860 613,000 

58.  A. Ochre  

B. Ocher  0.8799 0.1151 1,460,000 191,000 8,260 1,660,000 

59.  A. Ointments 

B. Salves 0.6694 0.2958 1,430,000 632,000 74,300 2,140,000 

60.  A. Ontogeny 

B. Ontogenesis 0.8210 0.1428 638,000 111,000 28,100 778,000 

61.  A. Organotherapy 

B. Opotherapy 0.8937 0.066 2,750 203 124 3,420 

62.  A.  

B.        

63.  A.  

B.        

64.  A. Paramecium 

B. Paramaecium 0.9282 0.0712 314,000 24,100 207 339,000 

65.  A. Parsis  

B. Parsees  0.7687 0.2228 177,000 51,300 1,960 237,000 

66.  A. Pediatrics 

B. Paediatrics 0.9002 0.0806 18,200,000 1,630,000 388,000 20,200,000 

67.  A. Perimenopause 

B. Premenopause 0.8164 0.1475 631,000 114,000 27,900 772,000 

68.  A. Photogravure 

B. Heliogravure 0.8237 0.1579 395,000 75,700 8,840 487,000 

69.  A. Picornaviridae  

B. Picornaviruses  0.5094 0.4600 39,200 35,400 2,360 84,200 

70.  A. Pollination 

B. Pollinization 0.9988 0.001 2,080,000 2,070 382 2,100,000 

71.  A. Porpoises 

B. Phocoenidae 0.9720 0.0252 709,000 18,400 2,000 773,000 

72.  A. Preparedness 

B. Readiness 0.4959 0.4757 17,200,000 16,500,000 986,000 39,000,000 

73.  A.  

B.        

74.  A.  

B.        

75.  A. Procellariiformes 

B. Tubinares 0.9599 0.0293 55,700 1,700 627 60,300 

76.  A. Promethium 

B. Illinium 0.9933 0.0053 137,000 735 185 144,000 

77.  A. Radiologists 

B. Roentgenologists 0.9998 0.0001 1,740,000 221 133 1,770,000 


Evidence Based Library and Information Practice 2008, 3:4 

 
33 

 
78.  A. Religiosity  

B. Religiousness  0.8444 0.138 1,040,000 170,000 21,600 1,240,000 

79.  A. Rodents 

B. Rodentia 0.9456 0.0407 6,070,000 261,000 88,000 6,420,000 

80.  A. Sago 

B. Sagu 0.8664 0.1319 1,340,000 204,000 2,690 1,550,000 

81.  A. Salafi ̄yah 

B. Salafiyya 0.5899 0.4079 17,500 12,100 66 38,700 

82.  A. Metalloids  

B. Semimetals  0.6761 0.3187 77,000 36,300 595 116,000 

83.  A.  

B.        

84.  A. Shepherds 

B. Sheepherders 0.9855 0.0143 6,500,000 94,000 1,610 6,600,000 

85.  A.  

B.        

86.  A. Shrews 

B. Soricidae 0.8917 0.0895 485,000 48,700 10,200 551,000 

87.  A. Skunks 

B. Mephitidae 0.9983 0.0008 1,270,000 980 1,210 1,280,000 

88.  A. Slavists 

B. Slavicists 0.9725 0.0253 27,100 704 61 30,400 

89.  A. Somite 

B. Metamere 0.9646 0.0317 76,400 2,510 294 85,300 

90.  A. Spires 

B. Steeples 0.8532 0.1362 2,950,000 471,000 36,400 3,460,000 

91.  A. Stigmata  

B. Stigmatization  0.7671 0.2323 1,420,000 430,000 1,080 1,860,000 

92.  A. Summer 

B. Summertime 0.9779 0.0135 396,000,000 5,480,000 3,470,000 403,000,000 

93.  A. Tinsmithing 

B. Tinwork 0.5368 0.4606 16,200 13,900 79 41,700 

94.  A. Trilobites 

B. Trilobita 0.9052 0.0841 339,000 31,500 3,990 384,000 

95.  A. Urea 

B. Carbamide 0.9579 0.0356 3,790,000 141,000 25,500 3,970,000 

96.  A. Vietnamese 

B. Annamese 0.9997 0.0002 45,000,000 10,100 8,800 45,200,000 

97.  A. Violin 

B. Fiddle 0.6499 0.2931 20,400,000 9,200,000 1,790,000 31,500,000 

98.  A. Virilization  

B. Virilism  0.8697 0.1218 67,500 9,450 665 82,800 

99.  A. Tanka 

B. Waka 0.5063 0.4903 1,580,000 1,530,000 10,700 3,110,000 

100.  A. Wrasses 

B. Labridae 0.6217 0.3072 118,000 58,300 13,500 196,000