http://www.sajim.co.za/student21.4nr3.asp?print=1


  Student Work Vol.4(3) September 2002

Fee vs free: order vs chaos?  
Merle Ruff 
(Anglo American Information Centre, South Africa) 
Post Graduate Diploma in Information Management 
RAU University 
mruff@angloamerican.co.za  

Contents 

1. Introduction  
2. Comparison between commercial on-line databases and the World-Wide Web  
3. Evolution of the Web: directories, search engines, meta search engines and development of search techniques  
4. Developing a search strategy for the Internet  
5. Organizing chaos: development of structure and standardization on the Web  
6. Conclusions  
7. References  

1 Introduction 

Information retrieval via the World-Wide Web represents a complete reversal of the approach 
taken by traditional fee-based commercial databases. In the fee-based environment, a single search 
strategy can be used to simultaneously search many different sources. Currently, it is 
impossible to use a standard search strategy when searching the Web. The information on 
the Web is stored in digital format on Web servers all over the world. In its unstructured 
state it would seem as if the originators of Web content do not want the information to be 
found! 

Industry vendors and Web developers have attempted to improve the retrievability of Web 
documents. These attempts have included the development and constant refinement of powerful 
search engines and directories. In addition, advanced search tools which track information 
stored in Web formats have been developed to assist the serious searcher. While these 
features improve information retrieval, the results still do not compare with the degree of 
precision and relevancy that has become the trademark of the fee-based database clusters. 

More recently, attempts to improve the situation have focused on ways to organize the 
Web. These include methods of structuring and breaking down the different elements of 
digital information stored on a Web page. Portalization is also viewed as a mechanism to 
structure access to Web content. Coupled with this approach are the current initiatives to 
standardize the manner in which Web documents are stored. The HTML standard for 
publishing Web content was the first attempt towards standardization. This has been 
enhanced by recent developments in the XML language and metadata standards to organize 
Web content.  

This article focuses on the following problem: 

What are the major differences between searching on commercial databases versus 
the Internet?  

The sub-problems include the following: 

What are the problems with information retrieval using free search tools?  
What steps need to be adopted to alleviate these problems in the future?  
Are we moving towards a more 'structured' Web?  

2 Comparison between commercial on-line databases and the World-Wide Web 

The appearance of the first fee-based on-line commercial databases in the 1970s was in 
itself revolutionary. It provided searchers with on-line access to the world's published 
literature. This development has enabled searchers at remote locations to access reliable 
information sources stored anywhere in the world. 

The Web consists of Web pages in digital format found on Web servers all over the world 
connected together by the Internet. Like the appearance of the on-line database aggregates, 
the Internet, which became increasingly popular during the 1990s, has resulted in an 
information revolution: searching for information will never be the same again! 

The on-line database aggregates provide information access to a privileged group of 
individuals or professionals belonging to organizations that can afford to pay for 
information. Using these databases requires training and experience. In contrast, 
information on the Web is available to anyone who has access to a personal computer with 
a modem and a telephone connection. Searching on the Web 'appears' to be very simple and 
user-friendly. 

The information stored on commercial databases contains bibliographical references to 
printed information stored in various locations. Although some commercial databases 
contain information in full-text formats, the information is still stored in a text-based 
format. 

What implications do these differences have for techniques for searching on-line databases 
vs free Web-based search engines? What are the differences in the nature of information 
sources (content, structure or size) found on the on-line databases vs the Web? 

2.1 Search techniques for retrieval: commercial on-line databases vs search engines 

Table 1 summarizes this author's findings with regard to the major differences between 
search techniques using traditional databases vs the Web. 

Table 1 Comparison of search techniques 

                    | Traditional on-line and CD-ROM databases | World-Wide Web
Database            | Structured and authoritative | Unstructured and no validation
Indexing            | Controlled vocabulary. Manual indexing | Largely automatic indexing (except for directories). Robots have different strategies for indexing
Fields for limiting | Author, title, descriptor, document type, date | URL, title, header, date, weighted words
Abstracting         | Clear and concise summary by humans | Automatically generated. Often not clear or adequate summary
Search              | Query-based search | Mouse-based. Follows hyperlinks. Cannot repeat or formalize search
Evaluation criteria | Relevance, precision and recall | Difficult to apply traditional measures. Other criteria: speed, relevance ranking of results, validity of links and lack of dead links, Web interface

2.1.1 Structured vs unstructured 

In contrast to on-line databases, CD-ROMs and even traditional libraries, the Web is a place 
where information is unstructured as well as unfiltered. On-line databases contain 
references to published material that has undergone strict review or has been published in 
reputable periodicals and newspapers. Although the Web contains a wealth of valuable 
material, it also contains much unsolicited material and what Bates (1999:xxiv) refers to as 
'dreck'. The 'noise' that one encounters searching the Web is far greater than that 
found on the commercial databases. Therefore the Web searcher must be much more 
vigilant in limiting numbers of results by using the filters listed in the above table. Most 
users are unaware of these filters and give up in frustration when presented with a hefty hit 
list. 

Search engines have at least 'borrowed' Boolean logic, which has been used by on-line 
database searchers for the last 30 years. While the Boolean 'and', 'or' and 'not' can be used in 
search engines, most search engines do not support proximity searching. 
AltaVista is the only search engine that offers this facility. The biggest problem with search 
engines is that although they use Boolean logic there is no single standard for the way in 
which you enter your Boolean command. By contrast, a commercial database host has one 
standard, with standard terms and punctuation, for entering a Boolean search. 

2.1.2 Precision and recall 

In 1975, Saracevic referred to the two most important criteria for evaluating search results 
(Dong and Su 1997:78), namely precision and recall. Recall measures the number of 
relevant documents retrieved out of the total number of relevant documents indexed in the 
system. In traditional databases, recall represents the completeness of a search. Every 
document is indexed using controlled vocabulary and all documents are searchable. 
However, the number of Web pages that could be indexed is vastly larger than the 
contents of commercial databases. Search engines are unable to index or retrieve all the 
potentially available information.

According to Leighton, 'the Web is as large and unstructured as one is likely to find, so 
recall is meaningless' (Dong and Su 1997:78). The only criterion for evaluating Web search 
results is user satisfaction with the completeness of a search. Dong and Su (1997:75) also 
refer to the fact that 'rapidly increasing Web size and the limited coverage of search engine 
databases have made recall a difficult measure to apply'. 

Precision is used to measure the number of relevant documents retrieved out of the total 
number of documents retrieved from traditional on-line databases. Precision ratios have 
been used in evaluating search engine retrieval. Dong and Su (1997:78) claim that 'the 
output relevance to a user's query is a very important indicator for judging an engine's 
quality and intelligence'. The problem arises when this measurement is applied. How is 
precision evaluated? Is it based on the top 10 hits or the top 20 hits? A further problem is 
that ranking algorithms differ from one search engine to the next. How does a search 
engine's ranking, which influences the order in which the hits are displayed and hence 
selected by the user, compare with the user's ranking of search results? 
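
The two measures defined in this section, and the 'top 10 or top 20' question, can be made concrete with a minimal sketch. The document identifiers and the relevance judgements below are invented for illustration, and the function names are assumptions of this example rather than part of any search tool.

```python
def precision(retrieved, relevant):
    """Relevant documents retrieved / total documents retrieved."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)


def recall(retrieved, relevant):
    """Relevant documents retrieved / total relevant documents in the system."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved) & set(relevant))
    return hits / len(relevant)


def precision_at_k(ranked, relevant, k):
    """Precision measured over the first k hits of a ranked list only."""
    return precision(ranked[:k], relevant)


# Hypothetical ranked hit list and a hypothetical set of relevant documents.
ranked_hits = ["d1", "d7", "d3", "d9", "d4"]
relevant_docs = {"d1", "d3", "d4", "d8"}

print(precision(ranked_hits, relevant_docs))
print(recall(ranked_hits, relevant_docs))
print(precision_at_k(ranked_hits, relevant_docs, 2))
```

Recall needs the full set of relevant documents, which is exactly what cannot be known for the Web. Precision at a cut-off needs only the hits actually inspected, but its value changes with the cut-off chosen and with each engine's ranking.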

There is a greater degree of certainty as to the relevance and completeness of a search 
performed on a commercial database. According to Bates (1999:xix) the arrival of the Web 
has made business research both more complex and easier. On the Web, one is 
guaranteed of being able to get hold of, for example, annual reports of companies or 
statistics, but it is 'harder to know when you have conducted a reasonably thorough search'. 

2.2 Structure of on-line bibliographic information sources vs digital Web-based 
information sources 

2.2.1 Database structure 

Commercial databases consist of defined fields with standardized indexing and retrieval 
mechanisms. Humans program the type of indexing to be applied to each field. On the Web, 
by contrast, search engines index the words in a document and not its subjects. Although 
subject directories are categorized using a method of classification, each subject directory 
uses its own unique classification system. 

2.2.2 Size and coverage 

How much of the Web is covered by search engines? It is estimated that there are 500 
billion pages on the Internet. Google covers 1 billion pages, therefore representing less than 
1% of the Internet. Commercial database hosts, for example Dialog, cover over 570 
databases. When a search is performed on Dialog, the researcher is searching the database's 
entire coverage. The question here is whether database size or accuracy of retrieval of the 
total coverage is more relevant. The answer must be the latter, since precision and 
relevancy are the aim of the serious researcher. 

In contrast to the commercial databases, the Web contains a lot of 'dreck' (Bates 1999:xxiv). 
Material on database hosts consists of carefully selected material published by scholarly 
publications. Lebedev concluded that search engines were no good at finding scientific 
information. In 1996 the Internet had only 10 to 20% of the documents he could find on 
INSPEC (Dong and Su 1997:80). 

2.2.3 Invisible Web 

Search engines miss a tremendous amount of information stored on the Web. This is because 
search engines are 'barred' from retrieving relevant information contained within databases 
on the Web, paid-for information, and video and audio material. There are technical reasons 
for this, particularly the fact that search engines cannot process non-text information such 
as sound and images, or information stored in Web-accessible databases. The information that 
search engines are prevented from seeing is referred to as the 'invisible Web'. In section 4, 
strategies for finding information on the invisible Web are discussed. 

2.2.4 Information content 

There is a 'blurring' of commercial databases and Web search engines on the Web. Web 
sites can consist of Web-enabled versions of previously published books or CD-ROM 
databases. The commercial database hosts have made their databases searchable on the 
Web. In this instance, the researcher is benefiting from the best of both worlds. The 
accuracy of on-line database indexing is linked with the user friendliness of mouse (point 
and click) interfaces. The horrors of modem searching are a thing of the past. The search 
statement remains unchanged. However, Web technology is used to search on-line database 
content in a considerably more user-friendly manner. 

3 Evolution of the Web: directories, search engines, meta search engines and 
development of search techniques 

Searching on the Web has evolved along with the enhancement of Web search technology. 
This section discusses the evolution of Web directories, search engines, meta search 
engines and intelligent agents. 

The first generation search engines created indexes by automatic 'spidering' of Web sites 
and analysing location and frequency of words. These search engines match words against a 
search statement without considering how the pages interrelate, that is, the context and 
syntax. Web directories continue to create their indexes manually. The third way of 
retrieval matches a search statement with the location and frequency of the words. The 
search engine then rates the relevancy of the results based on the Web sites that have been 
most used. 
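
As a minimal sketch of the first-generation approach described above, the following builds a word index over a few invented pages and ranks matches by the frequency and location of the query words. The scoring formula (frequency plus a bonus for an early first occurrence) is an assumption made for illustration and is not taken from any particular search engine.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to {page_id: [positions of the word on that page]}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(page_id, []).append(position)
    return index


def search(index, query):
    """Rank pages by word frequency, with a bonus for early occurrences."""
    scores = defaultdict(float)
    for word in query.lower().split():
        for page_id, positions in index.get(word, {}).items():
            frequency = len(positions)
            location_bonus = 1.0 / (1 + positions[0])  # earlier is better
            scores[page_id] += frequency + location_bonus
    # Highest score first; ties are broken by page identifier.
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


# Invented pages standing in for 'spidered' Web documents.
pages = {
    "a": "gold mining in south africa gold prices",
    "b": "platinum mining news",
    "c": "history of gold",
}
index = build_index(pages)
print(search(index, "gold mining"))
```

Only the words themselves are matched: nothing in the index records what a page is about or how pages relate to one another, which is the limitation the later techniques try to address.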

Natural language searching was developed to overcome the problem of search engines' lack 
of consideration of 'the syntactical relationships between search terms and other vocabulary 
within their indexes' (Green 2000:128). Ask Jeeves, launched in June 1998, was the first of 
the natural language search agents. Ask Jeeves' search engine matches the user's query 
against a database of seven million templates of questions. If no match is found, it presents 
the nearest alternatives. Thereafter, the user can select the option on Ask Jeeves to conduct 
a meta search across AltaVista, InfoSeek, Lycos and Yahoo. 
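
The template idea can be sketched as follows. The source does not describe how Ask Jeeves compares a query with its templates, so the word-overlap measure, the three templates and the page names below are all assumptions made purely for illustration.

```python
# Hypothetical question templates mapped to hypothetical answer pages.
TEMPLATES = {
    "what is the price of gold": "answer-page-1",
    "what is the price of platinum": "answer-page-2",
    "where is gold mined": "answer-page-3",
}


def words(question):
    """Lower-case the question and split it into a set of words."""
    return set(question.lower().replace("?", " ").split())


def answer(question, templates, max_alternatives=2):
    """Return an exact template match, or else the nearest alternatives."""
    query_words = words(question)
    for template, page in templates.items():
        if words(template) == query_words:  # same words, in any order
            return {"match": page, "alternatives": []}

    def overlap(template):
        template_words = words(template)
        shared = len(template_words & query_words)
        return shared / len(template_words | query_words)

    ranked = sorted(templates, key=lambda t: (-overlap(t), t))
    nearest = [t for t in ranked if overlap(t) > 0][:max_alternatives]
    return {"match": None, "alternatives": nearest}


print(answer("What is the price of gold?", TEMPLATES))
print(answer("What is the gold price today?", TEMPLATES))
```

The second query matches no template exactly, so the sketch falls back to the templates sharing the most words with it, in the spirit of the 'nearest alternatives' described above.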

Another development is 'links-based' analysis. Google is a prime example of a search 
engine that uses this technique. The Google search engine matches a search statement 
against the '1-billion or so hyperlinks that weave the Web together' (Green 2000:128). The 
results are then ranked in importance depending on how many other sites link to them. The 
theory is that if a Web author has included links to other sites that are considered important, 
then some form of editorial judgement has been exercised. Furthermore the Google search 
engine also processes the text around the hyperlink and therefore claims it can analyse far 
more Web sites than humans who build subject directories. According to Green (2000:129), 
'in fact unlike search engines that become less useful the larger the index, Google claims to 
return even better results with a bigger index'. 
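
The ranking principle described in this paragraph, ordering matches by how many other sites link to them, can be sketched in a few lines. This is a plain count of inbound links over an invented link graph; it is not Google's actual algorithm, and the site names are placeholders.

```python
from collections import Counter


def inbound_link_counts(link_graph):
    """Count, for every page, how many other pages link to it."""
    counts = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):  # count each linking page once
            if target != source:     # ignore links from a page to itself
                counts[target] += 1
    return counts


def rank_matches(matches, link_graph):
    """Order matching pages by inbound links, most linked-to first."""
    counts = inbound_link_counts(link_graph)
    return sorted(matches, key=lambda page: (-counts[page], page))


# Invented link graph: each page maps to the pages it links to.
link_graph = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example", "b.example"],
}
print(rank_matches(["a.example", "b.example", "d.example", "c.example"], link_graph))
```

Each link is treated as one vote of editorial judgement by the linking author, so a page that nobody links to sinks to the bottom of the list however often it mentions the search terms.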

Google estimates that through this method of links analysis it can reach 300 million Web 
pages. Green (2000:129) claims that 'Google's combination of extensive reach and greater 
accuracy of results is its advantage'. 

As search engines become more sophisticated, attempts have also been made to create 
specialized and authoritative collections of Web content. These attempts are exemplified by 
the appearance of hubs, newsgroups, subject specific directories and intelligent agents. 

Hubs are Web pages that guide the user to a list of authoritative sources, for example 
Focused Crawler. Focused Crawler uses a 'classifier' that evaluates Web page relevance and 
a 'distiller' that identifies relevant hypertext nodes that point to relevant pages within a 
minimum number of links. Newsgroups are the result of individuals or experts in a field 
sharing their knowledge and opinions in specific subject areas. Specialized newsgroup 
search engines, for example Deja News, are important to searchers needing to seek out 
experts to solve specific problems. Subject-specific directories are Web-enabled versions of 
commercial databases. 

Intelligent agents are sophisticated retrieval mechanisms that shift the power away from 
Web servers and on to the desktop, leading to greater search and retrieval capability. Agents 
can search across a wide range of document types and formats. They act as 'true 
infomediaries' (Green 2000:131). Copernic translates a single search statement for different 
search engines and simultaneously submits the search statement to the search engines, Web 
directories and databases. 
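
The translation step that such an agent performs can be sketched as below. The two engines and their query syntaxes are hypothetical, chosen only to show that one search statement must be rewritten for each target; the sketch also submits the queries one after another rather than simultaneously, and the 'engines' are stand-in functions rather than real services.

```python
# Hypothetical query syntaxes for two imaginary search engines.
SYNTAX = {
    "engine_a": {"require": "+{term}", "exclude": "-{term}", "joiner": " "},
    "engine_b": {"require": "{term}", "exclude": "NOT {term}", "joiner": " AND "},
}


def translate(required, excluded, engine):
    """Rewrite one search statement in the syntax of the given engine."""
    rules = SYNTAX[engine]
    parts = [rules["require"].format(term=term) for term in required]
    parts += [rules["exclude"].format(term=term) for term in excluded]
    return rules["joiner"].join(parts)


def meta_search(required, excluded, engines):
    """Submit the translated query to each engine and merge the hit lists.

    `engines` maps an engine name to a function that takes a query string
    and returns a list of URLs. Duplicate URLs are kept only once.
    """
    merged, seen = [], set()
    for name, submit in engines.items():
        for url in submit(translate(required, excluded, name)):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


print(translate(["gold", "mining"], ["coal"], "engine_a"))
print(translate(["gold", "mining"], ["coal"], "engine_b"))
```

Because the searcher states the query once and the agent handles each engine's conventions, the lack of a single standard for entering Boolean commands is hidden from the user, although not removed.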

4 Developing a search strategy for the Internet 

The above discussion has highlighted some facts about Internet searching. Given our 
understanding of the nature of digital information stored on the Web, what should we take 
into consideration when developing search strategies for information retrieval on the Web? 
With the advent of the Web, searchers are confronted with the choice of using the free or 
fee-based information sources. There are circumstances when the world of published 
information on structured on-line database hosts will satisfy a query. However there are 
situations when Web-based digital sources will provide the 'best' information. 

The Internet has led to an increase in the choice of sources that provide information. On the 
one hand finding information on the Internet is easy and fast. For example, if one needs 
current weather conditions for anywhere in the world it is easy enough to find. However, if 
one wishes to research a particular subject, unless one knows which search engine to use, or 
knows the URL of a specific Web site, one will feel overloaded and frantic at having to 
browse through hundreds of listings trying to find useful (relevant) information. 

4.1 Why develop a search strategy? 

Owing to the shifting landscape of information products and services, one should adopt 
guidelines on when the Web is the best source of information. The fact that searching the 
Internet comes with its frustrations does not detract from the fact that there is a lot of 
good information available on the Web if the tools are used with understanding. It is a well-
known principle that search strategies are essential in both the fee-based and the free Web-
based environment. This means analysing the aspects of a query, identifying the best 
sources and formulating an effective search strategy. If the Web is the 'best' source, then 
search tools whether they are search engines, subject directories or meta search engines 
need to be selected according to the query. 

4.2 What goes into a search strategy for the Web? 

Calof (2002:7) has developed an approach for searching the Web, which he refers to as the 
'Searching Smarter' approach. Table 2 was compiled by this author to illustrate Calof's 
seven stages of a search strategy developed for the Web. 

Table 2 'Searching Smarter' search strategy 

Stage | Step | Description
S | Specify information needs | This refers to first establishing and analysing what it is you are looking for, what information specialists refer to as the 'reference interview'
M | Match information sources to information needs | Decide which is the 'best' source for retrieving the information you need. If the Web is the 'best' source, then proceed to the next level
A | Assess Internet approaches | Once the decision is to use the Web, one can choose whether to use a directory, search engine or meta search engine. Decide this based on what the tool accesses, and how it works
R | Recognize search engine differences | This requires gathering information on how to enter search strategies on different search engines etc.
T | Think search statements and strategy | This means asking the right question: formulating your strategy with the words that may be used in a Web document
E | Execute the search | Formulate your search statement, using all the operators, limiters and syntax offered by the search tool
R | Refine the search | Searching on the Web is an iterative process. Refine the results until you get closer to what you are looking for

Other approaches to searching the Web recommend using a query-based approach. 
Emphasis is placed on knowing what you are looking for and matching the information 
need to the appropriate Web tool that will 'best' retrieve the information. 

4.3 Effective Web searching: tips of 'super searchers' 

A growing number of researchers have become 'super searchers' on the Web. This section 
focuses on their knowledge of Web sources and their tips for getting 'better' results from 
Web searching. 

4.3.1 When to use the Web as the 'best' source of information 

Muchin (1999:152) says that one should 'become a free thinker'. He is referring to the fact 
that one can often find answers to questions quickly and inexpensively using the tools on 
the Web. The Web is an excellent source for: 

Free news sources and on-line versions of major newspapers 

Business and corporate information 

On-line shopping 

Reference sources on the Web: on-line dictionaries (www.onelook.com) and links to 
common encyclopaedias (www.mygo.com, which has links to Britannica and Encarta) 

Telephone directories 

Government documents on government Web sites. 

4.3.2 Invisible Web 

Price and Sherman (2001:32) define the invisible Web as content that is 'largely hidden 
from search engines'. There are many professional searchers who have perfected techniques 
to find this content. To get to Web sites that search engines cannot access, they listen and 
follow the literature to identify new sites regularly. To get to the home pages of these 
databases, Price and Sherman suggest that one include the term 'database' in the search 
statements. Yahoo directories are full of links to invisible Web sources. Price and Sherman 
insist that relevant Web sites should be bookmarked as soon as they are identified. They conclude that 
'once you've built your own virtual reference collection, finding what you're looking for on 
the Invisible Web will be as easy as using a search engine to navigate the visible 
Web' (Price and Sherman 2001:34). 

There are also 'specialized search engines' that access the invisible Web (Calof 2002:29). 
Calof lists over 20 search engines that access some parts of the invisible Web. Some 
examples are annotated subject directories, such as the Librarians' Index to the Internet 
(www.lii.org) and Google's search engine for discussion groups (groups.google.com). Calof 
recommends the Sherman and Price directory. 

4.3.3 Challenges of Web searching 

In her interview with Bates (1999:39), Linda Cooper commented that with Web 
searching, 'after you think you've come full circle …a whole new world opens up'. Cooper 
says that the only way to know when to stop searching on the Web is to refer back to the 
reference interview and give the client what has been asked for. 

5 Organizing chaos: development of structure and standardization on the Web 

Individuals and businesses want to avoid searching multiple sources of information using 
different search tools. There is a growing need to access a single source that 
provides all the relevant information. This has led to the idea of creating a 'one stop shop' 
where information can be collected and accessed from a single point. The information 
technology industry is starting to address this requirement with their development of 
Internet portals. 

In 2002, the Web is still unstructured and disorganized, unlike the fee-based environment. 
Users are desperate for more control and standardization. Until recently HTML was the 
only Web standard. However HTML only describes the structure, design and layout of Web 
pages. It is being replaced by Extensible Markup Language (XML). XML describes the 
actual content of the information. What XML brings to the Web is 'powerful structured 
searching akin to the database field searching, but on a textual Web page' (Green 
2000:132). In the future we will see search engine interfaces offering the option to search 
under keywords or tag searching (on the XML tags). Are we coming full circle to the level 
of precision offered by the commercial database hosts? 
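
The difference between keyword searching and tag searching can be illustrated with a small sketch. The two XML documents and their tag names (`author`, `title`, `body`) are invented for this example and do not follow any particular metadata standard.

```python
import xml.etree.ElementTree as ET

# Two invented documents marked up with hypothetical content tags.
DOCUMENTS = [
    "<article><author>Green</author><title>Web searching</title>"
    "<body>Bates is quoted on business research.</body></article>",
    "<article><author>Bates</author><title>Super searchers</title>"
    "<body>Interviews with business researchers.</body></article>",
]


def keyword_search(documents, word):
    """Match the word anywhere in the document text."""
    word = word.lower()
    hits = []
    for number, xml_text in enumerate(documents):
        full_text = " ".join(ET.fromstring(xml_text).itertext()).lower()
        if word in full_text:
            hits.append(number)
    return hits


def tag_search(documents, tag, word):
    """Match the word only inside elements with the given tag."""
    word = word.lower()
    hits = []
    for number, xml_text in enumerate(documents):
        root = ET.fromstring(xml_text)
        if any(word in (element.text or "").lower() for element in root.iter(tag)):
            hits.append(number)
    return hits


print(keyword_search(DOCUMENTS, "Bates"))
print(tag_search(DOCUMENTS, "author", "Bates"))
```

The keyword search returns both documents because the name appears somewhere in each, whereas the tag search returns only the document in which Bates is the author. This is the field-level precision that database searchers already take for granted.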


6 Conclusions 

This article highlights the major differences between searching on commercial databases 
versus the Web. It has been shown that the on-line fee-based database hosts store structured 
information in highly organized databases. This results in high precision and recall and 
authentic information. The free Web on the other hand still consists of unstructured 
information, precision is low in comparison and information cannot always be 
authenticated. 

The problem with searching the Web therefore is two-fold: firstly the data are unstructured 
and secondly searching results in information overload (lack of precision). The major 
search engines are addressing the unstructured Web by introducing standards such as 
metadata and XML. Search engines 'have borrowed techniques (best match searching, relevance 
ranking etc.) from information retrieval research while the subject directories nod towards 
classification theory' (Poulter 1997:142). 

Finally the author poses the question regarding the free Web environment: Are we moving 
towards a more structured Web? The answer must be 'yes' although there is still a long way 
to go. At the rate that information technology responds to the changing needs of its 
industry, can we hope to reach this goal within half a decade? In the words of Green 
(2000:134): 'Now emerging from its nascent stages, the Web may evolve into a highly 
organized, vastly diverse and complicated system'. 

7 References 

Bates, M.E. 1999. Super searchers do business: the online secrets of top business 
researchers. Medford, NJ: CyberAge Books. 

Calof, J. 2002. Searching Smarter: intelligence and the Net. [Unpublished seminar notes 
presented at author's workshop at RAU on 26 April 2002]. 

Cooper, L. 1999. Linda Cooper: Independent info pro and end user. In Super searchers do 
business: the online secrets of top business researchers. Medford, NJ: CyberAge 
Books:31–51. 

Dong, Z. and Su, L.T. 1997. Search engines on the World-Wide Web and information 
retrieval from the Internet: a review and evaluation. Online and CD-ROM review 21(2):67–81. 

Green, D. 2000. The evolution of Web searching. Online information review 24(2):124–137. 

Green, D. 2001. Infinitely extensible markup language? Information world review 
(December):34. 

Muchin, J.A. 1999. Free thinking: using free sites on the Internet to save time and money. 
The Bottom Line: Managing Library Finances 12(4):150–152. 

Poulter, A. 1997. The design of World-Wide Web search engines: a critical review. 
Program 31(2):131–145. 

Price, G. and Sherman, C. 2001. Exploring the invisible Web: seven essential strategies. 
Online 25(4):32–34. 

  

ISSN 1560-683X

Published by InterWord Communications for the Centre for Research in Web-based Applications,
Rand Afrikaans University