http://www.sajim.co.za/vol2.nr1.01_07_2000/student3.asp?print=1

Student Work, Vol. 2(1), June 2000

The Future of Web Searching

Annette van der Merwe
Postgraduate Diploma in Information Management
Rand Afrikaans University
annettevdm@yahoo.com

Contents
Introduction
Terminology
How do search engines work?
Multiple search engines
Search engine shortcomings
Emerging trends
Conclusion
References

Introduction

Search engines gather and organize information to match keywords or phrases provided by the user. These matches are presented to the user as lists, ranked according to what the search engine believes is most relevant. This is helpful and often very useful. However, as the flow of information on the Internet increases, so does 'noise' in the system, and 'the truth' becomes more and more elusive. The user is either presented with far too much information and suffers from information overload, or the information the search engine returns is of little use or irrelevant. It is necessary to clarify why search engines provide too much information, or information that is less useful: is it because they do not always understand the request, or because they cannot sort the available 'matches' specifically enough? Is it the Web that is not 'search-engine-friendly' enough? Or is the user not specific or clear enough about the request? The answer is not a simple one. Three components, namely the user, the Internet and search engines, should be examined to determine their influence on the future of Web searching. Search engines deal only with content, not with the context or meaning of the query. They retrieve blindly, without making any judgement about the relevance of the documents or the quality of the links they contain. The Internet, on the other hand, is not indexed in such a way that search engines can read and retrieve everything that is available.
Lastly, the user considers the Net both an information source and a unique means for connectedness and communication, and therefore uses it for different reasons. The tools we will need in future to assimilate information into useful knowledge will have to reflect all these requirements. Both user requirements and technological development exert influence in shaping the future of information retrieval on the Web. For search engines to make judgements, choices and recommendations, they have to become more like humans. They presently stand ready with power and speed to serve us, but they lack intelligence. Until computers can comprehend language and hold their own in a conversation, there will be a gap in their capabilities. A lack of structure surrounding the Internet, however, exacerbates the problem. Human-powered indexing methods, that is, the classification and description of documents by topical experts, are perhaps needed. This research suggests that a combination of human and machine qualities might be the answer in providing an effective search tool for the future.

Terminology

1. The Internet and World Wide Web (WWW)

The Internet (referred to hereafter as the Net) is an information system composed of a massive network of computers around the world (Griffin, 1999:1). An easy way to visualize the Net is to view it simply as a permanent physical connection of tens of thousands of computer networks scattered around the world. Every individual computer on any of these networks can, if it has permission, communicate with any other computer. This ability to communicate allows all the connected computers to share a vast resource of information. This information may have academic, commercial or general-interest content and be available for unrestricted (free) viewing, or it could involve the transfer of files between computers in a more controlled way (access by paid subscription). There are many different components to the Net.
The World Wide Web (WWW), referred to hereafter as the Web and often confused with the Net itself, is actually just another protocol for navigating the Net (Leigh & Kelmer, 1996:10). The Web delivers documents in hypertext format (http: hypertext transfer protocol) and is not just a hypertext system but a hypermedia system, allowing text, sound, graphics and video to be mixed together in a multimedia format. Most Web pages are currently produced in hypertext mark-up language (HTML).

2. Search engines

A search engine 'is a computer program that searches for documents containing keywords or phrases of interest to the user' (Peterson, 1997:2). A search engine acts as an information robot or 'info-bot', a sort of obedient servant finding dozens or even thousands of documents quickly. Search engines have three major elements. The first is the spider, also called the crawler (Kapoor, 1999:2), indexing robot (Lynch, 1997:2), robot/bot (Chamberlain, 2000:1), worm or search bot (Hermans, 1997:2) and harvester (Taylor, 1999:2). The spider visits every site (each site being a set of documents, called pages) it can identify on the Web, examines these pages and extracts indexing information that can be used to describe them. This is what it means when someone refers to a site as being 'spidered' or 'crawled'. Everything the spider finds goes into the second part of a search engine, the index, sometimes called the catalog. This process, which varies among search engines, may involve simply recording most of the words that appear in Web pages, or performing sophisticated analyses to identify key words and phrases (Lynch, 1997:2). This data is stored in the search engine's database, along with an address, termed a URL (uniform resource locator). The third part of a search engine is the search engine software.
This is the program that sifts through the millions of pages recorded in the index to find matches to a search and to rank them in order of what it believes is most relevant. Chamberlain's (2000:1) definition of search engines as 'huge databases of Web page files that have been assembled automatically by machine' refers mainly to the index/catalog part of the search engine. Kapoor (1999:2), however, defines a search engine as 'an HTML interface to a database of locations of information'. There are different methods by which search engines classify/index information. Kapoor (1999:2) distinguishes between search engines, directories and hybrid search engines. Search engines create their listings automatically: they crawl the Web, and people then search through what they have found. A directory depends on humans for its listings: you submit a short description to the directory for your entire site, or editors write one for the sites they review, and searches of the directory look for matches only in these submitted descriptions. Hybrid search engines maintain an associated directory; inclusion in the directory is not automatic, however, but according to Kapoor 'a combination of luck and quality'. Peterson (1997:2) provides the following categories for search engines:
- robotic search engines, which use a Web robot to retrieve a significant number of documents;
- mega-indexes, also known as meta-indexes, which do not have their own databases but are instead linked to robotic search engines;
- simultaneous (parallel) mega-indexes, also known as multi-threaded meta-indexes, which access robotic search engines in parallel (simultaneously) and present the unified results as a single package;
- subject directories, which are manually maintained, browsable and often searchable with robotic search engines; and
- robotic specialized search engines, which focus on a portion of the Net.

How do search engines work?
Search engines gather and organize (classify/index) descriptions and locations of information available on the Net and store this information in a catalog/index, thus creating their own database of information. Whenever one searches the Web using a search engine, one asks the engine to scan its index/catalog, to sift through the recorded pages and to provide matches to the keywords and phrases it is given. The query produces a list of Web resources (URLs) that can be clicked on to connect to the sites identified by the search. Search engines use selected software programs to search their indexes, presenting the findings in some kind of ranking according to what they believe is most relevant. Although software programs may be similar, no two search engines are exactly the same in terms of size, speed and content. No two search engines use exactly the same ranking scheme, and not every search engine offers the same search options. Chamberlain (2000:2) says it is important to remember that when we use a search engine, we are not searching the entire Web as it exists, but only a portion of it, captured in a fixed index created at an earlier date. It is, she says, difficult to say how much earlier the index was captured. Spiders regularly return to the Web pages they index to look for changes. When changes occur, the index is updated to reflect the new information, but updating can take a while, depending on how often the spiders do their crawling and how promptly the information they gather is added to the index. Until a page has been both 'spidered' and 'indexed', the new information cannot be accessed.

Multiple search engines

There are two broad types of search engines: individual and multiple search engines. Individual search engines compile their own searchable databases, as described above. Examples of individual search engines are AltaVista, HotBot, Excite, Infoseek and the latest, Northern Light.
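The individual search engine described above can be sketched in miniature: a set of pages a 'spider' has fetched, an inverted index built from them, and search software that ranks matches. This is a toy illustration only; the URLs and page text are invented, and real engines index millions of pages with far more sophisticated ranking.

```python
# Toy sketch of the three search-engine parts: fetched pages
# (standing in for a spider's harvest), an inverted index, and
# search software that ranks matches by total term count.
# All URLs and text are invented for illustration.

from collections import defaultdict

pages = {
    "http://example.org/a": "web search engines index the web",
    "http://example.org/b": "spiders crawl pages and build an index",
    "http://example.org/c": "directories depend on human editors",
}

# Indexing: map each word to the pages it occurs in, with counts.
index = defaultdict(dict)
for url, text in pages.items():
    for word in text.lower().split():
        index[word][url] = index[word].get(url, 0) + 1

def search(query):
    """Return URLs matching any query word, ranked by total term count."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    return sorted(scores, key=scores.get, reverse=True)

print(search("index the web"))
```

The ranking here is deliberately crude (raw term counts), which mirrors the article's point: such an engine matches words, not meaning.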
Multiple search engines, also known as mega or parallel search engines (Notess, 1998:1) or meta search engines (Chamberlain, 2000:1), do not crawl the Web to compile their own databases. Instead they search the databases of multiple individual search engines simultaneously and then present the results. The results are presented in two ways: some display them in a single merged list from which duplicate entries have been removed, while others do not collate results but display them in separate (multiple) lists as they are received from each engine, so duplicate entries may appear. Multiple search engines do not offer the same selection of search options as individual search engines do, and do not return all the results retrieved from the individual engines they search. However, the results they do return are drawn from the top of each search engine's list and tend to be more relevant. Multiple search engines are fast and therefore very useful when the searcher is in a hurry or wants a quick overview of a subject and/or unique term. They do offer timesaving features for those doing an exhaustive search on obscure topics. Because multiple search engines cannot take full advantage of the unique features of the individual search engines, the quality of results is, in a sense, only as strong as the weakest link (Hubbard, 2000:4). Inference Find, Dogpile and MetaFind are three mega search engines that offer some advantages for comprehensive searching, even while suffering from some of the problems mentioned above.

Search engine shortcomings

The Net, also referred to as 'the information highway', enables all types of users to access and share information with one another. Users are therefore exposed to large volumes of dynamic information, retrieved from thousands of computers spread all over the world.
However, as the flow of information increases, so does 'noise' in the system, and 'truth' becomes ever more elusive. Users are beginning to suffer from severe information overload. Paul Saffo, a director at the Institute for the Future in Menlo Park, California, says 'information overload is not a function of the volume of information out there, it is a gap between the volume of information and the tools we have to assimilate that information into useful knowledge' (Foley, 1995:1).

1. Restricted intelligence

Search engines and services like Yahoo! and AltaVista are currently embedded in Netscape, MS Explorer and AOL and aid us in keyword searching. These engines are Boolean word-matching engines and deal only with content, not with the context or meaning of the query. Search engines therefore blindly retrieve documents available on the Net, without making any judgement about the relevance of the documents or the quality of the links they contain (Kapoor, 1999:1). Some, for example Northern Light, do return a list ranked according to relevance, but only relevance to the keywords they were given. Many of these engines offer no filtering function or intelligent search strategy and return hundreds and sometimes thousands of responses to a query, thus adding to the problem of information overload. Information on the Net is dynamic. Search engines often refer to information that has moved to an unknown location or has disappeared. They do not learn from these searches and will look for these locations every time they are asked. Search engines allow a user to search for information in many different ways, but seldom 'remember' what was done previously. They do not, for example, store any query results, so the same query will be repeated over and over again, starting from scratch each time.
Taylor (1999:3) provides the following summary of why search engines are not always the solution to finding relevant information at present:
- Relevant information can be missed because sites contain types of resource in addition to HTML text, for example images, databases and PDF documents.
- Search engines frequently do not harvest every page on a site, but often only the top two or three hierarchical levels, thus missing significant documents which, on larger and more complex sites, may be located in lower levels of the hierarchy.
- Search engines, especially the more comprehensive ones, may index sites infrequently and may therefore not contain the most current data.
- Irrelevant information can be retrieved because the search engine has no (or very few) means of distinguishing between important and incidental words in the document text.

Some intelligence is thus required from these engines to help find only the information that is useful and relevant: the right information available to the right people at the right time.

2. Automated indexing

The Web lacks standards to facilitate automated indexing. There is virtually no bibliographic control on the Net (Hubbard, 2000:2). As a result, documents are not structured in such a way that programs can reliably extract the routine information a human indexer might find through a cursory inspection: author, date of publication, length of text or subject matter. In contrast to human indexers, automated programs have difficulty identifying characteristics of a document such as its overall theme or genre, whether it is a poem, a play or even an advertisement. A professional indexer can describe the components of individual pages of all sorts (from video to text) and can clarify how those parts fit together into a database of information. Analyses of a site's purpose, history and policies are beyond the capabilities of a crawler program.
Another drawback of automated indexing is that most search engines recognize text only, although there is intense interest in the Web's ability to display images. Some research has moved towards finding colour or patterns within images, but no program can deduce the underlying meaning and cultural significance of an image. Automated spiders that crawl sites on the Web can only read open text formats, such as HTML files, and cannot record more than the basic file attributes of non-text files, including PDF, sound, image and video files. Furthermore, most search engines cannot survey frame-based sites or dynamic pages, and have problems with pages in XML format (Hubbard, 2000:5). There are insurmountable difficulties in getting computers to comprehend language. Language is replete with synonyms and spelling variations, and is full of variable contextual meanings and linguistic nuances, making full-text databases rather blunt tools in their overreaching attempts to process natural language (Hubbard, 2000:3). The great promise of automated indexing tools is that they provide a level of detail greater than any humanly powered method of indexing. Automated searching aids are necessary to keep up with the millions of pages being added to the Net daily. Finding dead links and maintaining information on the Net are needs that can be met very effectively by existing search tools. When it comes to sorting through all of this data, however, and making the best sense of what is available online, the power and assurance of human understanding and editorial control must be called upon (Hubbard, 2000:6).

Emerging trends

The Web has become the largest, most complex search space ever. As the Web continues to grow, search tools must evolve and adapt to remain useful. Both user requirements and technological developments will exert influence in shaping the future of information retrieval on the Net.
East Stroudsburg University (1999) and the University at Albany (2000) describe second generation search engine services on the Web and suggest watching the coming trends in these services. These trends all include the human elements of:
- concept processing: second generation services such as Ask Jeeves apply different kinds of concept processing to a search statement to determine the probable intent of the search, often by using human-generated indexes. With these services, the burden of coming up with precise or extensive terminology is shifted from the user to the engine; these services are therefore taking on the role of thesauri;
- collective judgement: search services such as Google and Direct Hit derive their results from the behaviour of millions of Web users; and
- directories: first generation search services have partnered with second generation services and/or include content from human-gathered directories with their search results, to supplement documents retrieved from the spider-indexed Web.

1. Agent/filter technology

There are two main approaches to sifting through information (Omar, 1997:1). The first puts the responsibility in the hands of the user, who searches for what he/she needs: the user PULLS the information, using search engines that search by index, keywords and subject. With the second method, the responsibility of providing information is shifted to special software systems on the Net. Based on predefined criteria about the information, this method PUSHES the information to the user. It can, however, push information the user does not want, that is, junk mail, and so requires a next level of sophistication: agent/filter technology. Agent/filter technology, underpinning a push-based solution, allows both the sender and recipient to filter and select information.
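The push-side filtering described above can be illustrated with a minimal sketch: incoming items are screened against a user's stated interests before delivery. The interest list and feed items are invented for illustration, and, as Berst's distinction below makes clear, a real agent would also learn and act rather than merely screen.

```python
# Minimal sketch of a push-side filter: screen incoming items
# against predefined user interests before delivering them.
# Interests and feed items are invented for illustration.

interests = {"search", "xml", "agents"}

incoming = [
    "New XML tools for publishers",
    "Celebrity gossip roundup",
    "Intelligent agents on the desktop",
]

def keep(item):
    """Deliver an item only if it mentions one of the user's interests."""
    words = {w.strip(",.").lower() for w in item.split()}
    return bool(words & interests)

delivered = [item for item in incoming if keep(item)]
print(delivered)
```

Note that this is pure screening against fixed criteria; nothing here adapts to the user's changing preferences, which is precisely the gap agent technology is meant to fill.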
This technology provides personalized information automatically, based on the changing preferences of the user as a result of a two-way dialogue (Omar, 1997:2). Although both approaches involve agent/filter technology, this article focuses only on PULL strategies using search engines or 'smart spiders' (Kapoor, 1999:1). Many vendors refer to their filters as 'agents' but, says Berst (2000:1), we should not be confused: real agent software makes decisions and takes action, while filters merely screen out information. Harper (1997:4) agrees with this statement when he describes an intelligent agent as 'a software entity that assists people and acts on their behalf'. He provides the following attributes for intelligent agents:
- Agency: the degree of independence the agent exhibits. At the least, he says, the agent must be able to operate on the Net while the user is disconnected or not in the process of interacting with the Web. Feldman (1999:3) says agency is 'basically autonomy plus a little social ability or interactivity'.
- Intelligence: the amount of learned behaviour and the degree of reasoning an agent may have. At a minimum, he says, there must be a statement of preferences or a set of rules pre-defined by the user to follow, with an inference mechanism to interpret and act on these rules; higher levels of intelligence include the ability of the agent to learn from and adapt to its environment.
- Mobility: the ability of agents to migrate in a self-directed way from one host to another on a network in order to perform their assigned duties.

Ketsavapitak (1997:5) provides the following characteristics of intelligent agents:
- Agents are autonomous; that is, an agent has control over its own actions. Feldman (1999:2) says autonomy is the first and foremost common criterion for agents.
She continues to say that autonomous agents use the knowledge of their owner's needs and interests to undertake tasks that their owner does repeatedly.
- Agents are goal-driven; they have a purpose and act in accordance with that purpose.
- Agents are reactive; that is, an agent senses changes in its environment and responds timeously. All agents continue to run, even when the user is gone.

Rather than retrieving documents blindly, smart spiders (Kapoor, 1999:1) will thus be able to make some judgements about the relevance of documents and the quality of the links they contain. Hermans (1997:4) speaks of intelligent software agents that will be able to search for information based on context. He says these agents will deduce this context from user information or by using tools such as a thesaurus, enabling them to search on related terms, or even on concepts. Coleman (1997:3) mentions Verity as a good example of a product providing information in context. This information/text retrieval technology improves somewhat on basic search engines. With capabilities well beyond basic search, the Verity Knowledge Suite product family addresses the key problems of retrieving and sharing knowledge: automated classification and categorization, intelligent group dissemination and information profiling, and organizing content for corporate re-use. This approach, says Demers (1997:1), increases the relevancy of returned results, but fails to search across all data types, thus limiting effective exploitation of all the information assets in the organization and contributing only partially to comprehensive and robust knowledge management.

2. Knowledge retrieval

Knowledge retrieval is the next wave in information search and retrieval technology. This technology uses advanced algorithms and sophisticated processes to enhance queries, access all data formats and then accurately sift the information it finds to return only the most relevant results (Demers, 1997:1).
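One simple way a query can be 'enhanced' in this spirit is to expand it against a thesaurus of related terms, so that the engine also searches on words the user did not type. A minimal sketch follows; the two-entry thesaurus is invented, standing in for the full lexical resources a real product would draw on.

```python
# Expand each query word with related terms from a (tiny, invented)
# thesaurus, so documents using related wording can be matched too.

thesaurus = {
    "car": ["automobile", "vehicle"],
    "fast": ["quick", "rapid"],
}

def expand(query):
    """Return the query words plus any related terms from the thesaurus."""
    terms = []
    for word in query.lower().split():
        terms.append(word)
        terms.extend(thesaurus.get(word, []))
    return terms

print(expand("fast car"))
```

The expanded term list would then be handed to ordinary keyword search, which is how concept-level matching can be layered on top of a word-matching engine.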
Coleman (1997:5) mentions Excalibur as an example of a product that 'cannot only do a multi-media search but also uses the information it finds to focus the query or interact with the user to refine the query'. Excalibur is a leading developer of high-performance software products for the search and retrieval of knowledge assets over all media types (paper documents, text, images, video and other multimedia data types) throughout intranets, LANs/WANs, extranets and the Internet (Demers, 1997:4). The specific advantages that Excalibur brings to the problem of knowledge retrieval are twofold: semantic networks and pattern recognition (Demers, 1997:1). Semantic networks are built-in knowledge bases, derived from published dictionaries, thesauri and other lexical resources. The semantic network automatically identifies words and concepts that are related to the content of the query. The network even helps users to distinguish among the multiple meanings of words. The ability to paraphrase the query frees the user from having to know the words an author might have used in discussing an idea. Pattern recognition (also known as Adaptive Pattern Recognition Processing, or APRP) provides robust 'fuzzy spelling' to recognize query terms even if hit words within a document are misspelled, owing to errors made by the author or the searcher, or errors made when transcribing foreign names (Demers, 1997:2). Pattern recognition and semantic networks can make users more productive, no matter what the nature or quality of the text they are working with. These technologies give users direct access to relevant information and minimize the time they spend filtering through irrelevant or false results. Equipped with the information they need, and given navigation channels through captured organizational expertise, people work more knowledgeably, apply their individual skills and make better decisions.
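The 'fuzzy spelling' idea can likewise be sketched with a string-similarity ratio: a document word counts as a hit for a query term when the two are close enough, so a transposed letter no longer hides a match. This is only an illustration of the effect, not of APRP itself, and the 0.8 threshold is an invented value.

```python
# Fuzzy matching sketch: accept a document word as a hit for a
# query term when a similarity ratio clears a threshold, so common
# misspellings still match. The 0.8 threshold is arbitrary.

from difflib import SequenceMatcher

def fuzzy_match(term, word, threshold=0.8):
    """True if word is 'close enough' to the query term."""
    return SequenceMatcher(None, term, word).ratio() >= threshold

print(fuzzy_match("automobile", "automoblie"))  # transposed letters still match
print(fuzzy_match("automobile", "aardvark"))    # an unrelated word does not
```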
The differentiating technology here is that it can not only search all data types but also applies semantic networks and APRP technologies to filter results, providing higher levels of both accuracy and relevance to the user. In other words, it can search all types of data and do sophisticated pattern matching (intelligent search) to help identify knowledge (Coleman, 1997:5). The objective of each technology is to maximize the accuracy of search and retrieval while minimizing the costs of system set-up and maintenance. With these techniques users are able to find documents that would be impossible to find otherwise. Knowledge retrieval solutions are differentiated by two primary factors: comprehensiveness, to provide access to all information assets beyond the traditional text collection; and accuracy, to retrieve and rank the most relevant responses to best serve the user's need for precise information, or knowledge (Demers, 1997:2). He continues to say that knowledge retrieval technology is distinguished by the components of the acronym ASSETS:
- Accuracy is the critical variable that determines the effectiveness of a knowledge retrieval product. Accuracy is measured with two statistics: precision (what percentage of the documents retrieved is useful, or relevant to the request) and recall (how many of the useful or relevant documents in the system are found).
- Scalability is more than just fast searching. It is the ability to maintain search performance even when the demands on the system rise by orders of magnitude.
- Security and protection of corporate information assets on intranets is critical.
- Extensibility refers to the ability to start with text-on-paper assets and grow to multiple media (images, videos).
- Transparency refers to the invisibility of the incorporation of knowledge retrieval to the end user. Simple screens or even microphones may serve as the way people take advantage of these helpful techniques.
- Simplicity refers to how easy the solution is to use, relying on plain English and visual queries.

3. Access to the Invisible Web

Despite its uniform interface and seamlessly linked integration, the Web is not a single coherent element. There are two distinct elements: the visible and the invisible Web. The visible Web consists of manually produced, static pages. It provides the same generic information to everyone and is therefore available for indexing by all search engines. The invisible Web consists of computer-generated, dynamic pages and provides customized information according to specific requirements (Green, 2000). Andrews (1997:1) says software developers of search engines are seeking to exploit the thorny problem of invisible Web databases that search engines cannot 'see'. The opportunity exists, he says, because Web pages that are generated dynamically from databases are different from what are generally known as 'flat HTML' pages. The latter are generated one at a time, by people using authoring tools or coding by hand, and are then left on a server until someone requests them. Dynamically generated Web pages do not exist as separate files, so spiders from the major search engines do not generally discern them. The problem is intensifying because of the proliferation of off-the-shelf tools that link databases to the Web, whether as whole sites or as site components. This means that proportionately less and less is available for search engines to see. One response to this problem has been to divide the Web into vertical sections intended to appeal to specific interests. Kapoor (1999:1) predicts that there will be an explosion of vertical search sites, providing access to deep, tightly focused databases.
Search precision also benefits (Hubbard, 2000:3) from narrowing search domains to specific subjects: honing the scope of what information is searched, perhaps by limiting searches to certain domains or languages, or conducting specialized searches in subject-oriented search engines. Andrews (1997:2) predicts a change in how people will use the Web in future. Instead of wandering around bookmarking whatever looks interesting, he says, people are already going online with a specific goal in mind. Databases, he continues, are listed in categories, and users choose which to search based on brief descriptions, instead of searching through them all at once.

4. Newsgroup searching

Thomas (1998:1) says there are two different ways to look at the Net as a tool: one is to consider it an information source, the other a unique means for connectedness and communication. As a source of information, the Net makes facts more readily available and allows research to be done more quickly. As a tool for greater connectedness and communication, the Net has, according to Thomas (1998:2), great potential for promoting higher-order literacy. The Net, he says, possesses the potential for a free exchange of ideas as a result of many connections and the corresponding communication that would not have been possible before. Viewed in this way, he continues, the Net provides four key things that have the potential to further higher-order literacy and critical/creative/constructive thinking more dynamically than ever before:
- Great amounts of information are available quickly; the information side of the Net is viewed here as a starting point for its use, rather than as the final product/benefit of using it.
- Many minds from many different places have the opportunity to learn from each other (knowledge sharing).
- The way in which people communicate on the Net is one of the more 'pure' forms of communication, weeding out factors such as race, gender and disabilities that might hinder a deep and meaningful exchange of ideas.
- It produces synergy: information, knowledge, learning and understanding can be shared interactively and dynamically on the Net, and when one is connected to the Net as a communication and connectedness tool, one is connected not just to information but to the minds of other people.

Green (2000:8) predicts that newsgroup searching will become more important as individuals use the Net to seek out experts to help with problems. He says there are literally thousands of newsgroups covering different topics, and a large number of specialized newsgroup search engines have emerged as a result. He mentions Deja News, probably the most widely known newsgroup search engine. It contains, he says, a directory of selected newsgroups which users can browse through, or search for a particular group, topic or posting.

5. XML vs HTML

HTML is dead (Green, 2000:10). While HTML's ease of use fuelled its widespread adoption, it is limited in that it is primarily concerned with the layout/design of a Web page rather than with the information that actually appears on that page. Considering that the primary use of the Web is information retrieval, this design is something of a drawback. XML is an open technology that offers tremendous possibilities for electronic publishing, e-commerce, information retrieval and data exchange, for it consists of rules that enable anyone to create their own mark-up language. It not only enables explicit description of Web page content, but also describes the rules for manipulating each data set contained within the information (Green, 2000:11). However, while XML will deliver great benefits for searching, publishing and exchanging information, Green continues, these benefits will not be realised without some effort.
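The difference Green describes can be seen in a small example: where HTML tags describe layout, XML tags can describe what the data means, letting a program target a specific field directly instead of scraping page text. The <book> tag vocabulary below is invented for illustration; in practice, tag sets would have to be agreed per industry.

```python
# A record marked up for meaning rather than layout: a search tool
# can retrieve the author field directly, with no guessing about
# page structure. The <book> vocabulary is invented for illustration.

import xml.etree.ElementTree as ET

record = """
<book>
  <title>Web Searching</title>
  <author>A. van der Merwe</author>
  <year>2000</year>
</book>
"""

book = ET.fromstring(record)
print(book.find("author").text)
```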
Standards for the tags describing information in different disciplines will have to be agreed by each industry. Web publishers will require greater sophistication than a simple knowledge of HTML, graphics and a few other applications. New XML tools are needed, as are computer programmers and information scientists able to interpret the content of the information being published. Lastly, search engines will need to learn the standard tag structure agreed by each industry or interest group.

6. Natural language interface

As already discussed, second-generation search engines will not only look at content, but will also consider the context and meaning of the terms provided. Green (2000:6) mentions two natural language search tools, each adopting a different philosophy in developing its solution:

AskJeeves operates by matching a user's query against a database of template questions. If there is no match, the user is presented with the nearest alternatives from the database and asked to select the most appropriate. Artificial intelligence experts have, however, criticized the company's natural language claims.

The Electric Monk conducts a syntactical analysis of the query using natural-language algorithms. These algorithms also make use of thesauri to consider alternative related words.

7. A free service?

Providing quality information with the help of sophisticated search tools requires serious resources. Research suggests that users will be charged in future: Green (2000:12) talks about 'micro-payments' for search. Second-generation search tool developers will probably also focus on providing the best solution for the moment. The possibility of search outsourcing, in line with current business process trends, is also an option for the future.

Conclusion

We have more information available on the Net than we can possibly digest.
Search engines presently stand ready with power and speed to provide hundreds and even thousands of documents quickly and effortlessly. What we need, however, are intelligent agents that provide relevant information every time. The Net, for its part, has to be structured and indexed to make it more 'search friendly'. What is needed is a combination of human and machine qualities to ensure an effective search tool for future use.

References

Andrews, W. 1997. Challenge for spiders: searching invisible Web. http://www.internetworld.com/print/1997/02/03/industry/spiders.html
Berst, J. 2000. Intelligent filters to the rescue! (sort of…). AnchorDesk. http://www.zdnet.com/anchordesk/story/story_1035.html
Chamberlain, E. 2000. Search engines. University of South Carolina. Bare Bones 101. http://www.sc.edu/beaufort/library/lesson1.html
Coleman, D. 1997. Knowledge retrieval. http://collaborate.com/hot_tip/tip0697.html
Demers, M. 1997. Knowledge retrieval: the first step to knowledge management. http://www.excalib.com/news/tradepress/kmworld.nov04.97.html
East Stroudsburg University. 1999. Search engine types: first and second generation. http://www.esu.edu/library/Enginehome.htm
Feldman, S. 1999. Intelligent agents: a primer. Searcher, 7(9). http://www.infotoday.com/searcher/oct99/feldman+yu.htm
Foley, J. 1995. Managing information: infoglut. http://www.iweek.com/551/51mtinf.htm
Griffin, M.G. 1999. Computers in psychology. PSY 302. http://www.umsl.edu/~mgriffin/psy302/WWW_Definition.html
Green, D. 2000. The evolution of Web searching. Online Information Review, 2(2):124-137. http://general.rau.ac.za/infosci/information/studyguide/First/unit_5/green.htm
Harper, N. 1997. Intelligent agents and the Internet. http://osiris.sunderland.ac.uk/cbowww/AI/TEXTS/AGENTS3/agents.htm
Hermans, B. 1997. Intelligent software agents on the Internet: an inventory of currently offered functionality in the information society and a prediction of (near) future developments.
http://www.broadcatch.com/agent_thesis/h12.htm
Hubbard, J. 2000. Indexing the Internet. http://www.tk421.net/essays/babel.shtml
Kapoor, J. 1999. Web search engines. http://yallara.cs.rmit.edu.au/~achattar/search-engines/future.htm
Ketsavapitak, D. 1997. Intelligent agents. http://www.uis.edu/~ketsavap/paper.html
Leigh, D. & Kelmer, M. 1996. Learning on the Net. In: Forrest, E.J. (ed.). Issues in interactive communication: the impact of the new technologies on society. Florida State University. http://www.fsu.edu/~ic-prog/issuesbook/chapter5.html
Lynch, C. 1997. Searching the Internet. Scientific American, March 1997. http://www.sciam.com/0397issue/0397lynch.html
Notess, G.R. 1998. On the Net; toward more comprehensive Web searching: single searching versus mega searching. Online, March 1998. http://www.onlineinc.com/onlinemag/OL1998/net3.html
Omar, M. 1997. How to deal with information overload on the Internet? http://star.arabia.com/971127/te2.html
Peterson, R.E. 1997. Eight Internet search engines compared. http://www.firstmonday.dk/issues/issue2_2/peterson/index.html
Thomas, M.M. 1998. Furthering higher-order literacy using the Internet. http://members.aol.com/csholumkc/mtholtranscript.html
Taylor, C. 1999. An introduction to metadata. http://www.library.uq.edu.au/iad/ctmeta4.html
University at Albany. 2000. Second generation searching on the Web. http://www.albany.edu/library/internet/second.html

Disclaimer

Articles published in SAJIM are the opinions of the authors and do not necessarily reflect the opinion of the Editor, Board, Publisher, Webmaster or the Rand Afrikaans University.
The user hereby waives any claim he/she/they may have or acquire against the publisher, its suppliers, licensees and sub-licensees and indemnifies all said persons from any claims, lawsuits, proceedings, costs, special, incidental, consequential or indirect damages, including damages for loss of profits, loss of business or downtime arising out of or relating to the user's use of the Website.

ISSN 1560-683

Published by InterWord Communications for the Centre for Research in Web-based Applications, Rand Afrikaans University