Windows Explorer: The Index Server Companion - Ariadne Web Magazine for Information Professionals

Brett Burridge describes the Index Server Companion, an application he has created that allows Microsoft Index Server to index content from remote websites and ODBC databases.

Microsoft’s Index Server is a service supplied with the Windows NT 4.0 Server and Windows 2000 Server products. The service indexes HTML and other content residing on the file system. These indexed files may be queried using a number of techniques, but of particular relevance to web developers is the ability to build completely customised search facilities based on Active Server Pages (ASP) by making use of Index Server’s Component Object Model (COM) objects.

A limitation to Index Server is that it can only be used to index files via the file system. Furthermore, integrating file and database searches isn’t always straightforward. This article looks at how the Index Server Companion overcomes these limitations, enabling Index Server and ASP to form the basis of some sophisticated web applications.

An Overview of Index Server

Index Server is one of Microsoft’s most useful server products. On the administrative side of things, it is easy to install, performance while indexing content is good, and once installed maintenance tasks for the systems administrator are minimal. The development of customised search applications using ASP is also made fairly straightforward through the use of the Query and Utility server components [1].

Figure 1.
Index Server is administered through an easy-to-use interface

The main limitation of Index Server is that it can really only be used to index content hosted on servers on the same machine (or network) as the machine hosting the Index Server service. Although it is possible to set up a share to a Unix or Linux web server using a file sharing solution such as SAMBA, this isn’t always satisfactory: Index Server is not case sensitive with respect to filenames, which can cause problems when displaying search results. The Apache web server’s spelling module, mod_speling [2], can correct some case-sensitivity issues, but not all servers will have this module installed.

Added to this problem is the fact that organisations such as UK academic institutions will often have web content distributed over a number of web servers that may be hosted on a number of different operating systems. For example, the University of Essex, where I once worked, had at least 15 web servers with a substantial amount of content hosted on them, and it isn’t by any means the largest university in the country.

Another issue is that it can be a chore to prevent Index Server from indexing certain content. Since it indexes the file system, it has no concept of the Robots Exclusion Standard specification’s robots.txt files [3]. It also has no concept of the ‘robots’ meta tag.
Although Index Server can be prevented from returning files with specific filenames and/or paths, it can get quite fiddly, as can be seen from this example from the search facility of my personal website:

    " AND NOT #path ASPAlliance AND NOT #path ** AND NOT #path download AND NOT #path images AND NOT #path adm AND NOT #path backgrounds AND NOT #path dbase AND NOT #path test AND NOT #path script AND NOT #filename *.class AND NOT #filename *image.asp AND NOT #filename *.asa AND NOT #filename *.css AND NOT #filename redirect.asp AND NOT #filename *postinfo.html AND NOT #filename readme"

Finally, an increasing amount of data accessible via the Internet is stored in databases rather than HTML. A comprehensive search facility should ideally return search results from both static HTML content and content stored in databases.

Extending the Functionality of Index Server

Retrieving and indexing content from a web server by use of a web robot solves many of the problems Index Server has. The web robot is able to mimic a web browser, starting at one page in the site and traversing the links in the site until it has retrieved all of the pages of the site. The robot will potentially be able to retrieve content from any web server, regardless of the platform it is hosted on. Two products that allow you to do this are Microsoft’s Site Server 3.0 and the author’s own Index Server Companion.

Microsoft Site Server 3.0

Microsoft’s Site Server 3.0 software suite has a Search application that enhances Index Server by allowing you to (amongst other things) retrieve and index content from remote websites using an integrated web robot. For an overview of Site Server 3.0 Search, take a look at a previous article on Ariadne [4]. Unfortunately Site Server 3.0 Search has a few shortcomings, including:

- Site Server 3.0 isn’t the easiest of applications to install, especially if SQL Server is installed on the same machine.
- The product wasn’t designed for Windows 2000 Server.
- It is no longer in active development.
- Few third party hosting companies support Site Server 3.0.
- Even with academic licensing discounts, Site Server 3.0 costs a lot of money, which cannot always be justified if you only want to use the Search application of the software suite.

Index Server Companion

The Index Server Companion is a cost effective method of retrieving content from remote web servers for Index Server to index. It also allows retrieval of content from ODBC databases, which can subsequently be indexed by Index Server.

Features of the Index Server Companion

The main features of the Index Server Companion are:

- Enables Index Server to allow searching of potentially any web server or ODBC compliant database.
- Integrated web robot extracts content from websites.
- Includes support for robots.txt files and robots meta tags.
- Robot can negotiate sites using HTML Frames.
- An optional mode allows Query Strings to be treated as distinct URLs (e.g. treat http://www.aspalliance.com/brettb/WebJobMarket.asp?Skill=ASP as being a distinct URL from http://www.aspalliance.com/brettb/WebJobMarket.asp?Skill=JSP).
- Ability to retrieve binary files from servers, including Adobe Acrobat PDF, Microsoft Office documents and even images.
- Support for full or incremental project updates of both web and database content, meaning that Index Server only has to re-index content that has changed.
- Configuration of the Index Server Companion is through the editing of a plain text configuration file.
- Index Server Companion can be run from the command line, and scheduled using the Windows Task Scheduler.
- Full reporting of activity to an external plain text log file.
- Flexible output options mean that administrative access to Index Server is not necessarily required.
- Fully documented VBScript examples show how to make use of the Index Server Companion in ASP pages.
- Detailed documentation in Microsoft’s HTML Help format.
- Fully documented source code.
- Access to product updates and technical support.

Figure 2. The Index Server Companion contains fully searchable documentation in Microsoft’s HTML Help format

System Requirements

The Index Server Companion is written in Perl 5 for Windows NT or 2000. Although any version of Perl 5 could potentially be used to run the Index Server Companion, it has been specifically developed using ActiveState’s ActivePerl. ActivePerl is available as a free download from the ActiveState website [5]. It also requires a server running either Index Server on Windows NT 4.0 Server, or the Indexing Service on Windows 2000. The Index Server Companion does not have to be run from the machine on which the Index Server is installed.

Configuring and Running the Index Server Companion

The Index Server Companion is a Perl script that needs to be run from the Windows command line. There is a single mandatory parameter, which tells the script which configuration file to use. So to run the Index Server Companion for the Sample Project, an MS-DOS Command Prompt is opened in the folder where the Index Server Companion files are installed and the following is typed:

    IndexServerCompanion.pl -c="SampleProject/SampleProject.ini"

It is of course possible to run the Index Server Companion from .bat scripts, which can then be scheduled using the AT command or the Windows Task Scheduler. This makes it straightforward to update the Index Server’s index of website and database content at specific times and frequencies.

The configuration file is a plain text file containing a number of settings.
A sample configuration file is shown below:

    [ProjectName] = ASPArticles
    [CreateLogFile] = yes
    [LogFileDir] = Samples/ASPArticles/Logs
    [InfoStoreDir] = Samples/ASPArticles/InformationStore
    [OutputDir] = Samples/ASPArticles/Output
    [Verbose] = yes
    [StartURL] = http://www.aspalliance.com/brettb/Default.asp
    [BaseURL] = http://www.aspalliance.com/brettb/
    [UserAgent] = Index Server Companion 1.1 (admin@server.com)
    [UseRobotsTextFile] = yes
    [UseRobotsMetaTag] = yes
    [UseURLQueryStrings] = no
    [CrawlType] = incremental
    [MaxURLSize] = 1024
    [MaxNumberOfURLs] = 4096
    [URLExtensions] = .htm .html .asp .aspx .jsp .php .cfm
    [FileExtensions] = .doc .pdf .rtf
    [AddURLToTitle] = yes
    [CaseSensitiveServer] = no
    [AddRowToTitle] = no
    [RefreshAllRows] = no

The Index Server Companion is supplied with full documentation in Microsoft’s HTML Help format that describes each of the configuration settings. When the script is run, the Index Server Companion will display details of its status in the Command Prompt window. A detailed log file is also created.

How the Index Server Companion Works

The Index Server Companion script contains a fully functional web robot that is able to extract the content from all of the required pages of the specified website. It contains support for the Robots Exclusion Standard specification [3], and support for the robots meta tag contained within individual pages.

Each file extracted from the website is modified to contain a special meta tag that gives the original URL (for web content). It is then saved to disk, from where it can be indexed by Index Server. The contents of these special meta tags can then be used by the ASP page displaying the results of a web search, so that clicking on a search result item will display the original URL.
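The tagging step described above can be sketched in a few lines. The Python fragment below is only an illustration and is not taken from the Companion's Perl source: the function name tag_with_source_url and the meta tag name isc_url are assumptions, since the article does not state the real tag name.

```python
import re
from html import escape

# Hypothetical meta tag name; the article does not give the real one.
URL_META_NAME = "isc_url"


def tag_with_source_url(html: str, url: str) -> str:
    """Return html with a meta tag recording the page's original URL."""
    meta = f'<meta name="{URL_META_NAME}" content="{escape(url, quote=True)}">'
    # Insert directly after the opening <head> tag when there is one.
    # The optional group keeps the pattern from matching <header>.
    head = re.search(r"<head(\s[^>]*)?>", html, flags=re.IGNORECASE)
    if head:
        return html[:head.end()] + meta + html[head.end():]
    # No <head> element: prepend the tag so the URL is still recorded.
    return meta + html
```

A page tagged this way carries its source URL with it once saved to disk, which is what allows a search results page to link back to the live site rather than to the local copy.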
Unfortunately Index Server will not allow you to retrieve the content from custom meta tags without making a minor modification in the Index Server’s Microsoft Management Console (MMC), so there is also a special mode in the Index Server Companion that adds the original URL to the page’s HTML <title> tag.

Searching Web Content with the Index Server Companion

Index Server Companion allows content from remote websites to be retrieved and consequently indexed by Index Server. A working example of this may be seen online [6]. This is a search page running on Internet Information Server 4.0 (Windows NT 4 Server) that allows you to search my ASPAlliance site, together with the articles I have written for Ariadne.ac.uk and ASPToday.com.

Since I don’t have administrative access to the Index Server on the machine hosting the search page, I have used the feature of the Index Server Companion that allows the document’s original URL to be added to the original title. For example, the <title> tag of the ASPToday article “ASP Documentation Systems” at http://asptoday.com/content.asp?id=1435 is modified in the saved file to read:

    <title>ISC_URL=http://asptoday.com/content.asp?id=1435    ASP Documentation Systems</title>

The URL and original title are separated by a tab character. The search results page then contains a small piece of ASP code to split this title back into the article’s URL and original title:

    <%
    'Extract the document's URL and title
    If InStr(oRS("doctitle"), "ISC_URL") > 0 Then
        'Split the doctitle at a tab character
        DocumentInformation = Split(oRS("doctitle"), Chr(9))
        'The document's URL is the first item in the array
        sDocumentURL = DocumentInformation(0)
        'Remove the "ISC_URL=" text in the document URL
        sDocumentURL = Replace(sDocumentURL, "ISC_URL=", "")
        'The document's title is the second item in the array
        sDocumentTitle = DocumentInformation(1)
    End If
    %>

The entire ASP code for the sample search page is available online [7].
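To make the title convention concrete, the round trip can be sketched from both sides. This is a minimal Python illustration under stated assumptions, not the Companion's own code: add_url_to_title and split_doctitle are hypothetical names, and the only details taken from the article are the ISC_URL= prefix and the tab separator.

```python
import re

# Matches the first <title> element, keeping its tags and contents apart.
TITLE_RE = re.compile(r"(<title[^>]*>)(.*?)(</title>)", re.IGNORECASE | re.DOTALL)


def add_url_to_title(html: str, url: str) -> str:
    """Rewrite <title> as 'ISC_URL=<url><TAB><original title>'."""
    def repl(match):
        original = match.group(2).strip()
        return f"{match.group(1)}ISC_URL={url}\t{original}{match.group(3)}"
    return TITLE_RE.sub(repl, html, count=1)


def split_doctitle(doctitle: str):
    """Return (url, title); url is None when the title was never tagged."""
    if not doctitle.startswith("ISC_URL="):
        return None, doctitle
    url_part, _, title = doctitle.partition("\t")
    return url_part[len("ISC_URL="):], title
```

The second function mirrors what the ASP fragment above does with Split and Replace. A tab works as the separator here on the assumption that neither the URL nor the title contains one.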
Searching Binary Files with the Index Server Companion

Index Server is able to index a range of non-HTML content. This includes Microsoft Office documents and, once Adobe’s IFilter is installed, Adobe Acrobat PDF files.

The Index Server Companion is able to retrieve any binary files that it may encounter while crawling a website. A configuration option specifies which binary file types are retrieved. The filenames of the saved files are modified to include the original URL in the filename. So for example, the Adobe Acrobat PDF document at:

    http://www.aspalliance.com/brettb/downloads/TheIndexserverCompanion.pdf

will be saved with the filename:

    o_http^c^b^bwww.aspalliance.com^bbrettb^bdownloads^bTheIndexServerCompanion.pdf

If the file appears in search results, the original URL can be retrieved using the following ASP VBScript:

    <%
    'Extract the URL for other files (e.g. PDF and DOC files)
    If Left(oRS("FileName"), 2) = "o_" Then
        sDocumentTitle = oRS("doctitle")
        sDocumentURL = CreateURLFromFileName(oRS("FileName"))
    End If
    %>

The CreateURLFromFileName function will return the original URL:

    <%
    'Non-HTML files like Adobe Acrobat PDF files and Word
    'documents are stored with their original URLs partially
    'encoded in their filenames. This function will return the
    'original URL of the file.
    'The encoding done by the Index Server Companion removes
    'characters that cannot be present in Windows filenames
    '(these are: \/:*?"<>|)
    Function CreateURLFromFileName(FileName)
        'Remove o_ prefix from URL
        FileName = Mid(FileName, 3, Len(FileName) - 2)
        'Remove other encoded characters
        FileName = Replace(FileName, "^f", "\")
        FileName = Replace(FileName, "^b", "/")
        FileName = Replace(FileName, "^c", ":")
        FileName = Replace(FileName, "^s", "*")
        FileName = Replace(FileName, "^q", "?")
        FileName = Replace(FileName, "^d", Chr(34))
        FileName = Replace(FileName, "^l", "<")
        FileName = Replace(FileName, "^g", ">")
        FileName = Replace(FileName, "^p", "|")
        CreateURLFromFileName = FileName
    End Function
    %>

Unfortunately Index Server stores the filenames of these files (and hence the encoded URLs) in lower case, but a workaround will hopefully be included with the next release of the Index Server Companion.

URLs with Query Strings

There is an option within the Index Server Companion to treat URLs containing Query Strings as distinct URLs. This means that it is possible to index custom built web applications that use the Query String to store data. For example, if an online phone book application had a form that allowed the user to search for users by forename, surname and department, the search results page for a specific user might be accessed using the URL:

    http://www.awebsite.ac.uk/phonebook/showresults.php?forename=John&surname=Smith&dept=Law

The entire phone book could be indexed by the Index Server Companion if it was presented with a start page containing a hyperlink to every user. The advantage of this is that users would be able to see search results of people as well as other content in an integrated search results page. There may also be performance benefits, as searching LDAP directories can often be very time consuming, whereas Index Server returns search results almost instantaneously.
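The effect of the Query String option on a crawl can be illustrated with a short sketch. The Python below is an assumption about how a robot might implement such a switch and is not the Companion's actual logic: crawl_key and unique_urls are hypothetical names, and the flag stands in for the [UseURLQueryStrings] setting shown in the sample configuration file.

```python
from urllib.parse import urlsplit, urlunsplit


def crawl_key(url: str, use_query_strings: bool) -> str:
    """Return the key under which a robot would remember a URL."""
    parts = urlsplit(url)
    query = parts.query if use_query_strings else ""
    # Fragments are never sent to the server, so they are always dropped.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def unique_urls(urls, use_query_strings: bool):
    """Collapse a list of URLs to the distinct pages a robot would fetch."""
    seen = []
    for url in urls:
        key = crawl_key(url, use_query_strings)
        if key not in seen:
            seen.append(key)
    return seen
```

With the flag on, the two WebJobMarket.asp URLs from the feature list count as two pages; with it off, they collapse into one. Leaving the option off by default is a sensible precaution against sites that generate an unbounded number of Query String variations.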
Searching Databases with the Index Server Companion

The Index Server Companion is able to index content from database tables, queries (Microsoft Access) and stored procedures (SQL Server). Database connectivity is achieved through the use of Open Database Connectivity (ODBC), the technology that allows Microsoft Windows to transparently connect to any database that has an ODBC driver. Microsoft Access and SQL Server naturally have ODBC drivers, as do many non-Microsoft databases such as Oracle and FileMaker Pro.

It is of course possible to search databases using Structured Query Language (SQL), but by making use of Index Server Companion, it is possible to integrate database searches with search results from web page searches. There are also other advantages: Index Server contains sophisticated pattern matching syntax, and it is a lot faster at returning search results than an equivalent SQL statement would be when using a database such as Microsoft Access.

The Index Server Companion makes it possible for Index Server to index databases by retrieving the rows of a specified database table and creating an HTML file containing the data from a specific database row. Index Server can then be used to index these HTML files. In search results pages, it is possible to extract the details of the table and row from which the data originated, so that the search results page can be modified to point to the original database data.

A sample page produced from the SQL Server sample pubs database is shown below:

Silicon Valley Gastronomic Treats

In this example, the value of the title table column is optionally used to give the page an HTML title tag, and the notes table column is used for the description meta tag. Each of the custom ISC prefixed meta tags can be queried using Index Server, although to retrieve their contents a minor configuration change to Index Server is required.
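The row-to-page conversion described above can be sketched as follows. This Python fragment is an illustration only: the function name row_to_html, the page layout and the one-meta-tag-per-column rule are assumptions, with only the title, description and ISC prefixed tags taken from the description above. The row is passed in as a plain dictionary so that the sketch does not depend on any particular ODBC library.

```python
from html import escape


def row_to_html(row: dict, title_column: str, description_column: str) -> str:
    """Render one database row as a small HTML page for indexing."""
    metas = [
        f'<meta name="description" content="{escape(str(row[description_column]))}">'
    ]
    # One ISC prefixed meta tag per column, e.g. isc_type for the type column.
    for column, value in row.items():
        metas.append(f'<meta name="isc_{column}" content="{escape(str(value))}">')
    # Assumed layout: each column value becomes a paragraph in the body.
    body = "\n".join(f"<p>{escape(str(value))}</p>" for value in row.values())
    return (
        "<html><head>\n"
        f"<title>{escape(str(row[title_column]))}</title>\n"
        + "\n".join(metas)
        + "\n</head><body>\n"
        + body
        + "\n</body></html>"
    )
```

Writing one file per row keeps incremental updates simple, since only the files for changed rows need to be regenerated and re-indexed.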
The Index Server Companion can also modify the HTML’s <title> tag to include the table name and row ID (the fields are separated by tab characters), e.g.:

    <title>ISC_Table=titles    ISC_KeyField=title_id    ISC_RowNumber=MC2222    Silicon Valley Gastronomic Treats</title>

The following ASP code shows search results for database rows where the ISC_type meta tag (and hence the type column) is “mod_cook”:

    <%@ Language=VBScript %>
    <%
    Dim oQuery
    Dim oRS
    Dim sArticlesPath
    Dim sDataRow
    Dim sDataRow_Table
    Dim sDataRow_Keyfield
    Dim sDataRow_RowNumber
    Dim sDataRow_RowTitle
    Dim sQueryText

    sArticlesPath = "Titles_Simple\SampleContent"
    sQueryText = "#path " & sArticlesPath & " AND NOT #path _vti "
    'Search for content in the specified folder which also have the isc_type meta tag equal to "mod_cook"
    sQueryText = sQueryText & " AND @isc_type mod_cook"

    Const SEARCH_CATALOG = "www.brettb.com"

    Set oQuery = Server.CreateObject("IXSSO.Query")
    'A column must be defined for each custom meta tag that is returned in the search
    'results RecordSet
    oQuery.DefineColumn "isc_type (DBTYPE_WSTR) = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 isc_type"
    oQuery.Catalog = SEARCH_CATALOG
    oQuery.Query = sQueryText
    oQuery.MaxRecords = 200
    oQuery.SortBy = "rank[d]"
    oQuery.Columns = "vpath, doctitle, FileName, Path, Write, Rank"
    Set oRS = oQuery.CreateRecordSet("nonsequential")
    %>

    <%
    If oRS.EOF Then
        Response.Write "No pages were found for the query " & sQueryText
    Else
        Do While Not oRS.EOF
            If InStr(oRS("doctitle"), "ISC_Table") > 0 Then
                'Split the doctitle at the tab characters
                sDataRow = Split(oRS("doctitle"), Chr(9))
                sDataRow_Table = sDataRow(0)
                sDataRow_Keyfield = sDataRow(1)
                sDataRow_RowNumber = sDataRow(2)
                sDataRow_RowTitle = sDataRow(3)
                sDataRow_Table = Replace(sDataRow_Table, "ISC_Table=", "")
                sDataRow_Keyfield = Replace(sDataRow_Keyfield, "ISC_KeyField=", "")
                sDataRow_RowNumber = Replace(sDataRow_RowNumber, "ISC_RowNumber=", "")
            End If
            Response.Write "Database Table: " & sDataRow_Table & "<br>"
            Response.Write "Database Key Field: " & sDataRow_Keyfield & "<br>"
            Response.Write "Database Row Number: " & sDataRow_RowNumber & "<br>"
            Response.Write "Database Row Title: " & sDataRow_RowTitle & "<br>"
            'Construct a URL that can be used to view the database data
            Response.Write "View Row URL: " & sDataRow_RowTitle & "<br>"
    %>

    <%
            oRS.MoveNext
        Loop
    End If
    %>

    <%
    Set oRS = Nothing
    Set oQuery = Nothing
    %>

The disadvantage to using the Index Server Companion to index databases is that it cannot really be used for data that changes frequently, or for very large amounts of data. In these kinds of situations it is possible to use more advanced techniques, such as combining Index Server and SQL search queries using the linked servers feature of SQL Server [8].

Resources

The Index Server Companion is available for online purchase [9]. An evaluation version and the user documentation are also available.

The colour coded and syntax highlighted ASP code in this article was created using the ASP Documentation Tool [10].

Summary and Conclusions

Index Server is a useful solution for creating website search facilities, but has a number of limitations that restrict its effectiveness. Although Site Server 3.0 Search greatly enhances the functionality of Index Server, it is expensive and no longer actively supported by Microsoft. The Index Server Companion is a low cost solution for allowing Index Server to index content from remote websites and ODBC databases, thereby assisting with the development and implementation of comprehensive website searching facilities on the Windows platform.

References

[1] Searching Index Server With ASP, ASPAlliance.com http://www.aspalliance.com/brettb/SearchingIndexServerWithASP.asp
[2] mod_speling Apache httpd module, Apache documentation http://httpd.apache.org/docs/mod/mod_speling.html
[3] Robots Exclusion Standard, specification http://www.robotstxt.org/wc/norobots.html
[4] The Microsoft Site Server Search Facility, Ariadne Issue 19 http://www.ariadne.ac.uk/issue19/nt/
[5] ActivePerl, ActiveState Tool Corp. website http://www.activestate.com/
[6] Index Server Companion, sample search facility http://www.winnershtriangle.com/w/Products.IndexServerCompanion.SampleSearch.asp
[7] Index Server Companion, source code of sample search facility http://www.brettb.com/ASPAlliance/IndexServerCompanion/Sample_Web_ASPCode.html
[8] Creating Combined SQL Server and Index Service Queries, Avonelle Lovhaug http://www.asptoday.com/content.asp?id=295
[9] Index Server Companion, website http://www.winnershtriangle.com/w/Products.IndexServerCompanion.asp
[10] ASP Documentation Tool, website http://www.winnershtriangle.com/w/Products.ASPDocumentationTool.asp

Author Details

Brett spent two years working in the University of Essex Computing Service, before moving to The Internet Applications Group in the Autumn of 1999, where he developed e-Business applications for a range of corporate clients and dot-com start-ups.

Brett is presently employed as an Internet developer and technical writer through his own company, Winnersh Triangle Web Solutions Limited. The company produces a number of innovative products, including the popular ASP Documentation Tool and the Index Server Companion. The company is also available for web application development, primarily using Microsoft technologies (ASP, Visual Basic, SQL Server) but also using open source technologies such as PHP, MySQL and Perl. Specialist services include development of search solutions using Microsoft’s Index Server and Site Server 3.0 Search.

As well as Ariadne, Brett has written technical articles for Wrox’s ASPToday, and ASPAlliance.com.

Brett Burridge
Web: http://www.brettb.com/
Mobile: +44 (0)7775 903972

Article Title: “Windows Explorer: The Index Server Companion”
Author: Brett Burridge
Publication Date: 10-Oct-2002
Publication: Ariadne Issue 33
Originating URL: http://www.ariadne.ac.uk/issue33/nt-explorer/

Ariadne is published by Loughborough University Library © Ariadne ISSN: 1361-3200.