The Code4Lib Journal – Issue 3, 2008-06-23

Making Patron Data Work Harder: User Search Terms as Access Points?

Montana State University (MSU) Libraries are experimenting with reusing patron-generated data to create browseable access points for the Electronic Theses and Dissertations (ETD) collection. A beta QueryCatcher module logs recent search terms and the number of associated hits. These terms are used to create browseable lists and tag clouds which enhance access to the ETD collection. Gathering and reusing information about user behavior is an emerging trend in web application development. This article outlines MSU Libraries' reasoning for moving toward a user-generated model and provides a complete walkthrough of the steps in building the application, with example code.

By Jason A. Clark

Introduction

Whether it is a respect for user privacy or a reluctance to give up control, libraries have traditionally shied away from harvesting or "reusing" data collected from our users. With the rise of social Web 2.0 applications, some interesting uses of user data have begun to appear. Popular web sites such as Flickr and del.icio.us offer large-scale examples of user-generated content (e.g., the tagging systems and the social sharing functionality that exist between the users of these systems) working to improve user access to content. Within the library environment, we often see applications and organizational systems that take the top-down approach of assigning access points a priori. At Montana State University (MSU) Libraries, we started to ask ourselves: "Does it have to work like this all the time? What kinds of access might we be restricting by forcing users to navigate and learn our strict knowledge hierarchies?" To this end, we have started to rethink the ways in which our collected user data can open up access to our collections.

The Initiative: QueryCatcher

In his account of the rise of Google, John Battelle introduces the concept of a "database of intentions" to analyze the role that search engines play as the collectors of humanity's curiosity, exploration, and expressed desires. [1] Battelle is speaking about the search terms we use and how those terms are recorded, logged, and then mined for insights by Google. It was exactly this idea of collecting knowledge that suggested a new possibility for our own library search engines. [2]

In keeping with Battelle's concept, we introduced a new component to our MSU Electronic Theses and Dissertations (ETD) web application (http://etd.lib.montana.edu/etd/view/). We created a search term recorder module, beta name "QueryCatcher," that logs recent searches and uses them to create user access points in both a standard search list and a "SearchCloud" format.

The search list view (Figure 1) shows the 50 most recent searches recorded in the system, including the number of hits for each search and the number of times the term has been browsed or searched. You can see it in action at http://etd.lib.montana.edu/etd/view/searches.php.

Figure 1: Search List View

The "SearchCloud" view (Figure 2) shows the same 50 most recent searches, with each term visually weighted according to the number of times it has been browsed or searched. You can see it in action at http://etd.lib.montana.edu/etd/view/searchcloud.php.

Figure 2: "SearchCloud" View
Our idea was to provide more search transparency for our users and for ourselves. Frequently, ETD searching was restricted to exploratory, single-term searches. Some of this user behavior was a function of the traditional searching and browsing environment we had set up for the ETD application. (The traditional search interface is a series of web forms and browseable links based on predetermined controlled vocabularies.) We wanted to let users see which search terms were getting results, so that they could refine those terms for their own purposes, and to make visible the most popular terms, which are invisible to our standard web analytics for the application. Recording the actions of our users and turning that data into access points was an experiment in enabling our users to create their own browseable access to the collection.

Unobtrusive Harvesting

One of the experiments of the project has been the concept of releasing the user from any effort to create the access points. The ETD application already has a tagging module, which has met with mixed success. We are finding that ownership and real participation in tagging items can be hard to foster in a local system. With this in mind, we decided to go in a different direction with QueryCatcher. It works in the background to collect users' search queries and places the onus on our system to make things happen. The goal was to allow users to create value without really knowing it. While this technique might raise concerns about user privacy, it is possible to take advantage of patron data in an anonymous fashion (these concerns will be addressed as part of "Complications and Limitations"). The "invisibility" of this process is a large reason it has proven much more successful than our user tagging module.

Implementation: Making It Work

At MSU Libraries, our library systems are built on the LAMP stack, so I'm going to walk you through the steps needed to create the QueryCatcher application using PHP, MySQL, CSS, and xHTML for markup. [3] I will describe the pieces of the application in terms of a data model layer, an application logic layer, and a display view layer. It's my hope that by using these generic terms, other developers will be able to understand the function of each piece of the application and apply it to their programming environment of choice.

1. Data Model Layer

Our first step is to create the database table that will log the queries. Think of this stage as setting up the data model layer. Here's the CREATE TABLE syntax for the table which will record the queries coming into the system:

--
-- Table structure for table `log`
--

DROP TABLE IF EXISTS `log`;
CREATE TABLE `log` (
  `query_id` int(16) NOT NULL auto_increment,
  `query_data` varchar(255) default NULL,
  `query_results` int(6) default NULL,
  `query_time` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  `query_referrer` varchar(255) default NULL,
  PRIMARY KEY (`query_id`),
  FULLTEXT KEY `full_index` (`query_data`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

There's nothing too unusual here: the table contains fields to collect the query, the number of hits for the query, the referring page, and a timestamp to help us organize the data for display. The syntax is SQL as represented in MySQL. [4]
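Before moving on to the logic that writes to this table, here is a minimal illustration of the kind of query the display views described earlier could run against it. This is a sketch, not the ETD application's actual code, and the connection settings are placeholders:

<?php
// Sketch only: how a "recent searches" view might read the log table.
// Connection details below are placeholders, not MSU's actual settings.
$db = mysql_connect('localhost', 'etd_user', 'etd_password');
mysql_select_db('etd', $db);

// Group repeated terms, keep the latest hit count, count reuses,
// and order by most recent use -- 50 terms, as in the live views.
$recent = mysql_query(
    "SELECT query_data,
            MAX(query_results) AS hits,
            COUNT(*) AS times_used,
            MAX(query_time) AS last_used
     FROM log
     GROUP BY query_data
     ORDER BY last_used DESC
     LIMIT 50"
);

while ($row = mysql_fetch_assoc($recent)) {
    echo $row['query_data'] . ' (' . $row['hits'] . ' hits, used '
        . $row['times_used'] . " times)\n";
}
?>

Grouping on query_data means a term that has been searched many times shows up once, with a count, which is exactly the shape of data both the search list and the "SearchCloud" need.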
2. Application Logic Layer

Our second step is to create the logic and INSERT statement that will search and record query data. At this stage, we are talking about the application logic layer, which provides the searching and recording actions for the ETD application. Here's a snippet of the processing code from the results.php file:

case 1:
    if ($bKeyword) {
        // run the fulltext search against the item table
        // (note: in production, $keyword should be escaped first to
        // guard against SQL injection; see the sketch below)
        $result = mysql_query("SELECT * FROM item WHERE item_status='a'
            AND MATCH (dc_title, dc_creator, dc_description, dc_contributor)
            AGAINST ('+$keyword*' IN BOOLEAN MODE)");
        $num_rows = mysql_num_rows($result);
        if ($num_rows == 0) {
            noMatches();
        } else {
            // log the search term: trim whitespace and store the query,
            // its hit count, and the referrer in the "log" table
            $logQuery = @mysql_query("INSERT INTO log SET
                query_data='".trim($keyword)."',
                query_results='".$num_rows."',
                query_referrer='".$_SERVER['HTTP_REFERER']."'");
            if (!$logQuery) {
                die('<p>Error adding query to log: ' . mysql_error() . '</p>');
            }
            echo '<p>Your search for Keyword '.stripslashes($keyword).'
                resulted in '.$num_rows.' match(es).</p>'."\n";
        }
    }

In this example, we use a PHP switch statement to check for the existence of the keyword variable. If the keyword variable exists and retrieves results, we record the number of results in the $num_rows variable. Next, we pass those values through some logic checks and trim functions (for cleanup) and insert them into the "log" table to record the query data, the referrer, and the total hits. It should be noted that the variables passed to the application logic layer come from a simple web form; we'll build that form in the next section.
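One caveat before moving on: as written, $keyword and the referrer are interpolated directly into the SQL strings. A minimal hardening sketch, our addition for illustration rather than part of the original QueryCatcher module, uses PHP's mysql_real_escape_string():

<?php
// Hardening sketch -- our addition, not part of the original module:
// escape user-supplied values before building the INSERT string.
// Assumes an open mysql connection, as in results.php above.
$keyword  = mysql_real_escape_string(trim($_POST['keyword']));
$referrer = isset($_SERVER['HTTP_REFERER'])
    ? mysql_real_escape_string($_SERVER['HTTP_REFERER'])
    : '';

$logQuery = mysql_query("INSERT INTO log SET
    query_data='" . $keyword . "',
    query_results='" . (int) $num_rows . "',
    query_referrer='" . $referrer . "'");
?>

Casting the hit count to an integer and escaping the two strings closes the most obvious injection routes without changing the module's behavior.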
3. Display View Layer

Our third step is to create the display view layer, which includes hooking the ETD search form into the "log" table and providing several displays for the recorded search data.

First, the search form from the main page of the ETD application must be hooked into the "log" table. We're talking here about basic user interface (UI) design and one of the first components of the display view layer. We use post data from an xHTML form, stored as variables, and pass those variables along to our processing script. The important piece to note here is the form's action attribute, as it is the glue that binds the display view layer to the application logic layer. A minimal version of that form looks like this (a sketch; the post method, the keyword field name, and the results.php target are taken from the description here, while the rest of the markup is illustrative):

<form method="post" action="results.php">
  <input type="text" name="keyword" id="keyword" />
  <input type="submit" value="Search" />
</form>

As a user enters a search term in the web form, the data is passed as the keyword variable to the results.php file, which allows for a search of the database and the recording of the query.

The next piece of our display view layer involves showing the search query data. As mentioned earlier, we currently provide two views: a search list view, which shows the number of hits and the number of times each term has been browsed or searched, and a "SearchCloud" view, where terms are visually weighted according to the number of times they've been used. The xHTML markup for the pages is pretty straightforward.
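To make the "SearchCloud" weighting concrete, here is a rough sketch of how use counts can be mapped onto font sizes. This is our own illustration under stated assumptions, not the ETD application's production code; the size scale, element ids, and link format are invented for the example:

<?php
// Sketch: weight each term by how often it has been used, mapping
// use counts onto a range of font sizes. Assumes the same open
// mysql connection and "log" table as above; the 100%-250% scale
// and the searchcloud id are our own invention.
$terms = mysql_query(
    "SELECT query_data, COUNT(*) AS times_used
     FROM log
     GROUP BY query_data
     ORDER BY MAX(query_time) DESC
     LIMIT 50"
);

// collect the rows and find the largest count for scaling
$rows = array();
$max  = 1;
while ($row = mysql_fetch_assoc($terms)) {
    $rows[] = $row;
    $max = max($max, (int) $row['times_used']);
}

echo '<div id="searchcloud">';
foreach ($rows as $row) {
    // scale between 100% and 250% of the base font size
    $size = 100 + (int) (150 * $row['times_used'] / $max);
    echo '<a href="results.php?keyword=' . urlencode($row['query_data']) . '"'
        . ' style="font-size:' . $size . '%">'
        . htmlspecialchars($row['query_data']) . '</a> ';
}
echo '</div>';
?>

Each term links back into the search pipeline, so a logged query becomes a browseable access point: clicking a cloud term reruns the search, which in turn re-logs it and increases its weight.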