The Code4Lib Journal – Issue 35, 2017-01-30

Building a Scalable and Flexible Library Data Dashboard

Data dashboards provide libraries with the means to demonstrate their ongoing activities and usage in an engaging and communicative fashion. Yet, due to the number of service platforms used by libraries, and the wide-ranging technical specifications they entail, bringing all of this content together in a sustainable way is a significant challenge. This article describes Portland State University's project to design and build a data dashboard based on a scalable and flexible infrastructure that would enable them to present data in a visually compelling and dynamic interface.

Introduction

Data dashboards are an excellent way for libraries to provide an at-a-glance view of their many ongoing activities. A well-organized, well-built data dashboard is easily understood, narrates a coherent story about the library, and provides readily identifiable metrics and data points that can inform evaluation and decision-making activities. In addition, because they can include a broad spectrum of data sources, from circulation data to collection counts to server logs, dashboards make it possible to accomplish two things that are of particular value for libraries:

- Illustrate the full extent of a library's activities: visually representing these disparate data points alongside one another helps to thoroughly illustrate the wide range of activities that libraries support. This can be very helpful in demonstrating the library's impact and value to campus and community stakeholders.
- Uncover interconnections between data: dashboards offer an opportunity to directly relate different data points to one another, and in doing so uncover previously unseen relationships between them. For example, by overlaying gate count with website visits, it is possible to illustrate the relationship between the two types of traffic.

Yet taking advantage of these potential benefits is challenging. While the library generates a huge amount of data, that data is far from consistent in its form or in the systems that house it. As a result, there are three principal problems that libraries face in building a dashboard:

- Disparate data sources: each library uses a number of different systems for providing services. Examples include the integrated library system, interlibrary loan software, gate count trackers, web server logs, etc. Each of these systems requires a unique approach for accessing and merging the data programmatically.
- Heterogeneous data: the data structure is unique to each of the source systems that house it. Representing this data in a holistic fashion requires normalizing it across systems.
- Accessing the data: each system provides different mechanisms for accessing the data. While some systems provide APIs for retrieving data, or the ability to export reports, in some cases you may need to screen-scrape web pages to extract the data.

In spring 2016, the Portland State University Library undertook a project to engage with these challenges and build a public-facing data dashboard and the technical infrastructure to support its ongoing maintenance. The data dashboard is part of the library website, and is aimed at illustrating the Library's broad range of activities and the relationships between them.
Most importantly, it is a first step towards enabling the Library to more effectively analyze the vast range of data available to it.

Portland State University Library's Dashboard Project

As a first step in tackling this project, the Library included the following goal in its 2015-2016 annual plan: "Develop the technical infrastructure and workflows for collecting and reporting on Library metrics across systems and units." The Library Technologies unit was tasked with leading work on this goal, and began work on it midway through the year. Given the broad nature of the goal and the sheer size of the data challenge that the library faced, the scope for the project was quickly narrowed down to the creation of a data dashboard for the library website. The presentation of the library's annual data on the website had always been rather simplistic, with tables and bulleted lists including a limited number of facts and figures from the previous fiscal year:

Figure 1. Screenshot of previous library statistics page.

Therefore, focusing on this specific goal enabled the project to achieve two good outcomes in one step: 1) replacing the current content with a more compelling and engaging dashboard, and 2) developing the infrastructure for managing and reporting on library data in a more extensive way in the future.

The project got underway with the formation of the project team, which included Library Technologies' two developers (Mike Flakus and Chris Geib), our content strategist (Sherry Buchanan), and myself, the Manager of Library Technologies. At the outset, several decisions were made to focus the project's scope in order to meet our brief six-month timeline. Specific decisions included:

- The audience for the dashboard was defined as campus stakeholders (administrators, students, staff) and Library administrators.
- The dashboard would be focused on including data that could be harvested on a monthly basis and that was of enduring value to the Library.
- The process for harvesting and storing data had to be sustainable in terms of ongoing staff time to maintain the system, and so needed to be automated and reuse application modules wherever possible.
- The infrastructure for storing, making available, and presenting the data had to be as scalable as possible, so it needed to be possible to add new volumes of data and new data sources with minimal effort.
- The interface needed to be visually compelling and as interactive as possible.
- Data reflecting all of the Library's principal public-facing services needed to be included, such as circulation and electronic resource usage, reference and instruction, and both online and in-person traffic.
- The dashboard would be focused on ongoing, repeated reporting, as opposed to ad-hoc reporting.

With the scope defined, the project team broke the overall project down into discrete phases, to best structure our work and map out our timeline:

1. Identifying data sources, evaluating available data, and selecting target data sources
2. Selecting the data store, defining the schema, and harvesting data
3. Building the user interface
4. Extracting and caching the data for repeated use in production

Step 1: Data sources

Our first step was to identify the data sources that we would harvest information from. As with most libraries, the Portland State University Library makes use of a number of different systems to support all of our services, each of which is managed differently. Some are vendor-hosted, some locally hosted; some have APIs while others don't; and so on.
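As a rough illustration of what that variety means for harvesting code, the sketch below contrasts the three access patterns we ended up needing. The function names, endpoints, connection strings, and selectors are hypothetical placeholders, not our actual configuration; this is a sketch of the general shapes, not our production connectors.

```python
# A rough sketch of the three access patterns described above.
# All URLs, connection strings, and selectors are hypothetical placeholders.
import requests                    # vendor APIs (e.g., a hosted reporting service)
import pyodbc                      # direct SQL queries (e.g., a locally hosted SQL Server database)
from bs4 import BeautifulSoup      # screen-scraping pages that offer no API or reports


def fetch_from_api(url, api_key):
    """Pattern 1: the system exposes a reporting API over HTTP."""
    response = requests.get(url, params={"apikey": api_key})
    response.raise_for_status()
    return response.text


def fetch_from_database(connection_string, sql):
    """Pattern 2: we can query the application's database directly."""
    with pyodbc.connect(connection_string) as conn:
        return conn.cursor().execute(sql).fetchall()


def fetch_by_scraping(url, css_selector):
    """Pattern 3: no API or exportable reports, so scrape the rendered page."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return [element.get_text(strip=True) for element in soup.select(css_selector)]
```

In practice, each pattern ends up wrapped in a source-specific connector, as described under Step 2 below.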
The sources we chose fit each of these models, giving us an opportunity to develop tools for working with almost any data source. The specific data sources were:

- Alma integrated library system and Primo discovery system: the Alma/Primo system is vendor-hosted, and includes an analytics engine that has a dedicated API for extracting reports
- Google Analytics: vendor-hosted, but offers an API for reports
- ILLiad interlibrary loan: this application is locally hosted, and while it does not offer an API, we can query its SQL Server database directly
- Gate count: a locally developed application for storing gate count statistics, whose database we can query directly
- Study room booking system: a locally developed application for reserving and checking out study rooms, whose database we can query directly
- Libstats: a locally hosted open source application that we use for tracking reference statistics, whose database we can query directly
- Instruction stats: a locally developed application for tracking instruction session statistics, whose database we can query directly
- PDXScholar: the Library's institutional repository, which is vendor-hosted and doesn't offer an API or extractable reports. Instead, we needed to screen-scrape to get the data that we wanted.

Each of these systems houses a wide range of data. But for the purposes of the dashboard, we focused on data that would clearly demonstrate high-level trends over time, and that would paint a broad picture of the Library's ongoing activities. So after discussion with a range of stakeholders, we settled on including monthly data for each of the following data sets:

- Website visits for all of our sites (e.g. main website and all subsites, Libguides, PDXScholar, Primo discovery system) (Google Analytics)
- Walk-in (gate count) traffic (Gate count app)
- Study room checkouts (Study room app)
- Overall item checkouts (Alma)
- Item checkouts by specific item types (Alma)
- Consortial borrowing counts (Alma)
- Interlibrary loan borrowing counts (ILLiad)
- Reference question counts by question type (Libstats)
- Student counts for instruction sessions (Instruction stats app)

We also determined that the dashboard would include a table of annual usage numbers, which would be derived from the harvested data, and a chart illustrating the Library's collections. This last one would be hard-coded, since the data was collected annually. We also wanted to include information reflecting the usage of our electronic resources, but at the time of implementation that data was not available programmatically (though we were able to add it by hard-coding it into the system).

Step 2: Data storage and harvesting

With the data sources selected, we turned our attention to how best to store the data itself. Our initial schema for the data, based on our vision for how we would use it, looked like this:

- Name: the descriptive name of the data being collected, which would be used for display purposes on the dashboard
- Counts: the harvested count(s) data
- Month: the month the data was accrued
- Year: the year the data was accrued
- Tags: descriptive term(s) for categorizing the data, which would be used for grouping related data sets together

The counts and tags fields were especially challenging to handle when thinking about our storage mechanism.
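To make that challenge concrete, here is a rough sketch of two data points as they might sit side by side under this schema. The field names mirror the list above, but the figures are invented purely for illustration.

```python
# Two hypothetical records under the working schema above (values invented).
# The first carries a single count; the second carries several granular counts,
# so the store cannot assume a fixed number of count columns per record.
website_visits = {
    "name": "Website Visits",
    "counts": {"Website Visits": 63000},
    "month": 7,
    "year": 2016,
    "tags": ["Website", "Traffic"],
}

circulation = {
    "name": "Materials Circulated",
    "counts": {"Book": 18000, "DVD": 1200, "Laptop": 900},
    "month": 7,
    "year": 2016,
    "tags": ["Circulation"],
}
```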
While some data points would include a single number for counts, such as "website visits," others would include more granular data and multiple numbers for the counts, such as circulation data including "book checkouts, dvd checkouts, laptop checkouts, …" As a result, the storage mechanism needed to be flexible enough to accommodate a variable number of granular counts, with no upper limit. The same was true for the tags field, which would need to accommodate a variable number of values for any given data point. Tags would be one of the critical aspects of the system, as we planned to use them to dynamically build groups of data sets and related visualizations.

Up until this point, we had principally used MySQL for all of our data storage, while recently adding Memcached and Redis for specific applications. But none of these three solutions was optimal for this use case, due to the variable nature of the data described above. So, for maximum flexibility, we opted to use MongoDB (https://www.mongodb.com/), one of the many "NoSQL" databases that have risen to prominence in the last several years. MongoDB allows data to be added to the system without a predefined schema, and has a robust query interface for retrieving data. And it can scale to support as much data as we could possibly throw at it.

With MongoDB selected, we designed our data model for the initial implementation. MongoDB stores data sets as "documents" that are part of a "collection" (akin to MySQL's rows and tables), and so the document for a set of harvested data would use the following schema (this excludes the automatically included fields, all of which are denoted with a leading underscore in the name, e.g. _id, in the examples below):

```
Document
    name: (string)
    raw_data: (hash)
        [data point name (string)]: [data point value (integer)]
        …
    month: (integer)
    year: (integer)
    tags: (array)
        [tag name (string)]
        …
```

Here are a couple of examples of how this schema plays out for specific data points.

Data collected with a single raw_data count:

```json
{
    "_created": "Fri, 02 Dec 2016 18:08:52 GMT",
    "_etag": "ea2e8a6078cf32a4ebedaadf7b4a43908a5d40e5",
    "_id": "5841b8b43a3d5f05b79c71aa",
    "_links": {
        "self": {
            "href": "library_stats/5841b8b43a3d5f05b79c71aa",
            "title": "name"
        }
    },
    "_updated": "Fri, 02 Dec 2016 18:08:52 GMT",
    "month": 7,
    "name": "Website Visits",
    "raw_data": [
        { "Website Visits": "63817" }
    ],
    "tags": [
        "Website",
        "Traffic"
    ],
    "year": 2014
}
```

Data collected with multiple raw_data counts:

```json
{
    "_created": "Fri, 02 Dec 2016 18:09:20 GMT",
    "_etag": "fc55dca90348b7fe20751ec4be676b734da61d04",
    "_id": "5841b8d03a3d5f05b79c7257",
    "_links": {
        "self": {
            "href": "library_stats/5841b8d03a3d5f05b79c7257",
            "title": "name"
        }
    },
    "_updated": "Fri, 02 Dec 2016 18:09:20 GMT",
    "month": 11,
    "name": "Reference Questions Answered",
    "raw_data": [
        { "Email": 164 },
        { "Chat/Text": 344 },
        { "In Person": 2066 },
        { "Phone": 139 }
    ],
    "tags": [
        "Reference"
    ],
    "year": 2016
}
```

While MongoDB doesn't enforce a schema on its own, we did need to ensure consistency in the centrally stored data, and so we adopted a Python library named Eve (http://python-eve.org/) to function as middleware for enforcing consistency. Eve enables you to define a schema for each collection in your MongoDB database, and adds a REST API in front of the database that validates data as it's saved, before pushing it to MongoDB.
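As an illustration, here is a minimal sketch of what writing one of the documents shown above through that API could look like. It assumes the Eve application runs locally on Flask's default port, exposes the library_stats resource that appears in the _links above, and has POST enabled for that resource; those specifics are assumptions for the sake of the example, not details from our deployment.

```python
import requests

# Hypothetical Eve endpoint: the library_stats resource name comes from the
# document examples above, but the host and port are assumptions.
EVE_ENDPOINT = "http://localhost:5000/library_stats"

record = {
    "name": "Website Visits",
    "raw_data": [{"Website Visits": "63817"}],
    "month": 7,
    "year": 2014,
    "tags": ["Website", "Traffic"],
}

# Eve validates the payload against the collection's schema before writing it
# to MongoDB; a schema violation comes back as a 422 response instead of a save.
response = requests.post(EVE_ENDPOINT, json=record)
if response.status_code == 201:
    print("stored:", response.json()["_id"])
else:
    print("rejected:", response.status_code, response.text)
```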
Using Eve, data is saved to the database via the Eve API instead of being passed directly to the database, and data that does not meet the Eve-defined schema will be rejected. Eve lets us be sure that saved data will have a consistent schema, while still benefiting from the flexibility in data structures that MongoDB offers. In other words, while the schema sets a structure for the data, within that structure we can proliferate data as needed. For example, the raw_data section of each document stores a hash of key-value pairs describing the data from the source system. In practice, this can store any number of hashes, describing sets and subsets of data as needed. Eve allows us to ensure that the data being saved to raw_data is present and is a hash, but within that relatively broad definition, we have the flexibility to add as many hashes as are appropriate for the data set.

Schema definition in Eve:

```python
schema = {
    'name': {
        'maxlength': 128,
        'minlength': 1,
        'required': True,
        'type': 'string',
        'unique': False,
    },
    'tags': {
        'maxlength': 64,
        'minlength': 1,
        'required': True,
        'schema': {
            'maxlength': 64,
            'minlength': 1,
            'type': 'string',
        },
        'type': 'list',
        'unique': False,
    },
    'raw_data': {
        'minlength': 1,
        'required': True,
        'schema': {
            'type': 'dict',
            'required': True,
        },
        'type': 'list',
    },
    'month': {
        'max': 12,
        'min': 1,
        'required': False,
        'type': 'integer',
    },
    'year': {
        'min': 2000,
        'required': False,
        'type': 'integer',
    },
}
```

With the database and data validator in place, we wrote mechanisms that would pull data from each of the systems listed above and then pass it to the Eve API to be stored in the database. As noted previously, each system had different mechanisms for accessing the data, most using APIs or queries directly against the database. But the common element in each process was formatting the data for the API. Here's an example of what that looks like when using Python to format the data from Alma; the key section is the block that appends each formatted record to library_stats:

```python
from datetime import date

from bs4 import BeautifulSoup


def reformat_to_json(report_data, name, month_filter, key='key', tags=[]):
    soup = BeautifulSoup(report_data, features='xml')
    this_year = date.today().year
    this_month = date.today().month - 1
    group_name = "circulation"
    type_name = "circulation_all"
    library_stats = []
    for row in soup.find_all('Row'):
        material_type = str(row.Column1.string)
        if material_type != 'Other':  # We don't want type Other
            month = str(row.Column3.string)
            year = str(row.Column4.string)
            key = str(row.Column5.string)  # 'key' argument is ignored
            value = str(row.Column6.string)
            if month_filter:
                if month == str(this_month) and year == str(this_year):
                    library_stats.append({
                        'name': name,
                        'tags': tags,
                        'raw_data': [
                            {key: value}
                        ],
                        'month': month,
                        'year': year,
                    })
    return library_stats
```

In this example, the data stored in the library_stats object is passed through the Eve API and stored in MongoDB. We opted to write individual connectors for each data source, with each connector capable of pulling multiple data sets as needed. For example, if we queried the Alma API for multiple data sets (e.g. checkouts by type, and resource sharing requests), the same connector would be used for collecting both and pushing the data to the database, and the above code block could be used for processing the data from each individual query to Alma.
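With every connector depositing its records in the same central store, pulling data back out becomes just as uniform. As a rough sketch (the endpoint, host, and filter values are illustrative, as in the earlier example), a consumer could ask the Eve API for every data set carrying a given tag, which is one way the tags field can support grouping related data sets:

```python
import json

import requests

# Hypothetical Eve endpoint, matching the earlier sketch.
EVE_ENDPOINT = "http://localhost:5000/library_stats"

# Eve passes MongoDB-style filters through the `where` query parameter,
# so a tag can pull back a whole group of related data sets at once.
query = {"tags": "Circulation", "year": 2016}
response = requests.get(EVE_ENDPOINT, params={"where": json.dumps(query)})
response.raise_for_status()

for document in response.json()["_items"]:
    print(document["name"], document["month"], document["raw_data"])
```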
Step 3: Building the dashboard

Principal goals for the dashboard were for it to be both visually compelling and able to present data in an analytically interesting way. We had used the Google Charts API in the past, which offers a nice set of tools for formatting and displaying data. But after doing some research into other options, we quickly settled on the C3 JavaScript library (http://c3js.org/). C3 is built on top of the D3 JavaScript library, which is great for creating rich, dynamic charts. Yet D3 can be challenging to use on its own, which is where C3 comes in. C3 simplifies building charts with D3 by abstracting away the JavaScript code needed to construct the chart itself: you simply pass C3 your data and the chart type (you can see examples on the C3 site), and it builds the chart from there.

With the chart library selected, we set about embedding the dashboard in our website. To do this, we developed shortcodes for each chart. Each shortcode includes all of the variables needed to construct and display the chart, and can be used to dynamically build the chart and place it in the appropriate spot on the page. For example, this is a shortcode for including the circulation count data:

```
[chart name="MaterialsCirculated" type="bar" title="Materials Circulated" caption="Users borrow more than 28,000 PSU-owned items in certain months."]
```

In this example, the variables will be parsed as:

- chart name="MaterialsCirculated" is the name of the tag that will be used to wrap the chart
- type="bar" indicates that the chart will be a bar chart
- title="Materials Circulated" is the text that will be used as the label for the chart
- caption="Users borrow…" is the text that will be used as the caption for the chart

In turn, a PHP script parses the variables and builds the block of HTML and JavaScript that will embed the chart on the page:

```php
function chart($attr){
    $result = "";
    if(isset($attr['title']))
        $attr['title'] = str_replace("&amp;", "&", $attr['title']);
    if(isset($attr['name'])) {
        switch($attr['name']) {
            case 'walkin_v_website':
                $result = '
```