The Code4Lib Journal
Issue 32, 2016-04-25

An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter

This article examines the tools, approaches, collaboration, and findings of the Web Archives for Historical Research Group around the capture and analysis of about 4 million tweets during the 2015 Canadian Federal Election. We hope that national libraries and other heritage institutions will find our model useful as they consider how to capture, preserve, and analyze ongoing events using Twitter. While Twitter is not a representative sample of broader society – Pew Research shows in their study of US users that it skews young, college-educated, and affluent (above $50,000 household income) – Twitter still represents an exponential increase in the amount of information generated, retained, and preserved from 'everyday' people. Therefore, when historians study the 2015 federal election, Twitter will be a prime source.

On August 3, 2015, the team initiated both a Search API and Stream API collection with twarc, a tool developed by Ed Summers, using the hashtag #elxn42. The hashtag referred to the election being Canada's 42nd general federal election (hence 'election 42' or elxn42). Data collection ceased on November 5, 2015, the day after Justin Trudeau was sworn in as the 42nd Prime Minister of Canada. We collected for a total of 102 days, 13 hours, and 50 minutes.

To analyze the data set, we took advantage of a number of command line tools and utilities that are available within twarc, twarc-report, and jq. In accordance with the Twitter Developer Agreement & Policy, and after ethical deliberations discussed below, we made the tweet IDs and other derivative data available in a data repository. This allows other people to use and cite our dataset, and to enhance their own research projects by drawing on #elxn42 tweets.

Our analytics included: breaking tweet text down by day to track change over time; client analysis, allowing us to see how the scale of mobile devices affected medium interactions; URL analysis, comparing both to Archive-It collections and the Wayback Availability API to add to our understanding of crawl completeness; and image analysis, using an archive of extracted images. Our article introduces our collecting work, ethical considerations, the analysis we have done, and provides a framework for other collecting institutions to do similar work with our off-the-shelf open-source tools. We conclude by ruminating about connecting Twitter archiving with a broader web archiving strategy.

by Nick Ruest and Ian Milligan

Introduction

During the 2015 Canadian federal election, we captured 3,918,932 tweets written using the #elxn42 hashtag: thoughts on the nature and stature of political candidates or parties, live running commentary during the leaders' debates, exhortations to vote, and witty ripostes or jokes to liven up the long campaign. Political scientists, journalists, and other researchers can use these tweets as evidence of sentiment amongst a certain slice of the electorate: did a policy go over well? Did it not? Which tweets get retweeted, or further shared, and which ones do not? If these are questions that resonate amongst contemporary researchers, historians are also interested in the long-term preservation of digital material.
Tweets, as well as the much broader scope of archived webpages and born-digital data, are the primary sources of tomorrow. Tweets present considerable advantages, in that they preserve the voices of everyday people that might not otherwise be saved, but also considerable challenges in the collection and use of data on such a large scale. If the norm until the digital era was to have human information vanish, "now expectations have inverted. Everything may be recorded and preserved, at least potentially" (Gleick, 2012). Useful historical information is being preserved at mind-boggling rates that continue to accelerate. IBM Research, for example, notes that "every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone." (IBM Research, 2016)

This data has the potential to reshape multiple avenues of historical research. In the case of the #elxn42 hashtag, we have access to the tweets of some 318,176 unique users (which would include some bots and spam accounts, of course). Consider what the scale of this dataset means. Social and cultural historians will have access to the thoughts, behaviours, and activities of everyday people, the sorts of voices that are not generally preserved in the record. Military historians will have access to the voices of soldiers, posting from overseas missions and their bases at home. And political historians will have a significant opportunity to see how people engaged with politicians and the political sphere, during elections and between them.

The scale boggles. Modern social movements, from the Canadian #IdleNoMore protest focusing on the situation of First Nations peoples to the global #Occupy movement that grew out of New York City, leave the sorts of interactions that would rarely, if ever, have been recorded by previous generations. During the #IdleNoMore protest, for example, Twitter witnessed an astounding 55,334 tweets on 11 January 2013. If we take the median length of a tweet (60 characters), the average length of a word (5 characters plus a space), and assume 300 words per page, we are looking at over 1,800 pages of text (55,334 tweets × 60 characters ≈ 3.3 million characters; at six characters per word that is roughly 553,000 words, or about 1,844 pages at 300 words per page). This for a single day of a single social movement in the relatively small country of Canada.

Twitter is certainly not a representative sample of broader society: Pew Research shows in their study of US users that it skews young, college-educated, and affluent (above $50,000 household income). We need to keep the demographic limitations of this source base in mind, as we do with all source bases. This is not a random sample of Canadian society, but a self-selecting portion of it (as with many non-digital archival collections as well). As a record of society, Twitter certainly suffers from selection bias. Yet Twitter – and other web archives – will still represent an exponential increase in the amount of information generated, retained, and preserved by everyday people. Therefore, when historians study the 42nd federal election, we believe that Twitter will be an important source.

Twitter's significance calls out for active preservation. Once an event has happened and a small window of time has passed – seven to nine days – the tweets become largely inaccessible at scale without considerable monetary resources. While the Library of Congress archives tweets, it remains unclear how its access regime will work.
Yet using a combination of several open-source tools, librarians, archivists, and other researchers can do the following: create their own Twitter archives using twarc; analyse tweets using twarc-report and twarc's utilities; visualize the material; use Twitter as a launchpad for further web archiving activities; and share tweet IDs with an eye to sharing collections in accordance with the Twitter Developer Agreement & Policy. This article walks users through these five steps, with an eye to presenting this as a model for other forms of analysis. Libraries, spread across the world, can collect hashtags of local or national significance, taking a step towards the more widespread preservation of today's cultural record.

Creating your Own Twitter Archive: Data Collection

The Web Archives for Historical Research Group began capturing #elxn42 tweets on August 3, 2015 with twarc. "twarc is a command line tool and Python library for archiving Twitter JSON data. Each tweet is represented as a JSON object that is exactly what was returned from the Twitter API. Tweets are stored as line-oriented JSON. Twarc runs in three modes: search, stream and hydrate. When running in each mode twarc will stop and resume activity in order to work within the Twitter API's rate limits." (Summers, et al, 2015)

On August 3, the team initiated both a Search API and Stream API collection with twarc using the hashtag #elxn42. The Search API was used to gather any tweets with the #elxn42 hashtag posted before the initial collection date. The Stream API collection was initiated with the intention of gathering #elxn42 tweets for the entirety of the election. However, twarc failed silently during September, and the research team did not notice until later. We believe the failure was caused by a Twitter API or network connection issue, but we cannot say with confidence why it failed silently. As a result, we lost 27 days in total. Upon realizing the collection had failed, the research team immediately resumed collecting via the Stream API and simultaneously began a Search API collection (which allows collection back seven to nine days). Data collection stopped on November 5, 2015, the day after Justin Trudeau was sworn in as the 42nd Prime Minister of Canada, for a total of 102 days, 13 hours, and 50 minutes.

In retrospect, the research team recommends using a combination of collection via the Search and Streaming APIs: a Streaming API collection over the whole period of the capture, plus weekly Search API collections. Then, at the end of data collection, we would concatenate all the files together and deduplicate the entire dataset (sketched below).

Library and Archives Canada (LAC) also collected the #elxn42 hashtag, using the Search API, during a similar time period: August 11, 2015 – October 28, 2015. The team made use of the LAC #elxn42 capture by downloading their tweet ID dataset (Library and Archives Canada, 2015) and hydrating it. Once the LAC dataset was hydrated, the team combined their original dataset (Ruest, 2015) with the LAC dataset and deduplicated it (Ruest et al, 2015):

$ twarc.py --hydrate elxn42-tweets-LAC.txt > elxn42-tweets-LAC.json
$ cat elxn42-tweets.json elxn42-tweets-LAC.json > elxn42-tweets-combined.json
$ python ~/git/twarc/utils/deduplicate.py elxn42-tweets-combined.json > elxn42-tweets-combined-deduplicated.json
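To make the recommendation above concrete, here is a minimal sketch of the workflow we would use for a future collection. The filenames are our own, and the --track syntax applies to twarc 0.5.0 and later, as described in the next section:

# run for the duration of the event (Stream API)
$ twarc.py --track "#elxn42" > elxn42-stream.json

# once a week, top up with the Search API to fill any gaps
$ twarc.py --search "#elxn42" > elxn42-search-week01.json

# at the end of collection, concatenate everything and deduplicate
$ cat elxn42-stream.json elxn42-search-week*.json > elxn42-tweets-all.json
$ python ~/git/twarc/utils/deduplicate.py elxn42-tweets-all.json > elxn42-tweets-all-deduplicated.json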
Even combining our capture with LAC's does not necessarily mean that we captured all tweets. Driscoll and Walker (2014) have shown substantial differences in what is captured using Twitter's commercial Gnip service versus the Streaming API. The Streaming API enforces a hard limit that comes into play if the volume of tweets you are capturing exceeds 1% of all tweets, which is common during high-profile events such as the Paris shootings or an American presidential debate. While the #elxn42 hashtag never exceeded that limit, there is still a chance that some content was not collected.

How Do You Collect?

Collecting tweets is very straightforward. Once you install and configure twarc, you can collect tweets using the Twitter Stream and Search APIs. The syntax changed slightly with twarc 0.5.0, so we have provided both versions as examples below.

Search API:

$ twarc.py --search "#elxn42" > elxn42-search.json

Stream API (twarc < 0.5.0):

$ twarc.py --stream "#elxn42" > elxn42-stream.json

Stream API (twarc >= 0.5.0):

$ twarc.py --track "#elxn42" > elxn42-stream.json

These two APIs complement each other well. The Search API provides historical search on a given query, such as #elxn42, stretching back somewhere between six and nine days. Twitter's documentation cautions that "the Search API is focused on relevance and not completeness. This means that some tweets and users may be missing from search results." Given our project goals, this makes the Search API insufficient on its own. For completeness, then, we can turn to the Streaming API. This gives "developers low latency access to Twitter's global stream of Tweet data," up to the aforementioned 1% volume. Whereas the Search API reaches back into past tweets, the Streaming API only captures tweets as they happen. To put this into context, we could begin a Search API collection for #elxn42 on 5 September 2015 and still get tweets from 3 September 2015; the Streaming API cannot retroactively gather content, but it is more complete. A combination of the two is the recommended approach: the Streaming API for bulk collection, and the Search API to fill in any gaps that may have occurred while using the system.

Once collected, tweets can be shared with other people through their tweet IDs, which can be rehydrated using twarc. As twarc's README notes:

The Twitter API's Terms of Service prevent people from making large amounts of raw Twitter data available on the Web. The data can be used for research and archived for local use, but not shared with the world. Twitter does allow files of tweet identifiers to be shared, which can be useful when you would like to make a dataset of tweets available. You can then use Twitter's API to hydrate the data, or to retrieve the full JSON for each identifier.

This is particularly important for verification of social media research. The command:

$ twarc.py --hydrate elxn42-tweet-ids.txt > elxn42-tweets.json

will recreate the original tweet(s) in JSON format, provided the content is still available on Twitter. If you wanted to use our dataset, for example, it can be downloaded from the Scholars Portal Dataverse. If a user deleted their tweet between the time of our collection and the time of your rehydration, you would not gain access to that tweet.
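A simple, if rough, way to gauge how many tweets have been deleted or made private since collection is to compare the number of identifiers you distribute with the number of tweets that actually hydrate, since each hydrated tweet occupies one line of JSON. A minimal sketch, reusing the filenames from the example above:

# number of tweet IDs shared
$ wc -l elxn42-tweet-ids.txt

# number of tweets that still hydrate; the difference is the number
# of tweets deleted or made private in the interim
$ wc -l elxn42-tweets.json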
Should You Collect? Ethical Considerations

Beyond the technical question of how to collect tweets comes the ever-important question of whether you should, and if so, how to handle the question of consent. Strictly speaking, we have legal permission thanks to the Twitter Developer Agreement & Policy. We can only capture public tweets, and given that the tweets are public, we interpret that as consent in the broadest form to archive and preserve this material. Consent is not perpetual, however, as users may decide to make their account "private" after collection. Accordingly, when tweet IDs are hydrated, only publicly accessible tweets are hydrated (indeed, as deleted or private tweets are not made available via the API, this is unavoidable – one cannot get data about a deleted tweet from Twitter). So, if a tweet is deleted in the period between our capture and hydration, the tweet will not be hydrated. Similarly, if an account is public at capture time and set to private before hydration, the tweet will not be hydrated. We discuss this further in our section below on deleted tweets.

George Washington University's Library has been exploring the legal and ethical implications of Twitter archiving as part of their work with the Social Feed Manager, a platform to collect social media data from Twitter. In a recent presentation at Web Archives 2015: Capture, Curate, Analyze, Seemantani Sharma, Vakil Smallen, and Daniel Chudnov (2015) explored the three primary legal areas of concern: copyright, privacy, and access. Because, in the United States, the issues surrounding copyright and fair use would largely not see tweets treated as protected content, they focus much of their attention on the murkier area of the ethical concerns of privacy and access. Securing consent at the collection stage is largely unworkable, as Sharma, Smallen, and Chudnov note – making this a far trickier question.

As they note, however, legal does not equal ethical. As Aaron Bady (2014) has observed, "[t]he act of linking or quoting someone who does not regard their Twitter as public is only ethically fine if we regard the law as trumping the ethics of consent." As researchers at the University of Southern California discovered with their "Black Twitter Project," many are uncomfortable with the prospect of their online content being harnessed without consent for research projects (O'Neil, 2014). Yet, if we do not archive this material, invaluable, diverse perspectives on unfolding events like the 2015 Canadian federal election could be lost forever. Collecting these tweets raises the prospect of a historical record not dominated by the mainstream media. We thus collect the material with the proviso that it needs to be used ethically by researchers. As Dorothy Kim and Eunsong Kim (2014) put it in their "#TwitterEthics Manifesto," academics and those using this material in their work need to rethink their approach:

In the end, the work, the credit, the compensation, and the view need to be a shared, collaborative process. Twitter and New Media journalism, the internet and technology involves all of us. The voices on the platform are multiple, collective, dissenting, singular, and loud. You don't need to speak for us–we are talking. Cite us, ask us to write, get our permission.

We collect the material so that it can be used, and researchers need to be ethically aware. When distributing the tweet IDs, we encourage researchers to use this material with respect.

Approach to Analysis

To analyze the data set, we took advantage of standard command line utilities, a number of utilities that are available with twarc and twarc-report, as well as jq. twarc-report is a set of utilities "for generating reports from twarc collections using tools such as D3.js." (Binkley, 2015) The timeline graphs above were created with twarc-report.
The command is as follows:

$ ~/git/twarc-report/d3times.py elxn42-tweets-combined-deduplicated.json -a -o embed -t local -i 24H > elxn42-times.html

The flags do the following: -a aggregates the output; -o specifies that we wanted embedded output; -t specifies the timezone to use (local, or EST, in our case); and -i sets the interval, in our case every 24 hours.

Upon completion of the #elxn42 capture, the team immediately began aggregating their dataset into a single file. The team began with 12 different line-oriented JSON files totalling 22GB and 4,117,753 tweets (before deduplication). These 12 files were aggregated into a single file:

$ cat *json > elxn42-tweets.json

Once aggregated, the dataset was validated with validate.py (ensuring that each line was a valid JSON object) and deduplicated with deduplicate.py (deduplication is necessary given the combination of Search API and Stream API collection modes). Once deduplicated, we were able to count the number of tweets collected. Since each tweet is a single JSON object on a single line of the file, a simple command line utility gives the total:

$ cat elxn42-tweets-combined-deduplicated.json | wc -l

Since Twitter automatically shortens URLs, the team also unshortened every URL in the dataset so that we could create a canonical list of the URLs tweeted for further analysis. We did this with a combination of tools: unshorten.py and unshrtn ("a small leveldb backed URL unshortening microservice written for node").

$ sudo docker build --tag unshrtn:dev .
$ sudo docker run -p 80:3000 -d -t unshrtn:dev
$ cat elxn42-tweets-combined-deduplicated.json | ~/git/twarc/utils/unshorten.py > elxn42-tweets-combined-deduplicated-unshortened.json

With the unshortened URLs, we were able to run subsequent analyses: from seeding a subsequent web crawl with the corpus in order to explore the web content surrounding #elxn42, to comparing coverage of the #elxn42 URL corpus with the broader Internet Archive, and beyond. This sort of derivative dataset can be very useful, especially given the URL-centric nature of the Wayback Machine.

Data Analysis and Results

Text

Using jq, we extracted the plain text of every tweet:

$ cat elxn42-tweets.json | jq -c '.text' | cat > elxn42-tweets-text.txt

This was useful for working with text analysis software, such as custom scripts written in R, Python, or Mathematica, or even the accessible online platform Voyant Tools.
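jq can just as easily extract other fields from each tweet's JSON. As a brief illustration, not part of our original workflow, the following produces a CSV of tweet ID, timestamp, author, and retweet count, ready for loading into R or a spreadsheet (the selected attributes are standard fields in Twitter's tweet JSON; the output filename is arbitrary):

$ cat elxn42-tweets-combined-deduplicated.json | jq -r '[.id_str, .created_at, .user.screen_name, (.retweet_count // 0)] | @csv' > elxn42-tweets-summary.csv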
We were also interested in contrasting Twitter data by day, to see how it evolved. To do so, we used the following script:

#!/usr/bin/env python
# CC0 1.0 Universal
#
# Splits a file of line-oriented tweet JSON into one file per day
# (elxn42-tweets-YYYYMMDD.json), bucketing tweets by their created_at
# timestamp in Eastern time. Note that it re-reads the input for each
# day, so the input should be passed as a filename argument rather
# than piped on stdin.

from __future__ import print_function
import sys
import json
import fileinput
import dateutil.parser
import dateutil.rrule
import pytz
import pandas as pd
import datetime
import io

eastern = pytz.timezone('US/Eastern')

start_date = dateutil.parser.parse("25-July-2015")
start_date = eastern.localize(start_date)
end_date = dateutil.parser.parse("06-November-2015")
end_date = eastern.localize(end_date)

dates = pd.date_range(start_date, end_date).tolist()

for date in dates:
    date_plus_one = date + pd.DateOffset(1)
    pretty_print = date.to_pydatetime().strftime('%Y%m%d')
    filename = 'elxn42-tweets-' + pretty_print + '.json'
    f = io.open(filename, 'w', encoding='utf-8')
    for line in fileinput.input():
        tweet = json.loads(line)
        created_at = dateutil.parser.parse(tweet["created_at"])
        created_at = created_at.astimezone(eastern)
        if ((created_at >= date) and (created_at < date_plus_one)):
            f.write(unicode(json.dumps(tweet, ensure_ascii=False) + '\n'))
    f.close()

Once broken into dates, we could run further analysis. Built into twarc is the ability to generate word clouds of tweets, using the following command, for example (using the 18 October 2015 data):

$ python ~/git/twarc/utils/wordcloud.py elxn42-tweets-18-oct-2015.json > wordcloud-18-oct-2015.html

While word clouds have considerable limitations, especially in the occlusion of context around a given keyword, the simplicity of the visualization – where the more a word appears, the larger it is – can surface overall trends. The ensuing results can be seen below. Here we can see the following transition in the tweets:

17 October 2015: The keyword "Harper" is the most prominent one, as it was throughout much of the election. As the incumbent was a politically polarizing individual, the election was largely a referendum on his leadership.

18 October 2015: The day before election day. "Vote" becomes the most prominent keyword, as users exhort each other to be ready for the polls. No one political party dominates, though "conservative" remains the most frequent party-related word.

19 October 2015: Election day. We see "Vote" dominate, as well as the word "Liberal," the latter largely reflecting the widely retweeted announcement of the Liberal Party of Canada's victory that evening.

20 October 2015: The new Prime Minister Trudeau is the topic of the day, as well as his first name: "Justin."

At a glance, we are seeing a major narrative within the tweets. You can see all of the wordclouds yourself here, or animated here. This could be useful for a researcher wanting an overall birds-eye view of the content, or as a teaser to further investigations. It also speaks to how researchers could use more sophisticated textual analysis software or programming languages, such as R, Python, Mathematica, or beyond, to extract meaningful information from this soup of knowledge.
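As a small illustration of that kind of follow-up analysis, not part of our original workflow, a short Python sketch can tally the most frequent words in one day's worth of tweets, a text-only counterpart to the word clouds above (the input filename follows the naming convention of the day-splitting script):

#!/usr/bin/env python
# Count the most frequent words in one day's worth of line-oriented
# tweet JSON. A rough sketch: a real analysis would also strip stop
# words, mentions, and URLs.
from __future__ import print_function
import json
from collections import Counter

counts = Counter()
with open('elxn42-tweets-20151019.json') as f:
    for line in f:
        tweet = json.loads(line)
        # lowercase and split on whitespace
        counts.update(tweet["text"].lower().split())

for word, n in counts.most_common(25):
    print(word, n)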
Retweets

Retweets can tell us quite a bit, mostly about which tweets were collectively deemed to be the most significant: whether because retweeters agreed with them, disagreed with them, or wanted to share in a pivotal moment. For example, the most retweeted tweet was Justin Trudeau and his wife declaring that they were "ready" after winning the election. The most retweeted tweets can be seen below, generated using retweets.py and tweet_urls.py from twarc's utilities:

$ python ~/git/twarc/utils/retweets.py elxn42-tweets-combined-deduplicated.json > elxn42-tweets-retweets.json
$ python ~/git/twarc/utils/tweet_urls.py elxn42-tweets-retweets.json > elxn42-tweets-retweets.txt

    Retweets   Tweet
1.  5483       https://twitter.com/JustinTrudeau/status/656342399854223360
2.  2104       https://twitter.com/globalnews/status/655983013168336897
3.  2104       https://twitter.com/CBCAlerts/status/656283780152479744
4.  1999       https://twitter.com/CTVNews/status/656283368863223808
5.  1808       https://twitter.com/22_Minutes/status/655902459769004032
6.  1760       https://twitter.com/VancityReynolds/status/656355980997881856
7.  1541       https://twitter.com/pmharper/status/655828288594669569
8.  1456       https://twitter.com/TheAdamChristie/status/656228806118789120
9.  1421       https://twitter.com/west_ender/status/656295500765761537
10. 1417       https://twitter.com/JustinTrudeau/status/655912460101152768

Geographic Information

Only 5,370 of the 3,918,932 #elxn42 tweets (0.14%) had geographic information associated with them. We were able to determine this by using geo.py from twarc's utilities along with simple command line tools:

$ python ~/git/twarc/utils/geo.py elxn42-tweets-combined-deduplicated.json > elxn42-tweets-with-geo.json
$ cat elxn42-tweets-with-geo.json | wc -l
5370

We were also able to create a geoJSON file of all the tweets with geographic information associated with them, using geojson.py from twarc's utilities:

$ python ~/git/twarc/utils/geojson.py elxn42-tweets-combined-deduplicated.json > elxn42-tweets.geojson

With this geoJSON file, we were then able to map the tweets fairly simply with Leaflet.js, putting them on an interactive map with some simple HTML and JavaScript boilerplate:
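A minimal sketch of such a page, as our own illustration: it assumes Leaflet is loaded from a public CDN and that the elxn42-tweets.geojson file produced above is served alongside the HTML; the exact markup will vary.

<!DOCTYPE html>
<html>
<head>
  <link rel="stylesheet" href="https://unpkg.com/leaflet@1.9.4/dist/leaflet.css" />
  <script src="https://unpkg.com/leaflet@1.9.4/dist/leaflet.js"></script>
  <style>#map { height: 600px; }</style>
</head>
<body>
  <div id="map"></div>
  <script>
    // Centre the map roughly on Canada.
    var map = L.map('map').setView([56.0, -96.0], 4);
    L.tileLayer('https://tile.openstreetmap.org/{z}/{x}/{y}.png', {
      attribution: '&copy; OpenStreetMap contributors'
    }).addTo(map);
    // Fetch the geoJSON produced by geojson.py and add each tweet to the map.
    fetch('elxn42-tweets.geojson')
      .then(function (response) { return response.json(); })
      .then(function (data) { L.geoJSON(data).addTo(map); });
  </script>
</body>
</html>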