ISDS Annual Conference Proceedings 2012. This is an Open Access article distributed under the terms of the Creative Commons Attribution-Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ISDS 2012 Conference Abstracts

#wheezing: A Content Analysis of Asthma-Related Tweets

Gwendolyn Gillingham*1, Michael A. Conway2, Wendy W. Chapman2, Michael B. Casale3 and Kathryn B. Pettigrew3

1Linguistics, UCSD, La Jolla, CA, USA; 2UCSD - Division of Biomedical Informatics, La Jolla, CA, USA; 3West Health Institute, La Jolla, CA, USA

Objective
We present a Content Analysis project using Natural Language Processing to aid in Twitter-based syndromic surveillance of Asthma.

Introduction
Recently, a growing number of studies have made use of Twitter to track the spread of infectious disease. These investigations show that there are reliable spikes in traffic related to keywords associated with the spread of infectious diseases like Influenza [1], as well as other Syndromes [2]. However, little research has been done using Social Media to monitor chronic conditions like Asthma, which do not spread from sufferer to sufferer. We therefore test the feasibility of using Twitter for Asthma surveillance, using techniques from NLP and machine learning to achieve a deeper understanding of what users Tweet about Asthma, rather than relying only on keyword search.

Methods
We retrieved a large volume of Tweets from the Twitter API.
Search terms included “asthma,” and several misspellings of that word; terms for common medical devices associated with Asthma such as “inhaler” and “nebulizer”; and names of prescription drugs used to treat the condition, including “albuterol” and “Singulair.” A randomly sampled subset of these Tweets (N=3511) was annotated for content, based on an annotation scheme that coded for the following elements: the Experiencer of Asthma symptoms (Self, Family, Friend, Named Other, Unidentified, and All-Non-Self, which was the union of these last four categories); aspects of the type of information being conveyed by each Tweet (Medication, Triggers, Physical Activity, Contacting of a Medical Practitioner, Allergies, Questions, Suggestions, Information, News, Spam); as well as Negative Sentiment, Future temporality, and Non-English content. Further details on the annotation scheme used can be found at http://idiom.ucsd.edu/!ggilling/annotation.pdf. Inter-annotator agreement on a subset of the Tweets (N=403) fell in an acceptable range for all categories (Cohen’s Kappa >0.6). Once annotation was complete, the Tweets’ texts were stemmed and converted into vectors of unigram and bigram counts. These were then stripped of sparse terms (all those words appearing in fewer than 1 in 200 Tweets), which left multi-dimensional vectors consisting of the counts of the remaining words in all Tweets. Statistical machine-learning classifiers including K-nearest neighbors, Naive Bayes and Support Vector Machines were then trained on the unigram and bigram models.

Results
SVM with 10-fold cross-validation achieved greatest prediction accuracy with the unigram model, as shown in Table 1. Categories that showed the greatest reduction in classification error using the unigram model were Non-English, Self, All-Non-Self, Medication, Symptoms and Spam. The majority of these categories showed very high Precision, as well as fairly high Recall for the unigram model.
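To make the unigram and bigram models concrete, the count-vector construction described in Methods can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function names `ngrams` and `count_vectors`, the whitespace tokenizer, and the toy Tweets are all assumptions, stemming is omitted, and the demo uses a threshold of 0.5 instead of the 1-in-200 cutoff because the toy corpus has only four Tweets.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_vectors(tweets, n, min_fraction=1 / 200):
    """Build n-gram count vectors, dropping terms that occur in too few Tweets.

    A term is kept only if it appears in at least `min_fraction` of the
    Tweets. Tokenization is a plain lowercase whitespace split (assumed);
    the stemming step mentioned in the text is not reproduced here.
    """
    tokenized = [t.lower().split() for t in tweets]
    grams = [ngrams(tokens, n) for tokens in tokenized]

    # Document frequency: number of Tweets containing each term at least once.
    doc_freq = Counter(term for g in grams for term in set(g))
    min_docs = min_fraction * len(tweets)
    vocab = sorted(term for term, df in doc_freq.items() if df >= min_docs)

    # One count vector per Tweet, with one dimension per surviving term.
    vectors = []
    for g in grams:
        counts = Counter(g)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


# Hypothetical toy corpus, for illustration only.
toy_tweets = [
    "my asthma is bad today",
    "need my inhaler now",
    "asthma attack need my inhaler",
    "bad asthma day",
]

unigram_vocab, unigram_vectors = count_vectors(toy_tweets, 1, min_fraction=0.5)
bigram_vocab, bigram_vectors = count_vectors(toy_tweets, 2, min_fraction=0.5)
print(unigram_vocab)
print(bigram_vocab)
```

On this toy corpus, five unigrams pass the frequency filter but only two bigrams do, since any given pair of words recurs across Tweets less often than a single word.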
Unexpectedly, the bigram model fared far worse than the unigram model, which suggests that individual words in these Tweets were more reliably predictive of content than pairs of words, which occurred less frequently.

Conclusions
Text classification increases the utility of Twitter as a data source for studying chronic conditions such as Asthma. Using these methods, we can automatically reject Tweets that are non-English or Spam. We can also determine who is experiencing symptoms: the Twitter user or another individual. Fairly simple models are able to predict with good certainty whether a user is talking about their Symptoms, their Medication, or Triggers for their Asthma, as well as whether they are expressing Negative sentiment about their condition. We demonstrate that Social Media such as Twitter is a promising means by which to conduct surveillance for chronic conditions such as Asthma.

Table 1: Performance of Classifiers on Unigram and Bigram Models

Keywords
social media; natural language processing; asthma; content analysis

Acknowledgments
This work was financially supported by the West Wireless Health Institute and iDASH Summer Internship program (NIH U54HL108460).

References
1. Chew, C. & Eysenbach, G. 2010. Pandemics in the Age of Twitter: Content Analysis of Tweets in the H1N1 Outbreak. PLoS ONE 5(11): e14118.
2. Collier, N. & Doan, S. 2011. Syndromic Classification of Twitter Messages. Proc. eHealth 2011, Malaga, Spain. November 21-23.

*Gwendolyn Gillingham
E-mail: gwen.gillingham@ling.ucsd.edu

Online Journal of Public Health Informatics * ISSN 1947-2579 * http://ojphi.org * 5(1):e65, 2013