OJPHI Twitter Influenza Surveillance: Quantifying Seasonal Misdiagnosis Patterns and their Impact on 
Surveillance Estimates 
 

Online Journal of Public Health Informatics * ISSN 1947-2579 * http://ojphi.org * 8(3):e198, 2016 
 

Twitter Influenza Surveillance: Quantifying Seasonal 
Misdiagnosis Patterns and their Impact on Surveillance 
Estimates 
Jared Mowery 

The MITRE Corporation 

Abstract 

Background: Influenza (flu) surveillance using Twitter data can potentially save lives and 
increase efficiency by providing governments and healthcare organizations with greater 
situational awareness. However, research is needed to determine the impact of Twitter users’ 
misdiagnoses on surveillance estimates. 

Objective: This study establishes the importance of Twitter users’ misdiagnoses by showing 
that Twitter flu surveillance in the United States failed during the 2011-2012 flu season, 
estimates the extent of misdiagnoses, and tests several methods for reducing the adverse 
effects of misdiagnoses. 

Methods: Metrics representing flu prevalence, seasonal misdiagnosis patterns, diagnosis 
uncertainty, flu symptoms, and noise were produced using Twitter data in conjunction with 
OpenSextant for geo-inferencing, and a maximum entropy classifier for identifying tweets 
related to illness. These metrics were tested for correlations with World Health Organization 
(WHO) positive specimen counts of flu from 2011 to 2014. 

Results: Twitter flu surveillance erroneously indicated a typical flu season during 2011-2012, 
even though the flu season peaked three months late, and erroneously indicated plateaus of 
flu tweets before the 2012-2013 and 2013-2014 flu seasons. Enhancements based on 
estimates of misdiagnoses removed the erroneous plateaus and increased the Pearson 
correlation coefficients by .04 and .23, but failed to correct the 2011-2012 flu season estimate. 
A rough estimate indicates that approximately 40% of flu tweets reflected misdiagnoses. 

Conclusions: Further research into factors affecting Twitter users’ misdiagnoses, in 
conjunction with data from additional atypical flu seasons, is needed to enable Twitter flu 
surveillance systems to produce reliable estimates during atypical flu seasons. 

Keywords: biosurveillance, social media, natural language processing, supervised machine 
learning 

 
 

Correspondence: jmowery@mitre.org 

DOI: 10.5210/ojphi.v8i3.7011 

Copyright ©2016 the author(s) 
This is an Open Access article. Authors own copyright of their articles appearing in the Online Journal of Public Health 
Informatics. Readers may copy articles without permission of the copyright owner(s), as long as the author and OJPHI are 
acknowledged in the copy and the copy is used for educational, not-for-profit purposes. 

 
Introduction 

Many studies have investigated using social media data or online data to perform 
biosurveillance [1, 2]. Eysenbach [3] was the first to use trends in internet searches as a 
means of estimating flu prevalence, and Ritterman et al. [4] subsequently became the first 
to use Twitter data for flu surveillance. 

Twitter flu surveillance systems generally rely on keyword filters and classifiers to 
produce weekly counts of tweets indicative of flu prevalence. Lamb et al. [5] developed a 
classifier which distinguishes between tweets reflecting an awareness of the flu and 
tweets describing an infection with the flu, which tightens the causal relationship between 
weekly counts of flu tweets and Centers for Disease Control and Prevention (CDC) or WHO 
measurements. Smith et al. [6] demonstrated that tweets related to general awareness of 
the flu yield substantially different trends than tweets related to infections, and Nagar et 
al. [7] reported that a classifier incorporating an annotator’s estimate of the likelihood 
that a tweet indicated illness was important for their analysis of flu prevalence in New 
York City. Zuccon et al. [8] tested a wide variety of classifier types, with results 
indicating the choice of classifier has a limited effect on accuracy. 

Recent studies have expanded the Twitter flu surveillance systems in a variety of ways, 
including encompassing multiple countries [9, 10], combining multiple indicators [10, 
11], increasing geospatial resolution [7, 12–14], handling additional languages [15, 16], 
and estimating the secondary attack rate and serial interval [17]. 

However, Twitter flu surveillance relies on Twitter users’ diagnoses of the flu. There are 
many potential causes of misdiagnoses. Nsoesie and Brownstein [1] observe that many 
existing systems likely measure influenza-like illness (ILI), which can be caused by a 
variety of non-flu pathogens. Chew and Eysenbach’s Twitter content analysis during the 
2009 pandemic [18] contains a rich set of metrics reflecting emotion levels, 
misinformation, and news or blog links that could all influence Twitter authors in 
choosing whether to tweet about an infection, and whether to diagnose that infection as 
the flu. 

Since Twitter is not a representative sample of the United States’ population [19-21], 
Twitter flu surveillance estimates will be biased. Studies have investigated potential 
variations in the peak time, morbidity, and rate of flu transmission as a function of age 
group and social networks [22-25]. Region and humidity may also influence flu mortality 
rates and spread [26-27]. Finally, although positive specimen counts for the CDC or 
WHO are used as ground truth data, variations in the collection and testing of specimens, 
participation levels of laboratories, and other factors may introduce sampling biases. 

Detecting atypical flu seasons reliably is important, since they may require atypical 
responses from governments and healthcare organizations to save lives and increase 
efficiency. This study focuses on flu seasons with atypical onset times, such as the 2011-
2012 flu season, since these yield the most direct evidence for misdiagnoses. Since this 
study is intended to quantify Twitter users’ misdiagnoses rather than maximize the 
correlation between flu estimates and WHO counts, it does not incorporate additional 
data sources which could obscure misdiagnosis patterns in Twitter, such as search query 
volumes or time-lagged positive specimen count data. Many of the algorithms were 
implemented using the R Project for Statistical Computing [28]. 

Methods 

Data Collection and Classification 

This study used Gnip Decahose [29] data, which is a 10% pseudo-random sample of 
publicly available tweets. The tweet volumes collected each week between the weeks 
starting on 2011-08-01 and 2014-09-15 exhibit several gaps due to internet connectivity 
issues and hardware failures. These gaps were corrected by extrapolating from nearby 
data using a two-pass process. 

The first pass applied a sliding median filter of width 15 to approximate the expected 
counts for each week. Any range of weeks with week indices [a, b] in which zero tweets 
were collected was replaced by the estimated values from a linear interpolation between 
the values at indices a − 2 and b + 2. 

The second pass applied a sliding median filter of width 7 to the results of the first pass. 
The following equation was used to produce a corrected count $\hat{t}_i$ for each week i: 

$$
\hat{t}_i =
\begin{cases}
s_i & \text{if } t_i < 0.9\, s_i, \\
t_i & \text{otherwise,}
\end{cases}
\tag{1}
$$

where $s_i$ is the output of the second sliding median filter and $t_i$ is the tweet count after 
zeroes were replaced by the first pass. The constant 0.9 was chosen to apply the 
correction only when the weekly count was at least 10% less than the expected count, 
which served as a rough method for identifying weeks during which data loss occurred. 
Applying Equation 1 compensated for the gaps in data collection (Figure 1). 
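The two-pass correction can be sketched as follows. This is a minimal Python illustration, not the study's R implementation. The function names are hypothetical, and two details are assumptions because the text does not specify them: the median windows are centered and truncated at the series edges, and the interpolation endpoints at indices a − 2 and b + 2 are taken from the median-filtered series.

```python
from statistics import median


def sliding_median(values, width):
    """Centered sliding median; the window is truncated at the series edges."""
    half = width // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def fill_zero_runs(counts, expected):
    """First pass: replace each run of zero-count weeks [a, b] by a straight
    line between the expected values at indices a - 2 and b + 2."""
    filled = list(counts)
    n = len(counts)
    i = 0
    while i < n:
        if counts[i] != 0:
            i += 1
            continue
        a = i
        while i < n and counts[i] == 0:
            i += 1
        b = i - 1
        # Assumption: clamp the endpoints when the gap touches a series edge.
        lo, hi = max(0, a - 2), min(n - 1, b + 2)
        for j in range(a, b + 1):
            frac = (j - lo) / (hi - lo) if hi > lo else 0.0
            filled[j] = expected[lo] + frac * (expected[hi] - expected[lo])
    return filled


def correct_counts(counts):
    """Two-pass correction ending with Equation 1."""
    expected = sliding_median(counts, 15)   # first-pass expected counts
    t = fill_zero_runs(counts, expected)    # zero weeks replaced
    s = sliding_median(t, 7)                # second-pass expected counts
    return [s_i if t_i < 0.9 * s_i else t_i for t_i, s_i in zip(t, s)]
```

For a flat series of 100 tweets per week with two empty weeks and one half-collected week, `correct_counts` restores 100 for every week.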

 

Figure 1: Tweets collected per week. The Original series shows the number of tweets 
collected from the Decahose feed. The First Pass series depicts the result of using linear 
interpolation to replace the counts for weeks in which zero tweets were collected. The 
Corrected series shows the estimated number of tweets which would have been collected 
for each week if there had not been data collection gaps. 

The metrics based on Twitter data must also be adjusted to compensate for the data losses. 
The following equation produced adjusted counts for each week i: 

$$
\hat{c}_i =
\begin{cases}
\hat{c}_k & \text{if } t_i = 0 \text{ or } t_{i-1} = 0 \text{ or } t_{i+1} = 0, \text{ with the maximum } k \text{ s.t. } k < i \text{ and } t_k > 0, \\
c_i \, \dfrac{\hat{t}_i}{t_i} & \text{if } t_i \neq \hat{t}_i \text{ and } t_i > 0, \\
c_i & \text{otherwise,}
\end{cases}
\tag{2}
$$

where $c_i$ is the count produced by a metric and $\hat{c}_i$ is the count adjusted for potential data 
loss. This equation assumes the fraction of tweets which match the criteria for a metric is 
consistent, so the value of the metric during a week which experienced data loss can be 
approximated by applying the same fraction to the number of tweets expected during that 
week. For weeks in which no tweets were collected, the adjusted metric value for the 
most recent week in which tweets were collected was used. Although a better estimate 
could have been obtained through linear interpolation, this approach uses only data which 
would have been available at the time. 
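A minimal Python sketch of Equation 2 follows. The function name `adjust_metric` is hypothetical. It assumes that t holds the raw weekly tweet counts (so that empty weeks appear as zeros) and that a week next to a gap with no earlier collected week falls through to the remaining cases, which the text does not address.

```python
def adjust_metric(c, t, t_hat):
    """Apply Equation 2 to metric counts c, given raw weekly tweet counts t
    and corrected weekly tweet counts t_hat (all lists aligned by week)."""
    n = len(c)
    adjusted = []
    last_good = None  # adjusted value for the latest week k < i with t[k] > 0
    for i in range(n):
        near_gap = (
            t[i] == 0
            or (i > 0 and t[i - 1] == 0)
            or (i + 1 < n and t[i + 1] == 0)
        )
        if near_gap and last_good is not None:
            value = last_good                  # carry the last usable week forward
        elif t[i] > 0 and t[i] != t_hat[i]:
            value = c[i] * t_hat[i] / t[i]     # rescale by expected / collected
        else:
            value = c[i]
        adjusted.append(value)
        if t[i] > 0:
            last_good = value
    return adjusted
```

Because each week only looks backward for its replacement value, the sketch keeps the property stated above: it uses only data that would have been available at the time, apart from the check on the following week's count.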

 

This study used the WHO’s weekly positive counts of flu virus specimens in the United 
States, including types A and B [30], as ground truth data. The 2011-2012 flu season 
peaked approximately three months late compared to the 2012-2013 and 2013-2014 flu 
seasons. This is valuable for quantifying the extent to which Twitter users’ misdiagnoses 
adversely affect the correlation strength between Twitter flu surveillance estimates and 
WHO positive specimen counts, since tweets in late 2011 most likely reflect 
misdiagnoses. 

The maximum entropy classifier was trained on 1,274 English language tweets 
containing illness or symptom related terms collected between December 31, 2011 and 
January 31, 2012. Each tweet was hand-annotated by a single annotator for indications 
that the author, or someone the author knew, was ill. Examples of illness included flu, 
common colds, allergies, and symptoms such as nausea, sore throat, and nasal congestion. 
Instances of symptoms not due to illness, such as nausea due to overeating, stomach pain 
due to consuming spicy foods, and muscle aches due to exercise, were not counted as 
illness. The tweets which were related to illness according to the classifier are referred to 
as “sick tweets” in this paper. Due to the expense of developing classifiers for multiple 
languages, non-English tweets were not considered in this study. 

The maximum entropy classifier used Apache’s OpenNLP [31] implementation. 
Retweets and tweets containing URLs were excluded to help reduce the number of tweets 
related to news stories or memes. Unigrams, bigrams, and the tweet length in [0.0, 1.0], 
with 1.0 corresponding to a length of 200 characters, were used as features since they are 
commonly used and computationally inexpensive. The classifier used Gaussian 
regularization with σ = 1.0 and 10,000 iterations to ensure convergence. The classifier’s 
performance was tested using stratified 10-fold cross-validation. To bias the classifier in 
favor of precision over recall, only tweets whose classifier score exceeded 0.75 were 
designated as sick tweets. The constant 0.75 was chosen since it yielded weekly counts 
typically over 100 for sick tweets which contained the word “flu”. The lowest non-zero 
weekly count was 97, and the average count was 696. 

Metrics Collection 

This study collected several metrics from the sick tweets. Tweets were filtered using 
illness and symptom related keywords, restricted to the United States by applying 
OpenSextant [32] to the user-provided location fields, and then limited to the English 
language using the Cybozu Labs Language Detection Library for Java [33]. Out of the 
13,273,284 tweets containing illness or symptom related terms, OpenSextant provided 
estimated locations for 3,667,309 of them, or 27.6%. Retweets and tweets containing 
URLs were excluded to match the classifier training data. 

  
 

Table 1: Case-insensitive queries used to define each metric. Each metric is 
restricted to English tweets classified as sick tweets from the United States. 

Metric        Query                             Example
Flu           flu                               Feeling miserable. Go away flu!
Uncertainty   might or maybe or hope            I might be coming down with a fever
UncertaintyF  (might or maybe or hope) and flu  Sore throat… nose like a tap… might be flu
Symptom       sore throat or fever              Had a sore throat for days now
SymptomF      (sore throat or fever) and flu    Fever all day, hope it’s not flu

Most of the metrics were simply defined as the fraction of tweets each week which 
matched a case-insensitive query (Table 1). The Flu metric contained only sick tweets 
with the word “flu”, which are referred to as “flu tweets” in this paper. The Uncertainty 
metric is intended to measure Twitter authors’ uncertainty in their diagnoses, such as “I 
might be getting sick”, “Maybe this is just an allergy”, or “I hope this is not the flu”. The 
Symptom metric measures tweets containing two common symptoms of influenza-like 
illness: fevers and sore throats. Finally, metrics with the suffix “F” have been restricted to 
flu tweets. Since the weekly counts of flu tweets were generally over 100, this study did 
not examine misspellings of query terms or the use of slang. 
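The queries in Table 1 can be illustrated with a short Python sketch. The helper names are hypothetical, whole-word matching is an assumption (the paper does not say whether substrings such as "influenza" were matched), and the denominator is assumed to be the week's corrected total tweet count.

```python
import re


def contains(text, term):
    """Case-insensitive whole-word (or whole-phrase) match."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


def any_of(text, terms):
    return any(contains(text, term) for term in terms)


UNCERTAIN_TERMS = ("might", "maybe", "hope")
SYMPTOM_TERMS = ("sore throat", "fever")

# One predicate per metric in Table 1.
METRICS = {
    "Flu": lambda t: contains(t, "flu"),
    "Uncertainty": lambda t: any_of(t, UNCERTAIN_TERMS),
    "UncertaintyF": lambda t: any_of(t, UNCERTAIN_TERMS) and contains(t, "flu"),
    "Symptom": lambda t: any_of(t, SYMPTOM_TERMS),
    "SymptomF": lambda t: any_of(t, SYMPTOM_TERMS) and contains(t, "flu"),
}


def weekly_fraction(sick_tweets, total_tweets, metric):
    """Fraction of the week's tweets (total_tweets) that are sick tweets
    matching the named metric's query."""
    matches = sum(1 for text in sick_tweets if METRICS[metric](text))
    return matches / total_tweets
```

Each example tweet in Table 1 satisfies the predicate for its own row under these assumptions.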

The Noise metric is an estimate of the expected fraction of flu tweets during periods in 
which the flu is not prevalent. The thirteen weeks occurring in the middle of each year 
were used to estimate the noise level, which corresponds to an estimate that 
approximately one quarter of weeks during the year are not substantially affected by the 
flu season. The mean count for each of these midyear periods was used as a noise 
estimate. Due to the difficulty of distinguishing flu tweets arising from flu infections 
from tweets arising from misdiagnoses, noise cannot effectively be measured during 
periods in which the flu is prevalent. Therefore, each consecutive pair of midyear noise 
estimates was linearly interpolated to generate the complete noise estimate. The noise 
level gradually decreased during the period tweets were collected, which may be a 
consequence of the atypical 2011-2012 flu season (Figure 2). 
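The Noise metric construction can be sketched in Python as below. The function name and argument layout are hypothetical. Placing each midyear mean at the center of its thirteen-week period, and holding the estimate flat before the first and after the last midyear period, are assumptions that the paper does not confirm.

```python
def noise_estimate(flu_fraction, midyear_periods):
    """Piecewise-linear noise estimate.

    flu_fraction: weekly Flu metric values.
    midyear_periods: chronological list of (start, end) week-index pairs,
    inclusive, one per midyear period.
    """
    # One anchor per midyear period: (center week, mean Flu metric).
    anchors = []
    for start, end in midyear_periods:
        window = flu_fraction[start:end + 1]
        anchors.append(((start + end) / 2, sum(window) / len(window)))

    noise = []
    for week in range(len(flu_fraction)):
        if week <= anchors[0][0]:
            noise.append(anchors[0][1])       # assumption: flat before first anchor
        elif week >= anchors[-1][0]:
            noise.append(anchors[-1][1])      # assumption: flat after last anchor
        else:
            # Interpolate between the pair of anchors that brackets this week.
            for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
                if x0 <= week <= x1:
                    noise.append(y0 + (y1 - y0) * (week - x0) / (x1 - x0))
                    break
    return noise
```

In the study each period would span thirteen weeks; the sketch accepts any period length so that it can be exercised on small inputs.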

 

Figure 2: Noise estimate based on linearly interpolating noise estimates from each 
midyear period. The Midyear series shows the weeks which were used to estimate the 
noise for each midyear period. Each series has been divided by the corrected total number 
of tweets collected each week. 

Misdiagnosis Measurement 

Since WHO positive specimen counts show the flu was not prevalent from August 2011 
through December 2011, despite an increase in flu tweets, the flu tweets from that time 
period largely represent misdiagnoses. Measuring the number of misdiagnosis tweets 
over time for a typical flu season is potentially valuable for counteracting their effects on 
Twitter flu surveillance, but there are two major challenges: 

1) separating the misdiagnosis tweets from the small number of correct 
diagnoses of the flu, classifier false positives, and other sources of noise 
from August 2011 to December 2011, and 

2) estimating misdiagnosis tweets for January 2012 through May 2012, since 
direct measurement is complicated by the genuine prevalence of the flu. 

To address the first challenge, this study subtracts the Noise metric from the Flu metric. 
The Noise metric is an estimate of the fraction of flu tweets expected during periods in 
which the flu is not prevalent. Since the flu was not prevalent in late 2011, the Flu metric 
should have equaled the Noise metric during that time period. Therefore, subtracting the 
Noise metric leaves the flu tweets which contributed to the unexpected rise in flu tweets 
during late 2011. 

 

To address the second challenge, this study estimates misdiagnosis tweets from late 2011 
and extrapolates them to early 2012. The weekly fractions of misdiagnosis tweets from 
August to December 2011 were estimated by smoothing the flu tweets, subtracting the 
Noise metric, and normalizing by the Noise metric: 

$$
m_i = \frac{\operatorname{med}(f_i) - n_i}{n_i}
\tag{3}
$$

where i is the week (limited to August through December 2011), $m_i$ is a unitless factor 
which estimates the fraction of misdiagnosis tweets when multiplied by the Noise metric, 
med is a sliding median filter of width 5, $f_i$ is the Flu metric, and $n_i$ is the Noise metric. 
Both $f_i$ and $n_i$ are expressed as fractions of the corrected total tweet count for week i. 
The smoothing is intended to reduce the effects of noise, and the normalization by $n_i$ 
helps account for factors which may change from season to season by assuming the 
misdiagnosis estimate is proportional to the noise estimate. 
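Equation 3 can be written as a few lines of Python. This is a sketch with a hypothetical function name; the centered window, truncated at the series edges, is an assumption.

```python
from statistics import median


def misdiagnosis_factor(flu, noise, width=5):
    """Equation 3: m_i = (med(f_i) - n_i) / n_i, with a centered sliding
    median of the given width, truncated at the series edges."""
    half = width // 2
    smoothed = [
        median(flu[max(0, i - half): i + half + 1]) for i in range(len(flu))
    ]
    return [(f - n) / n for f, n in zip(smoothed, noise)]
```

A factor of 1.0 for a given week means the smoothed Flu metric is twice the Noise metric, so the misdiagnosis fraction for that week equals the noise fraction.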

This study hypothesized two extrapolations based on m: Tapered and Symmetric. The 
Tapered extrapolation assumes misdiagnosis tweets taper off as the flu season progresses, 
which continues the downward trend seen in misdiagnosis tweets at the end of 2011. The 
tapering was implemented with a linear interpolation between the misdiagnosis fraction 
at the end of 2011 (week starting 2012-01-02) and the estimate of the noise baseline at 
the end of the flu season (week starting 2012-06-04). Tapering could be caused by 
psychosocial factors, such as decreasing anxiety due to news media coverage reporting 
that the flu season was mild or late. The Symmetric extrapolation assumes the 
misdiagnosis tweet pattern is symmetric around the end of 2011, and the symmetry was 
implemented by concatenating the weekly counts in the weeks [2011-08-01, 2012-01-02] 
with the reversed weekly counts in weeks [2011-08-01, 2011-12-26]. The symmetric 
extrapolation assumes misdiagnosis tweets do not taper off as the flu season progresses, 
and that Twitter authors’ misdiagnoses are symmetric around the typical peak of a flu 
season. This could correspond to Twitter users’ misdiagnoses reflecting their 
expectations of flu prevalence during a typical flu season. Both estimates of the 
misdiagnosis errors cover the same range of weeks. 
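The two extrapolations can be sketched in Python as follows, operating on the list of unitless factors measured for the weeks 2011-08-01 through 2012-01-02. The function names are hypothetical. Treating the noise baseline as a factor of zero (the value Equation 3 yields when the Flu metric equals the Noise metric) is an assumption about how the taper's endpoint was expressed.

```python
def symmetric_extrapolation(m):
    """Concatenate the measured factors with their reverse, omitting the
    final measured week so that it is not duplicated."""
    return list(m) + list(reversed(m[:-1]))


def tapered_extrapolation(m, baseline=0.0):
    """Continue from the last measured factor to the baseline in a straight
    line, over the same number of weeks as the symmetric variant."""
    steps = len(m) - 1  # number of extrapolated weeks
    last = m[-1]
    tail = [last + (baseline - last) * k / steps for k in range(1, steps + 1)]
    return list(m) + tail
```

With 23 measured weeks, both functions return 45 weekly values, which matches the stated range of 2011-08-01 through 2012-06-04.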

Copying the unitless estimates $m_i$ and the extrapolated values (weeks 2011-08-01 to 
2012-06-04) to the corresponding weeks centered on January 1st of the 2012-2013 
(weeks 2012-07-30 to 2013-06-03) and 2013-2014 (weeks 2013-07-29 to 2014-06-02) flu 
seasons, and then multiplying by the Noise metric, yielded the final estimate of the 
fraction of misdiagnosis tweets for 2011-2014 (Figure 3). Since the misdiagnosis estimate 
was constructed to be proportional to the noise estimates from the midyear periods, and 
since those midyear periods were likely to have few tweets correctly diagnosing the flu, 
the midyear periods were excluded from the misdiagnosis estimates. 

 

Figure 3: Estimated weekly fraction of misdiagnosis tweets. 

Finally, the two misdiagnosis-based estimates of flu prevalence were produced by 
subtracting the weekly estimates of the fraction of misdiagnosis tweets from the weekly 
fraction of flu tweets for each of the two extrapolations. 

Misdiagnosis Cross-Validation 

The previous section used the prior knowledge that WHO positive specimen counts for 
late 2011 are approximately equal to the positive specimen counts when flu is not 
prevalent. However, this means its results can only be tested against data from early 2012 
onward, or that it must rely on comparisons with recent WHO positive specimen counts. 
Therefore, this study also uses a form of 3-fold cross-validation, in which an estimate is 
produced for a “test” flu season by using misdiagnosis tweet rates estimated by taking the 
difference between the WHO positive specimen counts and fractions of flu tweets for the 
remaining two “training” flu seasons. For each flu season, the same range of weeks was 
used as in the previous section. 

However, this approach requires comparing positive specimen counts and fractions of flu 
tweets. This paper used a simple linear regression, P ~ cF, between the WHO positive 
specimen counts (P) and the fraction of flu tweets for the non-test weeks (F) to obtain a 
constant (c) representing a best estimate of the unit conversion factor. The linear 
regression did not include a constant term, so the linear regression only estimated the 
single coefficient c. 
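For a regression with no constant term, the least-squares coefficient has the closed form c = Σ P_i F_i / Σ F_i². A minimal Python sketch, with a hypothetical function name, is:

```python
def fit_conversion_factor(specimen_counts, flu_fractions):
    """Least-squares slope c for P ~ c * F with no intercept."""
    numerator = sum(p * f for p, f in zip(specimen_counts, flu_fractions))
    denominator = sum(f * f for f in flu_fractions)
    return numerator / denominator
```

The result has units of specimens per unit tweet fraction, so dividing a specimen count by c expresses it on the same scale as the Flu metric.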

$$
m_i = \left( \frac{c f_i - p_i}{c} - n_i \right) \times n_i^{-1}
\tag{4}
$$

Equation 4 details obtaining the unitless misdiagnosis estimate $m_i$ for a flu season, where 
i is the week, c is the coefficient for unit conversion obtained via linear regression, $f_i$ is 
the Flu metric, $p_i$ is the positive specimen count from the WHO, and $n_i$ is the Noise 
metric. The final misdiagnosis tweet fraction estimate for the test flu season was obtained 
by averaging the unitless misdiagnosis estimates for the two training flu seasons and 
multiplying by the Noise metric for the test flu season. The misdiagnosis tweet fraction 
estimate was subtracted from the test flu season’s weekly fractions of flu tweets to yield 
the final estimate of flu prevalence. 
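The cross-validated estimate can be sketched in Python as below. The function names are hypothetical, and the sketch assumes the three seasons have already been aligned week by week, as described for the range of weeks used in the previous section.

```python
def training_factor(flu, specimens, noise, c):
    """Equation 4 for one training season (lists aligned by week)."""
    return [((c * f - p) / c - n) / n for f, p, n in zip(flu, specimens, noise)]


def estimate_test_season(train_a, train_b, test_flu, test_noise):
    """Average the two training seasons' factors week by week, scale by the
    test season's Noise metric, and subtract from its Flu metric."""
    estimate = []
    for m_a, m_b, f, n in zip(train_a, train_b, test_flu, test_noise):
        misdiagnosis = 0.5 * (m_a + m_b) * n
        estimate.append(f - misdiagnosis)
    return estimate
```

Here `train_a` and `train_b` are the outputs of `training_factor` for the two training seasons, and c is the regression coefficient fitted on the non-test weeks.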

Results 

Data Collection and Classification 

The maximum entropy classifier achieved an F-measure of .76, with .73 precision and .79 
recall. There were 354 true positives compared to 129 false positives, and 697 true 
negatives compared to 94 false negatives. To produce the actual counts of sick tweets, the 
classifier’s threshold was increased to .75 to favor precision over recall, since precision is 
more important for this study. The .75 threshold achieved an F-measure of .72, with .86 
precision and .61 recall. 
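The reported precision, recall, and F-measure for the default threshold follow directly from the confusion counts above, as this short Python check illustrates (the function name is hypothetical):

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and balanced F-measure from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts reported above for the default threshold.
p, r, f = precision_recall_f(tp=354, fp=129, fn=94)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # prints 0.73 0.79 0.76
```

The four counts also sum to 1,274, the size of the annotated training set.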

The Pearson correlation coefficient between the sick tweets and the WHO’s positive 
specimen counts is r = .66 (P < .001), which demonstrates that there is a significant 
degree of correlation even before filtering the sick tweets to examine only flu tweets. 
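For reference, the Pearson correlation coefficient between two aligned weekly series can be computed as in this minimal Python sketch (the function name is hypothetical; the study's own calculations were done with other tooling):

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

Because r is invariant to scaling either series, it does not matter whether the tweet series is expressed as counts or as fractions scaled by a constant.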

Metrics 

The Flu metric achieved a Pearson correlation with the WHO positive specimen counts of 
r = .72 (P < .001), which is an improvement over the correlation for sick tweets of r = .66. 
However, the Flu metric erroneously reports a typical flu season occurring in late 2011 
and early 2012, as well as plateaus of flu tweets occurring prior to the start of the next 
two flu seasons (Figure 4). The 2011-2012 flu season is erroneous in the sense that there 
is a substantial rise in flu tweets in late 2011 despite the lack of a corresponding increase 
in WHO positive specimen counts, resulting in the flu tweets exhibiting a pattern of 
elevated counts roughly centered on December even though the actual flu season peak 
occurred months later, according to the WHO positive specimen counts. 

 

Figure 4: Flu prevalence estimates versus WHO positive specimen count data (WHO) for 
the linear combination of the flu, noise, and uncertain metrics (Lin), and the flu metric 
alone (Flu). Although the Uncertain metric improves the correlation, both the flu and 
linear combination results erroneously estimated a 2011-2012 flu season occurring at the 
typical time, and produced plateaus of misdiagnosis tweets before each subsequent flu 
season. 

To measure the relative efficacy of the remaining metrics, the Pearson correlation 
coefficients between linear regressions of the metrics and the WHO positive specimen 
count data were calculated (Table 2). In each case, the linear regression included a 
constant term. To reduce over-fitting, each calculation used 10-fold cross-validation, in 
which the folds were obtained by partitioning the date range into 10 approximately 
equal-length time periods. The combination of using 10-fold cross-validation and linear 
regression increased the difficulty of obtaining high correlation coefficients, which 
reduced the correlation for the Flu metric from r = .72 to r = .54. Introducing the Noise 
metric substantially improved the correlation result, while adding the Sick tweets metric 
yielded no additional benefit. Holding the number of regressors constant by substituting 
the other metrics for the Sick metric revealed that only the Uncertain metric provided a 
substantial benefit. 
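The time-based fold construction can be sketched in Python as follows. The function names are hypothetical, and assigning the leftover weeks to the earliest folds is an assumption about how "approximately equal-length" was realized.

```python
def contiguous_folds(n_weeks, n_folds=10):
    """Split week indices 0..n_weeks-1 into contiguous, near-equal folds."""
    base, extra = divmod(n_weeks, n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)  # earliest folds take the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_test_splits(n_weeks, n_folds=10):
    """Yield (train, test) index lists, holding out one time period at a time."""
    folds = contiguous_folds(n_weeks, n_folds)
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
```

Holding out contiguous time periods, rather than randomly sampled weeks, prevents the regression from being fitted on weeks immediately adjacent to every test week, which is one reason the cross-validated correlations are lower.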

  
 

Table 2: Pearson correlation coefficients for multiple variable linear regressions 
using 10-fold cross-validation. The Uncertain metric substantially increases the 
correlation with the WHO’s positive specimen count. Note: the correlation for the 
Flu metric is .72 when not using 10-fold cross-validation and multiple variable 
linear regression. 

Regressors                  r
Flu                         .54
Flu + Noise                 .73
Flu + Sick + Noise          .73
Flu + Uncertain + Noise     .77
Flu + UncertainF + Noise    .73
Flu + Symptom + Noise       .72
Flu + SymptomF + Noise      .72

While the Uncertain metric improved the correlation coefficient, the regressions failed to 
remove the misdiagnosis tweets, which erroneously indicated a typical 2011-2012 flu 
season and erroneously showed plateaus of flu activity occurring before each of the next 
two flu seasons (Figure 4). 

Misdiagnosis Measurement 

The Flu, Symmetric, and Tapering metrics all correlate with the WHO’s ILI positive 
specimen counts (Table 3). The sum of P values for each correlation in the table was P 
< .001, indicating that the set of correlations passes the Bonferroni correction. However, 
the metrics vary in correlation strength: the Flu metric suffers from significant plateaus of 
misdiagnosis tweets preceding each flu season, the Symmetric metric can be rejected 
since it produces flu estimates below the noise baseline during each of the three flu 
seasons, and the Tapering metric successfully removes the false positive plateaus 
preceding each flu season but shows the flu seasons starting late (Figure 5). The Tapering 
metric achieved slightly higher correlations than the other two metrics in all three test 
conditions, and the Tapering metric gains the most benefit when more of the atypical 
2011-2012 flu season is included in the test. However, the test which excludes none of 
the data from the 2011-2012 season is only included for reference; since the late 2011 
tweets were used to construct the misdiagnosis tweets estimate, using that data commingles 
tuning and testing data. 

  
 

Table 3: Pearson correlation coefficients for the flu metric as well as the flu metric 
after subtracting the Symmetric and Tapering estimates of misdiagnosis tweets. The 
rows present the correlations when excluding none of the data, the first half of the 
typical 2011-2012 flu season, or the entire 2011-2012 flu season. Flu tweets from late 
2011 were used to measure the misdiagnosis tweets, and are included in the row for 
excluding none of the data. 

Exclusion    Flu    Symmetric    Tapering
None         .72    .73          .81
Half         .82    .77          .83
2011-2012    .84    .83          .85

 
Figure 5: Estimated flu prevalence before and after subtracting estimated misdiagnosis 
tweets for each of the Tapering and Symmetric extrapolation methods. The Symmetric 
method can be rejected since it produces flu estimates below the noise level for all three 
flu seasons. The Tapering method successfully removes the plateaus of misdiagnosis 
tweets which precede each of the three flu seasons, but shows the 2012-2013 and 2013-
2014 flu seasons starting late. The Tapering and Symmetric methods frequently overlap 
in the plot, due to sharing the same weekly misdiagnosis estimates for late 2011. 

The Tapering metric indicates that approximately 47,907 tweets were misdiagnoses, 
although this may be an overestimate since the 2012-2013 and 2013-2014 flu seasons 
start late according to the Tapering metric. There were 121,234 flu tweets total, which 
suggests that roughly 39.52% of the flu tweets reflected misdiagnoses. 
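The percentage is the ratio of the two counts reported above, which can be checked in a few lines of Python (variable names are illustrative):

```python
misdiagnosis_tweets = 47_907   # estimated via the Tapering metric
flu_tweets = 121_234           # all flu tweets

share = misdiagnosis_tweets / flu_tweets
print(f"{share:.2%}")  # prints 39.52%
```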

Misdiagnosis Cross-Validation 

Removing estimated misdiagnosis tweets based on 3-fold cross-validation for the three 
flu seasons successfully removes the plateaus of misdiagnosis tweets occurring before the 
2012-2013 and 2013-2014 flu seasons, while accurately reflecting the correct start dates 
for the 2012-2013 and 2013-2014 flu seasons (Figure 6). However, the erroneous 
estimate for the 2011-2012 flu season remains. The Pearson correlation coefficient was r 
= .76 (P < .001), compared to r = .72 for the Flu metric. 

 
Figure 6: Comparison of the Flu metric, after subtracting the 3-fold misdiagnosis estimate, 
to WHO positive specimen counts. The 3-fold estimate successfully removes the plateaus 
of flu tweets occurring prior to the starts of the 2012-2013 and 2013-2014 flu seasons, 
and accurately reflects the start dates of the 2012-2013 and 2013-2014 flu seasons, but it 
is unable to remove sufficient misdiagnosis tweets from the 2011-2012 flu season to 
reveal the season’s atypical timing. 
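
One way to picture the 3-fold procedure is as a leave-one-season-out estimate: the misdiagnosis pattern for each season is estimated only from the other two seasons and then subtracted from that season's flu tweet counts. The sketch below is a simplified illustration under stated assumptions, not the study's implementation. It assumes that weekly misdiagnosis counts per season are already available and week-aligned, that the held-out estimate is a plain average of the other seasons, and that corrected counts are floored at zero. The function names and all numbers are hypothetical.

```python
from statistics import mean

def leave_one_season_out(misdx_by_season):
    """For each season, average the week-aligned misdiagnosis counts
    of the remaining seasons (three seasons give three folds)."""
    estimates = {}
    for held_out in misdx_by_season:
        others = [weeks for season, weeks in misdx_by_season.items()
                  if season != held_out]
        # zip(*others) walks the other seasons one aligned week at a time.
        estimates[held_out] = [mean(week) for week in zip(*others)]
    return estimates

def subtract_estimate(flu_counts, estimate):
    """Subtract the estimate from weekly flu tweet counts.
    Flooring at zero is an assumption made for this sketch."""
    return [max(0.0, f - e) for f, e in zip(flu_counts, estimate)]

# Hypothetical toy data: three weeks per season, not taken from the study.
toy_misdx = {
    "2011-2012": [40, 60, 20],
    "2012-2013": [50, 70, 10],
    "2013-2014": [30, 50, 30],
}
estimates = leave_one_season_out(toy_misdx)
corrected = subtract_estimate([100, 90, 200], estimates["2011-2012"])
```

Holding out the season being corrected keeps that season's own data from shaping its misdiagnosis estimate, which is what makes the approach usable outside a purely retrospective setting.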

Discussion 

This study establishes the importance of misdiagnoses by showing that the pattern of flu 
tweets during the 2011-2012 flu season fails to approximate the WHO positive specimen 
counts, and that the flu tweets exhibit plateaus of misdiagnosis tweets preceding each of 
the next two flu seasons. This study quantifies the importance of misdiagnosis tweets by 
showing that the Tapering metric increases the correlation coefficient from r = .72 for the 
Flu metric alone to r = .81, removes the plateaus of misdiagnosis tweets prior to the 
2012-2013 and 2013-2014 flu seasons, and yields an estimate that 39.52% of flu tweets (47,907 
/ 121,234) reflect misdiagnoses. Finally, this study demonstrates that misdiagnoses can be 
counteracted via the Uncertain and Noise metrics (r = .54 increased to r = .77) and by 
applying 3-fold cross-validation to produce an estimate of seasonal misdiagnosis patterns 
(r = .76). 
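
The comparisons above all take the same form: compute the Pearson correlation between a weekly Twitter series and the WHO positive specimen counts, before and after subtracting a misdiagnosis estimate. The following sketch shows only the mechanics of that comparison. The data are hypothetical toy values constructed so that the correction is exact, which real data would not be, and the variable names are assumptions made for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical weekly values, not taken from the study.
who_counts = [0, 0, 5, 10]   # WHO positive specimen counts
misdx_est = [8, 8, 2, 0]     # pre-season plateau of misdiagnosis tweets
flu_tweets = [w + m for w, m in zip(who_counts, misdx_est)]

# Remove the estimated misdiagnoses, then compare both series to WHO counts.
corrected = [f - m for f, m in zip(flu_tweets, misdx_est)]
raw_r = pearson_r(flu_tweets, who_counts)
corrected_r = pearson_r(corrected, who_counts)
```

In this toy case the pre-season plateau lowers the raw correlation, and removing it restores agreement with the WHO series, mirroring the direction of the improvements reported above.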

However, each approach has limitations. Only the Tapering metric enabled detection of 
the 2011-2012 flu season, and it was developed with the prior knowledge that WHO 
positive specimen counts in late 2011 were low. This is useful for quantifying the impact 
of misdiagnoses, but presents a challenge for non-retrospective flu surveillance. While an 
implementation could use time-lagged WHO counts and apply the Tapering metric only 
once the flu season began, this may not be robust and it would sacrifice the ability to 
detect the start of the flu season via Twitter data. Non-retrospective flu surveillance can 
be enhanced by using either the Uncertain and Noise metrics or the 3-fold 
cross-validation estimate of seasonal misdiagnosis patterns. However, only the latter 
successfully removed misdiagnosis tweet plateaus before the 2012-2013 and 2013-2014 
flu seasons, which is necessary to accurately detect when those seasons begin. 

The limited availability of Twitter data in atypical flu seasons is a significant challenge 
for further analysis of misdiagnosis tweets. Analyzing multiple countries during an 
atypical flu season may be beneficial, but evidence that flu is spread by air travel [34] 
means that results for each country could not be treated as statistically independent. 

Further research could address improvements to data collection and classification, such as 
developing classifiers for multiple languages, experimenting with more complex 
classifiers and feature extraction, examining the effects of different annotation guidelines, 
using larger volumes of annotated tweets, and using expanded queries including 
misspellings and references to taking medications. In addition, demographic differences 
between Twitter users and WHO sampling may introduce additional inaccuracies. Finally, 
the data losses experienced during certain weeks of data collection may have produced 
inaccurate estimates despite the corrections described in the Methods section. 

This study focused on quantifying seasonal misdiagnosis errors specifically in Twitter 
data, rather than incorporating multiple exogenous data sources or statistical techniques 
to obtain the best possible estimate of flu prevalence. Many studies have shown that 
using multiple data sources and applying a variety of models can improve flu estimates. 
As a recent example, Santillana et al. demonstrated that using a combination of 
time-lagged CDC data and a new, timely source of electronic health records, which are not 
available to the public, can improve the accuracy of flu surveillance systems [35]. 

Twitter flu surveillance research is promising, but identifying misdiagnosis tweets 
remains a challenge. Although this paper presents methods of enhancing Twitter flu 
surveillance for flu seasons by using estimates of seasonal misdiagnosis tweeting patterns, 
these same seasonal misdiagnosis patterns also indicate a risk that there is only a weak 
causal connection between individuals infected with the flu and Twitter authors reporting 
flu infections. The weak causal connection is illustrated by the lack of correlation 
between flu tweets and WHO positive specimen counts during the 2011-2012 flu season, 
even after applying corrections for seasonal misdiagnosis patterns. Further research, in 
conjunction with data from additional atypical flu seasons, is needed to enable Twitter flu 
surveillance systems to produce reliable estimates of flu, rather than ILI, during atypical 
flu seasons. 

Acknowledgements 

The author would like to thank The MITRE Corporation for funding this research. 

Conflicts of Interest 

None declared. As a not-for-profit operator of federally funded research and development 
centers, The MITRE Corporation is not permitted to compete with industry. 

References 

1. Nsoesie EO, Brownstein JS. 2015. Computational approaches to influenza 
surveillance: beyond timeliness. Cell Host Microbe. 17(3), 275-78. PubMed 
http://dx.doi.org/10.1016/j.chom.2015.02.004 

2. Paul MJ, Sarker A, Brownstein JS, Nikfarjam A, Scotch M, et al. Social Media 
Mining for Public Health Monitoring and Surveillance. Pacific Symposium on 
Biocomputing (PSB); 2016; Kohala Coast, Hawaii. 2016. pp. 468-79. 

3. Eysenbach G. 2006. Infodemiology: tracking flu-related searches on the web for 
syndromic surveillance in AMIA. AMIA Annu Symp Proc. •••, 244-48. PubMed 

4. Ritterman J, Osborne M, Klein E. Using Prediction Markets and Twitter to Predict a 
Swine Flu Pandemic. Proceedings of the 1st International Workshop on Mining 
Social Media; 2009; Seville, Spain. 2009. pp. 9-17. 

5. Lamb A, Paul MJ, Dredze M. Separating Fact from Fear: Tracking Flu Infections on 
Twitter. HLT-NAACL; 2013; Atlanta, Georgia, USA. 2013. pp. 789–795. 

6. Smith MC, Broniatowski DA, Paul MJ, Dredze M. Towards Real-Time 
Measurement of Public Epidemic Awareness: Monitoring Influenza Awareness 
Through Twitter. AAAI Spring Symposium on Observational Studies Through 
Social Media and Other Human-Generated Content; 2016; Stanford, California. 2016. 

7. Nagar R, Yuan Q, Freifeld CC, Santillana M, Nojima A, et al. 2014. A case study of 
the New York City 2012-2013 influenza season with daily geocoded Twitter data 
from temporal and spatiotemporal perspectives. J Med Internet Res. 16(10), e236. 
PubMed http://dx.doi.org/10.2196/jmir.3416 

8. Zuccon G, Khanna S, Nguyen A, Boyle J, Hamlet M, Cameron M. Automatic 
detection of tweets reporting cases of influenza like illnesses in Australia. Health Inf 
Sci Syst 2015 Feb 24;3(Suppl 1 HISA Big Data in Biomedicine and Healthcare 2013 
Con):S4. PMID:25870759 

9. Paul M, Dredze M, Broniatowski D, Generous N. Worldwide Influenza Surveillance 
Through Twitter. Workshops at the Twenty-Ninth AAAI Conference on Artificial 
Intelligence; AAAI Conference on Artificial Intelligence; 2015; Austin, Texas. 2015. 

10. Zhang Q, Gioannini C, Paolotti D, Perra N, Perrotta D, et al. Social Data Mining and 
Seasonal Influenza Forecasts: The FluOutlook Platform. European Conference on 
Machine Learning and Principles and Practice of Knowledge Discovery in Databases; 
2015; Porto, Portugal. 2015. pp. 237-240. 

11. Santillana M, Nguyen AT, Dredze M, Paul MJ, Nsoesie EO, et al. 2015. Combining 
search, social media, and traditional data sources to improve influenza surveillance. 
PLOS Comput Biol. 11(10), e1004513. PubMed 
http://dx.doi.org/10.1371/journal.pcbi.1004513 

12. Dredze M, Paul MJ, Bergsma S, Tran H. Carmen: A Twitter Geolocation System 
with Applications to Public Health. Workshops at the Twenty-Seventh AAAI 
Conference on Artificial Intelligence; AAAI Conference on Artificial Intelligence; 
2013; Bellevue, Washington. 2013. 

13. Broniatowski DA, Dredze M, Paul MJ, Dugas A. 2015. Using social media to 
perform local influenza surveillance in an inner-city hospital: a retrospective 
observational study. JMIR Public Health Surveill. 1(1), e5. PubMed 

14. Broniatowski DA, Paul MJ, Dredze M. 2013. National and local influenza 
surveillance through Twitter: an analysis of the 2012-2013 influenza epidemic. PLoS 
One. 8(12), e83672. PubMed http://dx.doi.org/10.1371/journal.pone.0083672 

15. Li J, Huang W, Chen P. LDA Based Event Extraction: Detecting Influenza 
Epidemics Using Microblog. The Second International Conference on Data Science; 
2015; Sydney, Australia. 2015. pp. 30-33. 

16. Sun X, Ye J, Ren F. Hybrid Model Based Influenza Detection with Sentiment 
Analysis from Social Networks. Proceedings of the 4th National Conference in 
Social Media Processing; 2015; Guangzhou, China. 2015. pp. 51–62. 

17. Yom-Tov E, Johansson-Cox I, Lampos V, Hayward AC. 2015. Estimating the 
secondary attack rate and serial interval of influenza-like illnesses using social media. 
Influenza Other Respi Viruses. 9(4), 191-99. PubMed 
http://dx.doi.org/10.1111/irv.12321 

18. Chew C, Eysenbach G. 2010. Pandemics in the age of Twitter: content analysis of 
tweets during the 2009 H1N1 outbreak. PLoS One. 5(11), e14118. PubMed 
http://dx.doi.org/10.1371/journal.pone.0014118 

19. Mislove A, Lehmann S, Ahn Y-Y, Onnela J-P, Rosenquist JN. Understanding the 
demographics of Twitter users. ICWSM; 2011; Barcelona, Spain. 2011. pp. 554-557. 

20. Hecht B, Stephens M. A tale of cities: Urban biases in volunteered geographic 
information. ICWSM; 2014; Ann Arbor, Michigan. 2014. pp. 197-205. 

21. Malik M, Lamba H, Nakos C, Pfeffer J. Population bias in geotagged tweets. 
ICWSM; 2015; Oxford, England. 2015. pp. 18-27. 

22. Domnich A, Panatto D, Signori A, Lai PL, Gasparini R, et al. 2015. Age-related 
differences in the accuracy of web query-based predictions of influenza-like illness. 
PLoS One. 10(5), e0127754. PubMed 
http://dx.doi.org/10.1371/journal.pone.0127754 

23. Schanzer D, Vachon J, Pelletier L. 2011. Age-specific differences in influenza 
epidemic curves: do children drive the spread of influenza epidemics? Am J 
Epidemiol. 174(1), 109-17. PubMed http://dx.doi.org/10.1093/aje/kwr037 

24. Glass LM, Glass RJ. 2008. Social contact networks for the spread of pandemic 
influenza in children and teenagers. BMC Public Health. 8, 61. PubMed 
http://dx.doi.org/10.1186/1471-2458-8-61 

25. Nsoesie EO, Marathe M, Brownstein JS. 2013. Forecasting peaks of seasonal 
influenza epidemics. PLoS Curr. •••, 5. PubMed 

26. Shaman J, Pitzer VE, Viboud C, Grenfell BT, Lipsitch M. 2010. Absolute humidity 
and the seasonal onset of influenza in the continental United States. PLoS Biol. 8(2), 
e1000316. PubMed http://dx.doi.org/10.1371/journal.pbio.1000316 

27. Yang W, Lipsitch M, Shaman J. 2015. Inference of seasonal and pandemic influenza 
transmission dynamics. Proc Natl Acad Sci USA. 112(9), 2723-28. PubMed 
http://dx.doi.org/10.1073/pnas.1415012112 

28. R Core Team. 2016. R: A Language and Environment for Statistical Computing. R 
Foundation for Statistical Computing. 
https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf. Archived at: 
http://www.webcitation.org/6iCHaoYyS 

29. Gnip Inc. 2016. Decahose: Real Time Trend Detection and Discovery. 
https://gnip.com/realtime/decahose/. Archived at: 
http://www.webcitation.org/6jaw24k6R 

30. World Health Organization. 2016. FluNet. 
http://www.who.int/influenza/gisrs_laboratory/flunet/en/. Archived at: 
http://www.webcitation.org/6jawE3GNL 

31. Apache Software Foundation. 2016. The Apache OpenNLP Library. 
https://opennlp.apache.org/. Archived at: http://www.webcitation.org/6hvXhTr5U 

32. Ubaldino M, Lutz D. 2015. Open Source Entity Extraction, Geocoding and Temporal 
Coding Tools. https://github.com/OpenSextant. Archived at: 
http://www.webcitation.org/6jawHinp2 

33. Nakatani S. 2010. Language Detection Library for Java. 
https://github.com/shuyo/language-detection. Archived at: 
http://www.webcitation.org/6jawJ5DMA 

34. Leitmeyer K, Adlhoch C. 2016. Influenza transmission on aircraft: a systematic 
literature review. Epidemiology. 27(5), 743-51. PubMed 
http://dx.doi.org/10.1097/EDE.0000000000000438 

35. Santillana M, Nguyen AT, Louie T, Zink A, Gray J, et al. 2016. Cloud-based 
electronic health records for real-time, region-specific influenza surveillance. Sci Rep. 
6, 25732. PubMed http://dx.doi.org/10.1038/srep25732 

 