ISDS Annual Conference Proceedings 2013. This is an Open Access article distributed under the terms of the Creative Commons Attribution-Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ISDS 2013 Conference Abstracts

Utility of Potential Misdiagnoses in Predicting Foodborne Outbreaks

Lucia Lucia*(1), Artur Dubrawski(2) and Lujie Chen(2)

(1) Singapore Management University, Singapore, Singapore; (2) Carnegie Mellon University, Pittsburgh, PA, USA

Objective
To investigate the utility of inpatient and emergency room diagnoses for detecting outbreaks of Salmonellosis in humans, and to quantify the impact of including in the analysis cases diagnosed with conditions that may present similarly to Salmonellosis.

Introduction
Reliable detection and accurate scoping of outbreaks of foodborne illness are key to effective mitigation of their impacts. However, the relatively small number of persons affected and underreporting challenge the reliability of surveillance models. In this work, we correlate a record of identified outbreaks and sporadic cases of Salmonellosis in humans retained in PulseNet [1] with diagnosis codes in hospital claims collected in California from 2006 to 2010. We hypothesize that the data support and reliability of detection could be improved by including cases of conditions with which Salmonella infection may be confused [2].

Methods
We join the data in a table indexed by date and location, containing counts of inpatient and emergency department (ED) patients diagnosed with Salmonellosis and related diseases, as well as counts of cases involved in outbreaks. Counts are aggregated by day (the admission date or the isolation date) and by location (the county of the hospital or the county where the outbreak occurred).
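The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `build_joined_table`, the tuple layouts, the column label `outbreak_cases`, and the sample records are all hypothetical.

```python
from collections import defaultdict

def build_joined_table(claims, outbreak_cases):
    """Count records per (date, county) from both sources and merge them.

    claims: iterable of (admission_date, hospital_county, diagnosis) tuples
    outbreak_cases: iterable of (isolation_date, county) tuples
    Returns a dict keyed by (date, county); each value maps a column name
    (a diagnosis label or "outbreak_cases") to a count.
    """
    table = defaultdict(lambda: defaultdict(int))
    for date, county, diagnosis in claims:
        table[(date, county)][diagnosis] += 1
    for date, county in outbreak_cases:
        table[(date, county)]["outbreak_cases"] += 1
    # Convert to plain dicts so later lookups do not create empty columns.
    return {key: dict(cols) for key, cols in table.items()}

# Made-up illustrative records, not data from the study.
claims = [
    ("2008-07-01", "Yolo", "salmonellosis"),
    ("2008-07-01", "Yolo", "celiac"),
    ("2008-07-01", "Yolo", "celiac"),
    ("2008-07-02", "Kern", "salmonellosis"),
]
outbreak_cases = [("2008-07-01", "Yolo"), ("2008-07-03", "Kern")]

table = build_joined_table(claims, outbreak_cases)
for key in sorted(table):
    print(key, table[key])
```

Keying both sources on the same (date, county) pair means a row exists whenever either source has a record, so a day with outbreak cases but no matching hospital diagnoses still appears in the table.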
Of the 66,845 rows in the table, 9.5% involve sporadic cases and identified clusters.

To quantify the predictive utility of potential misdiagnoses, a zero-inflated Poisson (ZIP) regression model [3] is trained to predict the number of cases in the epidemiological data. Among Salmonellosis (counts in inpatient and ED settings) and 12 potential misdiagnoses, the best combination of input features is found by exhaustive search to minimize the 10-fold cross-validation ZIP prediction error. The chosen model is then retrained on all data using the selected features. Similarly, we train a Random Forest (RF) binary classifier [4] that also includes spatio-temporal predictors (county and month) to discount seasonality and the spatial propensity of outbreaks.

Results
We found that 8 diagnoses related to Salmonellosis have a non-trivial impact on outbreak predictability (only Celiac is insignificant, with p-value > 0.05). Their contributory effect is indicated by positive coefficients in the ZIP count model and negative coefficients in the ZIP zero model, as shown in the table. Including counts of these diagnoses improves the predictability of outbreak occurrence compared with using Salmonellosis diagnoses only: the AUC score of the RF model increases from 57% to 87%. Adding spatio-temporal factors improves the predictability to 91% AUC. The model discovers 71% of actual outbreak cases at a 7% false positive rate (FPr) and correctly recalls 4.5 times as many outbreak cases at 1% FPr as when using Salmonellosis diagnoses only. We found that 37% of the predictions can be made 1 to 7 days earlier than the recorded isolation date, increasing precision to 89%. This suggests a potential early warning utility. It is also possible to spot outbreaks not revealed in PulseNet. For instance, 22 out of 35 outbreak predictions in Yolo County are not in PulseNet; for 60% of these 22, at least 40% of nearby counties show positive predictions or actual cases in PulseNet in the same periods of time.
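To illustrate why a positive count-model coefficient together with a negative zero-model coefficient indicates a contributory effect, the following sketch evaluates a ZIP model for a single predictor. It assumes the usual ZIP parameterization (log link for the Poisson mean, logit link for the structural-zero probability), which the abstract does not state explicitly. All coefficient values and function names are hypothetical and are not fitted results from the study.

```python
import math

def zip_pmf(y, lam, pi):
    """P(Y = y) for a zero-inflated Poisson: a point mass at zero mixed with Poisson(lam)."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    return (pi if y == 0 else 0.0) + (1.0 - pi) * poisson

def zip_params(x, b0, b1, g0, g1):
    """Map one predictor x to (lam, pi) via a log link and a logit link."""
    lam = math.exp(b0 + b1 * x)                  # count model: Poisson mean
    pi = 1.0 / (1.0 + math.exp(-(g0 + g1 * x)))  # zero model: P(structural zero)
    return lam, pi

def zip_mean(lam, pi):
    """Expected count under the ZIP model."""
    return (1.0 - pi) * lam

# Hypothetical coefficients chosen only to show the sign pattern.
B0, B1 = -1.0, 0.3   # count model: positive slope
G0, G1 = 1.0, -0.4   # zero model: negative slope

for x in (0, 2, 5):
    lam, pi = zip_params(x, B0, B1, G0, G1)
    print(x, round(pi, 3), round(zip_mean(lam, pi), 3))
```

With this sign pattern, a larger diagnosis count x both raises the Poisson mean and lowers the probability of a structural zero, so the expected number of outbreak cases increases through both components of the model.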
Conclusions
We empirically found an informative correlation between the counts of hospital patients diagnosed with diseases that may present similarly to Salmonellosis and epidemiologically recorded cases of Salmonellosis. This suggests that tracking these diseases could support the accuracy of foodborne illness surveillance. Further study is required to verify the actual extent of clinical misdiagnosis and whether other factors explain the apparent correlation.

Keywords
foodborne outbreaks; misdiagnosis; predictive analytics

Acknowledgments
This work is supported by the National Science Foundation (awards 0911032, 1320347), and the Singapore National Research Foundation under its International Research Centre @Singapore Funding Initiative and administered by the IDM Programme Office, Media Development Authority.

References
1. PulseNet. http://www.cdc.gov/pulsenet/about/index.html
2. Rightdiagnosis. http://www.rightdiagnosis.com
3. Lambert D. Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing. Technometrics. 1992;34(1):1-14.
4. Breiman L. Random Forests. Machine Learning. 2001;45(1).

*Lucia Lucia
E-mail: lucia.2009@phdis.smu.edu.sg

Online Journal of Public Health Informatics * ISSN 1947-2579 * http://ojphi.org * 6(1):e173, 2014