ISDS Annual Conference Proceedings 2017. This is an Open Access article distributed under the terms of the Creative Commons Attribution-Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ISDS 2016 Conference Abstracts

Identification of Sufferers of Rare Diseases Using Medical Claims Data

Jieshi Chen* and Artur Dubrawski

Auton Lab, Carnegie Mellon University, Pittsburgh, PA, USA

Objective
To identify sufferers of rare and hard-to-diagnose diseases by detecting sequential patterns in historical medical claims.

Introduction
Patients who suffer from rare diseases can go undiagnosed for prolonged periods of time. In the process, they are often subjected to tentative treatments for ailments they do not have, risking an escalation of their actual condition and side effects from therapies they do not need. Early and accurate detection of these cases would enable follow-ups for precise diagnoses, mitigating the costs of unnecessary care and improving patients’ outcomes.

Methods
A sequential rule learning algorithm [1] was applied to a medical claims dataset of about 1,700 patients who were pre-selected for having medical histories indicative of Gaucher Disease (GD); only 25 of these patients were confirmed positives. About 168,000 medical claims and 142,000 pharmaceutical claims were featurized into sequences of asynchronous events and regularly sampled time series as inputs for the model, such that each occurrence of a given diagnosis code in a medical claim was counted as one event along the timeline of the patient’s medical history. A similar method was applied to other key attributes of claims data, including procedure codes, National Drug Codes, Diagnosis Related Groupers, etc. These types of events as well as their temporal statistics, e.g.
moving frequencies, peaks, change points, etc., formed the input feature space for the algorithm, which was trained to adjudicate each test case and estimate its likelihood of having GD. A random forest algorithm was also applied to the same feature set to comparatively evaluate the utility of sequential aspects of the data. The models were evaluated with 10-fold cross-validation.

Results
Figure 1 shows the Receiver Operating Characteristic (ROC) curves of the temporal rule model, with an Area Under the Curve score exceeding 81% and significantly outperforming the random forest and default models. Considering the practical costs of performing follow-up genetic tests, we prefer a model achieving high positive recall at low risk of false detection. Our model correctly identifies more than 25% of known positive cases at a false positive rate well below 0.1%, while the performance of a more popular alternative is indistinguishable from random. This demonstrates the utility of the sequential structure of medical claims in identifying patients who suffer from rare diseases.

Our algorithm infers highly interpretable rules from data and uses them in case adjudication. Figure 2 illustrates one of them. The root node of the case adjudication tree (Event.7969) reflects the ICD-9 diagnosis code of “Other nonspecific abnormal findings”. Among the 14 patients who have this particular ICD-9 code present in their claim history, 36% are confirmed GD sufferers. Compared to the default prevalence of 1.47% in our pre-selected data set, this rule lifts the estimated likelihood of GD about 25 times. The rule further develops into two child nodes. The left child node adds the condition of having any outpatient claim observed within 43 claims recorded near the occurrence of the root node event. It isolates 5 patients, all of whom are GD-positive. The right child shows that 3 patients without Event.7969 in their claim history but prescribed NDC 62756-0137-02 (Gabapentin by Sun Pharmaceutical Industries Ltd.)
are all GD-positive. This is just one example of a simple and easy-to-implement business rule that is capable of identifying previously undiagnosed sufferers of rare diseases.

Conclusions
Our model successfully utilizes sequential relationships among events recorded in medical claims data and reveals interpretable patterns that can identify sufferers of rare diseases with high confidence. The algorithm scales well to large volumes of medical claims data and remains sensitive despite a very low prevalence of target cases in the data.

Figure 1. ROC diagrams of models trained to identify GD patients, shown with a decimal logarithmic scale on the false positive rate axis.

Figure 2. Example rule used to adjudicate GD cases.

Keywords
sequential patterns; medical history; rare diseases

Acknowledgments
This work has been partially supported by NSF (1320347) and CMU Disruptive Health Technology Institute.

References
1. Guillame-Bert M, Dubrawski A. Classification of Time Sequences using Graphs of Temporal Constraints. Journal of Machine Learning Research, 2016 (under review).

*Jieshi Chen
E-mail: jieshic@andrew.cmu.edu

Online Journal of Public Health Informatics * ISSN 1947-2579 * http://ojphi.org * 9(1):e29, 2017
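
Illustrative sketch (not the authors' implementation). The Methods describe counting each occurrence of a code in a claim as one event on a patient's timeline and deriving temporal statistics such as moving frequencies; the Results compare a rule's positive rate with the baseline prevalence. The Python below is a minimal sketch of those two ideas under stated assumptions: the function names, the `(patient_id, day, code)` record layout, the `dx:`/`ndc:` code prefixes, and the toy claims are all hypothetical, and the count of 5 positives among 14 matched patients is inferred from the reported 36%.

```python
from bisect import bisect_right
from collections import defaultdict


def build_event_timelines(claims):
    """Group (patient_id, day, code) records into per-patient event timelines.

    Each occurrence of a code counts as one event; events are sorted by day.
    """
    timelines = defaultdict(list)
    for patient_id, day, code in claims:
        timelines[patient_id].append((day, code))
    for events in timelines.values():
        events.sort()
    return dict(timelines)


def moving_frequency(events, code, window_days, sample_days):
    """Count events with `code` in the trailing window ending at each sample day."""
    days = sorted(day for day, event_code in events if event_code == code)
    counts = []
    for t in sample_days:
        # Events with t - window_days < day <= t fall inside the window.
        lo = bisect_right(days, t - window_days)
        hi = bisect_right(days, t)
        counts.append(hi - lo)
    return counts


def rule_lift(matched_positives, matched_total, all_positives, all_total):
    """Positive rate among patients matching a rule, divided by overall prevalence."""
    return (matched_positives / matched_total) / (all_positives / all_total)


# Hypothetical toy claims: (patient_id, day offset, event code).
claims = [
    ("p1", 3, "dx:7969"),
    ("p1", 40, "dx:7969"),
    ("p1", 45, "ndc:62756-0137-02"),
    ("p2", 10, "dx:7969"),
]
timelines = build_event_timelines(claims)
freq = moving_frequency(
    timelines["p1"], "dx:7969", window_days=30, sample_days=[30, 60, 90]
)
```

With the approximate figures reported above, `rule_lift(5, 14, 25, 1700)` evaluates to roughly 24.3, which is consistent with the stated lift of about 25 times given that the cohort size is itself approximate.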