ISDS Annual Conference Proceedings 2018. This is an Open Access article distributed under the terms of the Creative Commons Attribution-
Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-commercial use, distribution,  
and reproduction in any medium, provided the original work is properly cited.

ISDS 2018 Conference Abstracts

Exploring the Value of Learned Representations for 
Automated Syndromic Definitions
Scott Lee*1, Drew Levin2, Jason Thomas1, Patrick Finley2 and Charles Heilig1
1Centers for Disease Control and Prevention, Decatur, GA, USA; 2Sandia National Laboratories, Albuquerque, NM, USA

Objective
To better define and automate biosurveillance syndrome 
categorization using modern unsupervised vector embedding 
techniques.

Introduction
Comprehensive medical syndrome definitions are critical for 
outbreak investigation, disease trend monitoring, and public health 
surveillance. However, because current definitions are based on 
keyword string-matching, they may miss important distributional 
information in free text and medical codes that could be used to build 
a more general classifier. Here, we explore the idea that individual 
ICD codes can be categorized by examining their contextual 
relationships across all other ICD codes. We extend previous work 
in representation learning with medical data [1] by generating 
dense vector embeddings of these ICD codes found in emergency 
department (ED) visit records. The resulting representations capture 
information about disease co-occurrence that would typically require 
subject-matter expert (SME) involvement to identify, and they support 
the development of more robust syndrome definitions.

Methods
We evaluate our method on anonymized ED visit records obtained 
from the New York City Department of Health and Mental Hygiene. 
The data set consists of approximately 3 million records spanning 
January 2016 to December 2016, each containing from one to ten 
ICD-9 or ICD-10 codes.

We use these data to embed each ICD code into a high-dimensional 
vector space following the techniques described by Mikolov et al. [2], 
colloquially known as word2vec. We define an individual code’s 
context window as the entirety of the health record in which it 
appears. We generate the final 300-dimensional embeddings with a 
skip-gram network, using the gensim machine learning library in 
Python, for qualitative evaluation.
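
The whole-record context window described above can be illustrated with a minimal sketch. It only shows which (target, context) training pairs a skip-gram model would see under that definition; it is not the authors' code and does not train any embeddings. The function name `skipgram_pairs` and the toy records are hypothetical, chosen for illustration only.

```python
from collections import Counter
from itertools import permutations


def skipgram_pairs(records):
    """Yield (target, context) pairs for skip-gram training.

    Assumption: the context of a code is every other code in the
    same visit record, i.e. the window spans the whole record.
    """
    for codes in records:
        # Ordered pairs of distinct positions within one record.
        for target, context in permutations(codes, 2):
            yield target, context


# Hypothetical toy records (not drawn from the study data).
records = [
    ["J10.1", "R50.9", "R05"],
    ["J45.20", "E66.09"],
    ["J10.1", "R05"],
]

# Count how often each ordered pair co-occurs across records.
pair_counts = Counter(skipgram_pairs(records))

for (target, context), count in sorted(pair_counts.items()):
    print(target, context, count)
```

A record containing a single code contributes no pairs, so it carries no co-occurrence signal. With a library such as gensim, a comparable effect would presumably be obtained by setting the window size at least as large as the longest record (ten codes in this data set); that setting is an assumption here, not something the abstract states.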

We use the TensorFlow Embedding Projector to visualize the 
resulting embedding space. We generate a three-dimensional t-SNE 
visualization with a perplexity of 32 and a learning rate of 10, run 
for 1,000 iterations (Figure 1). Finally, we use cosine distance to 
measure the nearest neighbors of common ICD-10 codes to evaluate 
the consistency of the generated vector embeddings (Table 1).
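
The nearest-neighbor step can be sketched as follows, assuming the embeddings are available as a mapping from code to vector. Cosine distance is taken here as one minus cosine similarity. The names `cosine_distance`, `nearest_neighbors`, and `toy_embeddings`, along with the placeholder codes and three-dimensional vectors, are hypothetical and do not reflect the study's actual embeddings or results.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbors(query, embeddings, k=3):
    """Return the k codes closest to `query` as (code, distance) pairs."""
    query_vec = embeddings[query]
    scored = [
        (cosine_distance(query_vec, vec), code)
        for code, vec in embeddings.items()
        if code != query  # exclude the query code itself
    ]
    scored.sort()  # smallest distance first
    return [(code, dist) for dist, code in scored[:k]]


# Hypothetical low-dimensional vectors standing in for 300-dimensional ones.
toy_embeddings = {
    "CODE_A": [1.0, 0.0, 0.0],
    "CODE_B": [0.9, 0.1, 0.0],
    "CODE_C": [0.0, 1.0, 0.0],
    "CODE_D": [-1.0, 0.0, 0.0],
}

for code, dist in nearest_neighbors("CODE_A", toy_embeddings, k=2):
    print(code, round(dist, 4))
```

Because cosine distance ignores vector length, two codes count as close when their vectors point in similar directions, regardless of how frequently each code appears.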

Results
t-SNE visualization of the generated vector embeddings confirms 
our hypothesis that ICD codes can be contextually grouped into 
distinct syndrome clusters (Figure 1). Manual examination of the 
resulting embeddings confirms consistency across codes from the 
same top-level category but also reveals cross-category relationships 
that would be missed by a strictly hierarchical analysis (Table 1). 
For example, not only does the method appropriately discover the 
close relationship between influenza codes J10.1 and A49.2, it also 
reveals a link between asthma code J45.20 and obesity code E66.09. 
We believe these learned relationships will be useful both for refining 
existing syndrome categories and developing new ones.

Conclusions
The embedding structure supports the hypothesis of distinct 
syndrome clusters, and nearest-neighbor results expose relationships 
between categorically unrelated codes that prove appropriate upon 
examination. The method works automatically, without the need for 
SME analysis, and it provides an objective, data-driven baseline for 
the development and refinement of syndrome definitions.

Table 1

Figure 1: t-SNE visualization of the 300-dimensional skip-gram embeddings 
of ICD codes. The heterogeneous structure suggests distinct syndrome 
definitions. Image generated using Google’s online TensorFlow Projector.

Keywords
Word embeddings; Deep learning; Syndrome definitions; ICD codes



Acknowledgments
This work was supported by Laboratory Directed Research and 
Development funding from Sandia National Laboratories. Sandia National 
Laboratories is a multimission laboratory managed and operated by 
National Technology and Engineering Solutions of Sandia LLC, a wholly 
owned subsidiary of Honeywell International Inc. for the U.S. Department 
of Energy’s National Nuclear Security Administration under contract 
DE-NA0003525.

References
[1] Choi Y, Chiu CY-I, Sontag D. Learning Low-Dimensional
Representations of Medical Concepts. AMIA Summits on
Translational Science Proceedings. 2016;2016:41-50.

[2] Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed
Representations of Words and Phrases and their Compositionality.
In: Advances in Neural Information Processing Systems. 2013. p. 3111-3119.

*Scott Lee
E-mail: yle4@cdc.gov

Online Journal of Public Health Informatics * ISSN 1947-2579 * http://ojphi.org * 10(1):e11, 2018