Layout 1


ISDS Annual Conference Proceedings 2012. This is an Open Access article distributed under the terms of the Creative Commons Attribution-
Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-commercial use, distribution, and
reproduction in any medium, provided the original work is properly cited.

ISDS 2012 Conference Abstracts

Extracting Surveillance Data from Templated Sections of
an Electronic Medical Note: Challenges and
Opportunities
Adi Gundlapalli*1, 2, Guy Divita1, 2, Marjorie Carter1, 2, Shuying Shen1, 2, Miland Palmer1, Tyler
Forbush1, 2, Brett South1, 2, Andrew Redd1, 2, Brian Sauer1, 2 and Matthew Samore1, 2

1VA Salt Lake City Health Care System, Salt Lake City, UT, USA; 2Internal Medicine, University of Utah School of Medicine, Salt Lake
City, UT, USA

Objective
To highlight the importance of templates in extracting surveillance

data from the free text of electronic medical records using natural
language processing (NLP) techniques.

Introduction
The main stay of recording patient data is the free text of electronic

medical records (EMR). While stating the chief complaint and history
of presenting illness in the patients ‘own words’, the rest of the elec-
tronic note is written by the provider in their words. Providers often
use boiler-plate templates from EMR pull-downs to document infor-
mation on the patient in the form of checklists, check boxes, yes/no
and free text responses to questions. When these templates are used
for recording symptoms, demographic information or medical, social
or travel history, they represent an important source of surveillance
data [1]. There is a dearth of literature on the use of natural language
processing in extracting data from templates in the EMR.

Methods
A corpus of 1000 free text medical notes from the VA integrated

electronic medical record (CPRS) was reviewed to identify com-
monly used templates. Of these, 500 were enriched for the surveil-
lance domain of interest for this project (homelessness). The other
500 were randomly sampled from a large corpus of electronic notes.
An NLP algorithm was developed to extract concepts related to our
target surveillance domain. A manual review of the notes was per-
formed by three human reviewers to generate a document-level ref-
erence standard that classified this set of documents as either
demonstrating evidence of homelessness (H) or not (NH). A rule-
based NLP algorithm was developed that used a combination of key
word searches and negation based on an extensive lexicon of terms
developed for this purpose. A random sample of 50 documents each
of H and NH documents were reviewed after each iteration of the
NLP algorithm to determine the false positive rate of the extracted
concepts.

Results
The corpus consisted of 48% H and 52% NH documents as deter-

mined by human review. The NLP algorithm successfully extracted
concepts from these documents. The H set had an average of 8 con-
cepts related to homelessness per document (median 8, range 1 to
34). The NH set had an average 2 concepts (median 1, range 1 to
13)”. Thirteen template patterns were identified in this set of docu-
ments. The three most common were check boxes with square brack-

ets, Yes/No and free text answer after a question. Several positively
and negatively asserted concepts were noted to be in the responses to
templated questions such as “Are you currently homeless: Yes or
No”; “How many times have you been homeless in the past 3 years:
(free text response)”; “Have you ever been in jail? [Y] or [N]”; Are
you in need of substance abuse services? Yes or No”. Human review
of a random sample of documents at the concept level indicated that
the NLP algorithm generated 28% false positives in extracting con-
cepts related to homelessness when templates were ignored among
the H documents. When the algorithm was refined to include tem-
plates, the false positive rate declined to 22%. For the NH documents,
the corresponding false positive rates were 56% and 21%.

Conclusions
To our knowledge, this is one of the first attempts to address the

problem of information extraction from templates or templated sec-
tions of the EMR. A key challenge of templates is that they will most
likely lead to poor performance of NLP algorithms and cause bottle-
necks in processing if they are not considered. Acknowledging the
presence of templates and refining NLP algorithms to handle them
improves information extraction from free text medical notes, thus
creating an opportunity for improved surveillance using the EMR.
Algorithms will likely need to be customized to the electronic med-
ical record and the surveillance domain of interest. A more detailed
analysis of the templated sections is underway.

Keywords
natural language processing; surveillance; templates; VA

Acknowledgments

Funding from the US Department of Veterans Affairs (HSR&D); re-
sources from Veterans Informatics Computing Infrastructure and VA Salt
Lake City Health Care System and all our research team members who
have worked on this project.

References

1. DeLisle, S., et al., Combining free text and structured electronic med-
ical record entries to detect acute respiratory infections. PloS one,
2010. 5(10): p. e13377.

*Adi Gundlapalli
E-mail: adi.gundlapalli@hsc.utah.edu

Online Journal of Public Health Informatics * ISSN 1947-2579 * http://ojphi.org * 5(1):e75, 2013