Layout 1


ISDS Annual Conference Proceedings 2012. This is an Open Access article distributed under the terms of the Creative Commons Attribution-
Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-commercial use, distribution, and
reproduction in any medium, provided the original work is properly cited.

ISDS 2012 Conference Abstracts

Searching for Complex Patterns Using Disjunctive
Anomaly Detection
Maheshkumar Sabhnani*, Artur Dubrawski and Jeff Schneider

Carnegie Mellon University, Pittsburgh, PA, USA

Objective
Disjunctive anomaly detection (DAD) algorithm [1] can efficiently

search across multidimensional biosurveillance data to find multiple
simultaneously occurring (in time) and overlapping (across different
data dimensions) anomalous clusters. We introduce extensions of
DAD to handle rich cluster interactions and diverse data distributions.

Introduction
Modern biosurveillance data contains thousands of unique time

series defined across various categorical dimensions (zipcode, age
groups, hospitals). Many algorithms are overly specific (tracking each
time series independently would often miss early signs of outbreaks),
or too general (detections at state level may lack specificity reflective
of the actual process at hand). Disease outbreaks often impact multi-
ple values (disjunctive sets of zipcodes, hospitals, multiple age
groups) along subsets of multiple dimensions of data. It is not un-
common to see outbreaks of different diseases occurring simultane-
ously (e.g. food poisoning and flu) making it hard to detect and
characterize the individual events.

We proposed Disjunctive Anomaly Detection (DAD) algorithm [1]
to efficiently search across millions of potential clusters defined as
conjunctions over dimensions and disjunctions over values along
each dimension. An example anomalous cluster detectable by DAD
may identify zipcode = {z1 or z2 or z3 or z5} and age_group = {child
or senior} to show unusual activity in the aggregate. Such conjunc-
tive-disjunctive language of cluster definitions enables finding real-
world outbreaks that are often missed by other state-of-art algorithms
like What’s Strange About Recent Events (WSARE) [3] or Large Av-
erage Submatrix (LAS) [2]. DAD is able to identify multiple inter-
esting clusters simultaneously and better explain complex anomalies
in data than those alternatives.

Methods
We define the observed counts of patients reporting on a given day

as a random variable for each unique combination of values along all
dimensions. DAD iteratively identifies K subsets of these variables
along with corresponding ranges of their values and time intervals
that show increased activity that cannot be explained by random fluc-
tuations (K is generally unknown and could be 0). The resulting set
of clusters maximizes data likelihood while controlling for overall
complexity. We have successfully derived a versatile set of scoring
functions that allow Normal, Poisson, Exponential or Non-paramet-
ric assumptions about the underlying data distributions, and accom-
modate additive-scaled, additive-unscaled or multiplicative-scaled
models for the clusters.

Results
We present results of testing DAD on two real-world datasets. One

of them contains daily outpatient visit counts from 26 regions in Sri
Lanka involving 9 common diseases. The other data contains semi-
synthetically generated terrorist activities throughout regions of
Afghanistan (Sigacts). Both span multiple years and are representa-
tive of data seen in biosurveillance applications.

Figure 1 shows DAD systematically outperforming WSARE and
LAS. Each algorithm’s parameters were tuned to generate one false
positive per month in baseline data. The graphs represent average
days-to-detect performance of 100 sets with synthetically injected
clusters using additive-scaled (AS), additive-unscaled (AU), and mul-
tiplicative-scaled (MS) models of cluster interactions.

Conclusions
We extend applicability of DAD algorithm to handle wide variety

of input data distributions and various outbreak models. DAD effi-
ciently scans over millions of potential outbreak patterns and accu-
rately and timely reports complex outbreak interactions with speed
that meets requirements of practical applications.

Keywords
outbreak detection; anomalous clusters; disjunctive anomaly detec-
tion; prospective surveillance

Acknowledgments

This material is based upon work supported by the National Science
Foundation under Grant No. IIS-0911032.

References

1. Sabhnani M., Dubrawski A., Schneider J. Detection of Multiple Over-
lapping Anomalous Clusters in Categorical Data. Advances in Dis-
ease Surveillance, 2010.

2. Shabalin A., Weigman V., Perou C., Nobel A. Finding Large Average
Submatrices in high dimensional data. Annals of Statistics 3(3):985-
1012, 2009.

3. Wong W., Moore A., Cooper G., Wagner M. What’s Strange About Re-
cent Events (WSARE). J. of Machine Learning Research, 6:1961-
1998, 2005.

*Robin Sabhnani
E-mail: sabhnani@cs.cmu.edu

Online Journal of Public Health Informatics * ISSN 1947-2579 * http://ojphi.org * 5(1):e14, 2013