2014.ISDS.Abstracts.Final.pdf ISDS Annual Conference Proceedings 2014. This is an Open Access article distributed under the terms of the Creative Commons Attribution- Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. ISDS 2014 Conference Abstracts Development of Genomic Surveillance Bioinformatics Modules Eishita Tyagi1, R. C. Hopkins*1, Dan Baker1, King Jordan2 and Shiyuyun Tang1, 2 1Booz Allen Hamilton, Atlanta, GA, USA; 2Georgia Institute of Technology, Atlanta, GA, USA Objective To develop a modular approach to infectious disease genomic analysis that can easily integrate with public health analytics systems. Using dynamic approaches to genomic sequence analysis, relevant whole genome data can be quickly and accurately visualized and correlated, using a minimum of computational resources. We propose to develop visualization modules that integrate disparate data sources including integrate geospatial location metadata with associated epidemiological factors to enable faster outbreak identification and enhance surveillance. Introduction Whole-genome sequencing of disease-causing organisms provides an unabridged examination of the genetic content of individual pathogen isolates, enabling public health laboratories to benefit from comparative analyses of total genetic content. Combining this information with sample metadata such as temporal, geospatial, morbidity, and mortality can greatly increase the efficacy of genomics analysis. However, with the vast amount of data generated by such techniques, meaningful, rapid, and accurate analysis that interprets and correlates nucleotide polymorphisms for public health practice presents many challenges. To this end we have created a modular genomics analysis toolkit that can easily integrate diverse data streams and couple analysis with an array of visualization platforms. Methods Using open source tools we have assembled an analysis package that automatically processes next generation sequencing (NGS) data from the ubiquitous Illumina MiSeq. FastQ files are uploaded, filtered, trimmed and assembled. The largest contiguous DNA assemblies are BLASTed (Basic Local Alignment Search Tool) to determine closest reference genome match in RefSeq to identify the species of any isolate sequenced. After reference determination, a custom gene by gene typing algorithm calculates the core genome alignment required for phylogenetic evolutionary analysis. This approach is based on the whole genome multiple sequence typing (wgMLST) approach that was developed to define a rapid universal identification and typing scheme for pathogens. Alternative genomic methods used to process NGS data for evolutionary analysis rely on first calculating high quality single nucleotide polymorphisms (SNPs) for all sequenced isolates with respect to the reference genome and then creating a phylogeny. These approaches however can be computationally expensive as the number of sequenced isolates increases. Our algorithm attempts to overcome these computational bottlenecks through the more efficient gene by gene typing approach. Additionally, a key component of our algorithm is a rapid tree construction module where we calculate the minimal set of genes that can effectively recreate the ideal (core genome) phylogeny at a user accepted threshold of consensus identity. Results This toolkit provides an automated analysis suite for processing isolate sequencing data directly from the Illumina MiSeq. Utilizing a minimal core genome algorithm simplifies the data sets and reduces overall compute time for even large data sets. Additional modules being developed utilize open source tools and common sequence formats to integrate evolutionary analysis results from quality scored whole genome sequences with geographical data in order to provide geospatial visualization of distinct and related isolates in an outbreak. Output data from the PERL modules is seamlessly integrated into open source C++ Qt libraries prepackaged to perform geospatial visualization and relatedness clustering using multidimensional scaling (MDS) approaches. Platform independent Qt libraries provide a cross-platform application framework for easy integration of these “genomic surveillance” modules into existing surveillance applications. The virtual overlay of phylogenetic relationships onto isolate maps provides population structure in epidemiological studies and provides a mechanism for rapid real time analysis of transmission chains and effective retrospective analysis of pathogen evolutionary trends. Conclusions Utilizing and analyzing raw whole genome sequence data directly from the Illumina MiSeq moves current capabilities one step closer to real-time infectious disease characterization. Minimal core gene alignment analysis allows for computation on systems commonly available to infectious disease laboratories, circumventing the need for computationally expensive analysis. These genomic methods, if implemented within existing public health laboratory response programs, promise to revolutionize the ability of the laboratory to provide information and evidence on the evolution, transmission and virulence for pathogenic organisms. Keywords Bioinformatics; Epidemiology; Next Generation Sequencing *R. C. Hopkins E-mail: hopkins_robert@bah.com Online Journal of Public Health Informatics * ISSN 1947-2579 * http://ojphi.org * (1):e96, 201