ISDS Annual Conference Proceedings 2018. This is an Open Access article distributed under the terms of the Creative Commons Attribution-
Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-commercial use, distribution,  
and reproduction in any medium, provided the original work is properly cited.

ISDS 2018 Conference Abstracts

Machine Learning for Identifying Relevance to 
Biosurveillance in Multilingual Text
Qiaochu Chen*1 and Lauren E. Charles2
1Tulane University, New Orleans, LA, USA; 2Pacific Northwest National Laboratory, Richland, WA, USA

Objective
The objective is to develop an ensemble of machine learning 

algorithms to identify multilingual, online articles that are relevant 
to biosurveillance. Language morphology varies widely across 
languages and must be accounted for when designing algorithms. 
Here, we compare the performance of a word embedding-based 
approach and a topic modeling approach with machine learning 
algorithms to determine the best method for Chinese, Arabic, and 
French languages.

Introduction
Global biosurveillance is an extremely important, yet challenging 

task. One form of global biosurveillance comes from harvesting 
open source online data (e.g. news, blogs, reports, RSS feeds). The 
information derived from this data can be used for timely detection 
and identification of biological threats all over the world. However, 
the more inclusive the data harvesting procedure is to ensure that 
all potentially relevant articles are collected, the more data that is 
irrelevant also gets harvested. This issue can become even more 
complex when the online data is in a non-native language. Foreign 
language articles not only create language-specific issues for Natural 
Language Processing (NLP), but also add a significant amount 
of translation costs. Previous work shows success in the use of 
combinatory monolingual classifiers in specific applications, e.g., 
legal domain [1]. A critical component for a comprehensive, online 
harvesting biosurveillance system is the capability to identify relevant 
foreign language articles from irrelevant ones based on the initial 
article information collected, without the additional cost of full text 
retrieval and translation.

Methods
The analysis text dataset contains the title and brief description 

of 3506 online articles in Chinese, Arabic, and French languages 
from the date range of August, 17, 2016 to July 5, 2017. The NLP 
article pre-processing steps are language-specific tokenization and 
stop words removal. We compare two different approaches: word 
embeddings and topic modeling (Fig. 1). For word embeddings, we 
first generate word vectors for the data using a pretrained Word2Vec 
(W2V) model [2]. Subsequently, the word vectors within a document 
are averaged to produce a single feature vector for the document. 
Then, we fit a machine learning algorithm (random forest classifier 
or Support Vector Machine (SVM)) to the training vectors and get 
predictions for the test documents. For topic modelling, we used a 
Latent Dirichlet Allocation (LDA) model to generate five topics for 
all relevant documents [3]. For each new document, the output is the 
probability measure for the document belonging to these five topics. 
Here, we classify the new document by comparing the probability 
measure with a relevancy threshold.

Results
The Word2Vec model combined with a random forest classifier 

outperformed the other approaches across the three languages (Fig. 
2); the Chinese model has an 89% F1-score, the Arabic model has 
86%, and the French model has 94%. To decrease the chance of 
calling a potentially relevant article irrelevant, high recall was more 

important than high precision. In the Chinese model, the Word2Vec 
with a random forest approach had the highest recall at 98% (Table 1).

Conclusions
We present research findings on different approaches of relevance 

to biosurveillance identification on non-English texts and identify the 
best performing methods for implementation into a biosurveillance 
online article harvesting system. Our initial results suggest that the 
word embeddings model has an advantage over topic modeling, 
and the random forest classifier outperforms the SVM. Directions 
for future work will aim to further expand the list of languages and 
methods to be compared, e.g., n-grams and non-negative matrix 
factorization. In addition, we will fine-tune the Arabic and French 
model for better accuracy results.

Table 1. Results of the Chinese model using different methods. Abbreviations 
in Text.

Figure 1: Methodology for comparing different methods to identify the best 
approach to classifying text data as relevant to biosurveillance.

Figure 2. F1-scores of different methods across languages. Abbreviations in 
Text.

Keywords
Machine learning; biosurveillance; natural language processing; 
multilingual; articles


ISDS Annual Conference Proceedings 2018. This is an Open Access article distributed under the terms of the Creative Commons Attribution-
Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-commercial use, distribution,  
and reproduction in any medium, provided the original work is properly cited.

ISDS 2018 Conference Abstracts

Acknowledgments
This work was supported by the Department of Homeland Security 
Science and Technology Directorate under DOE Contract Number 
DE-AC05-76RL01830 for the management and operation of Pacific 
Northwest National Laboratory.

References
[1] Gonalves T, Quaresma P. 2010. Multilingual text classification

through combination of monolingual classifiers. Proceedings of
the 4th Workshop on Legal Ontologies and Artificial Intelligence
Techniques, pp. 29-38

[2] Bojanowski P, Grave E, Joulin A, Mikolov T. 2016. Enriching Word
Vectors with Subword Information. arXiv preprint arXiv:1607.04606.

[3] Blei D, Ng A, Jordan M. 2003. Latent Dirichlet Allocation. The
Journal of Machine Learning Research. p.993-1022.

*Qiaochu Chen
E-mail: qchen7@tulane.edu

Online Journal of Public Health Informatics * ISSN 1947-2579 * http://ojphi.org * 10(1):e54, 2018