

Improving Agent Based Models and Validation through Data Fusion 

1 
Online Journal of Public Health Informatics * ISSN 1947-2579 * http://ojphi.org * Vol.3, No. 2, 2011 

Improving Agent Based Models and 
Validation through Data Fusion  

 
Marek Laskowski, Bryan C.P. Demianyk, Marcia R. Friesen, Robert D. McLeod and 

Shamir N. Mukhi¹

Internet Innovation Centre, University of Manitoba

¹The Canadian Network for Public Health Intelligence (CNPHI)

1015 Arlington Street, Winnipeg, MB 

 
Abstract 
 

This work is contextualized in research in modeling and simulation of infection spread 

within a community or population, with the objective to provide a public health and policy 

tool in assessing the dynamics of infection spread and the qualitative impacts of public 

health interventions.  This work uses the integration of real data sources into an Agent 

Based Model (ABM) to simulate respiratory infection spread within a small municipality.   

Novelty is derived in that the data sources are not necessarily obvious within ABM 

infection spread models. The ABM is a spatial-temporal model inclusive of behavioral and 

interaction patterns between individual agents on a real topography.  The agent behaviours 

(movements and interactions) are fed by census / demographic data, integrated with real 

data from a telecommunication service provider (cellular records) and person-person 

contact data obtained via a custom 3G Smartphone application that logs Bluetooth 

connectivity between devices.  Each source provides data of varying type and granularity, 

thereby enhancing the robustness of the model.  The work demonstrates opportunities in 

data mining and fusion that can be used by policy and decision makers.  The data become 

real-world inputs into individual SIR disease spread models and variants, thereby building 

credible and non-intrusive models to qualitatively simulate and assess public health 

interventions at the population level.    

 
Keywords: Agent Based Modeling; Personal Contact Patterns.   

 
Introduction 
 

Complex networks underlie the transmission dynamics of many epidemiological models of 

disease spread, in particular agent based models (ABM). Network-based epidemiological 

models use a percolation-like principle to simulate disease spread through the population [1]. 

Agent based models are being increasingly employed due to their potential to capture 

complex emergent behaviours during the course of a simulated epidemic, where these 

behaviours arise from the non‐linearities of human-human contacts. ABMs may employ an 
explicit or implicit social contact network defined by structured agent interactions. In the 

explicit case, a disease model (e.g., susceptible ‐ exposed ‐ infected ‐ recovered, SEIR type) 
can be implemented directly on the network, although in the case of ABM, these resemble 

simulation models rather than the steady state analysis of network based models mentioned in 

[1]. 

In all cases, though, the fidelity of an agent-based model relies in part on the credibility of the 

social contact network data that feeds it.  Potential data sources include census and 



demographic data (coarse) and finer-grained data made available by various means of 

polling personal electronics such as cell phones. In related work, it was demonstrated that 

data to model a social contact network can be collected through web services or wireless 

sensory devices or “motes” worn by individuals in the target population and subsequently 

used in an infectious disease spread model [2]. Such an approach has been previously 

undertaken to gather data, for example in an organization (workplace or school). The 

resulting estimated social contact network was used to model an influenza‐like illness (ILI) 
within the setting [3], based on a standard SEIR type model. In this time‐stepped model, 
infection spreads between two vertices (individuals) along the weighted edges of the network 

which represent the amount of social contact between the two individuals. Estimating social 

contact networks in larger populations (metropolitan scale or larger) is an area of research 

still in its relative infancy.  
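A minimal sketch of such a time-stepped SEIR model on a weighted contact network is given here for illustration. It is not the implementation used in [3]; the parameter names and values (`beta`, `latent_steps`, `infectious_steps`) and the rule that turns an edge weight into an infection probability are assumptions chosen for clarity.

```python
import random

# Compartment labels for a standard SEIR model.
S, E, I, R = "S", "E", "I", "R"


def step(state, timers, edges, beta, latent_steps, infectious_steps, rng):
    """Advance the whole network by one time step."""
    new_state, new_timers = dict(state), dict(timers)

    # Transmission: an infectious vertex may expose a susceptible neighbour.
    # Assumption: the edge weight acts as a number of independent exposures.
    for (u, v), weight in edges.items():
        for src, dst in ((u, v), (v, u)):
            if state[src] == I and state[dst] == S and new_state[dst] == S:
                if rng.random() < 1.0 - (1.0 - beta) ** weight:
                    new_state[dst] = E
                    new_timers[dst] = latent_steps

    # Progression: E -> I after the latent period, I -> R after the infectious period.
    for node, status in state.items():
        if status in (E, I):
            new_timers[node] -= 1
            if new_timers[node] <= 0:
                if status == E:
                    new_state[node], new_timers[node] = I, infectious_steps
                else:
                    new_state[node] = R
    return new_state, new_timers


def simulate(edges, seed_node, steps, beta=0.05, latent_steps=2,
             infectious_steps=4, seed=0):
    """Run from a single infectious seed; return final states and per-step counts."""
    rng = random.Random(seed)
    nodes = {n for edge in edges for n in edge}
    state = {n: S for n in nodes}
    timers = {n: 0 for n in nodes}
    state[seed_node], timers[seed_node] = I, infectious_steps
    history = []
    for _ in range(steps):
        state, timers = step(state, timers, edges, beta, latent_steps,
                             infectious_steps, rng)
        history.append({c: sum(1 for s in state.values() if s == c)
                        for c in (S, E, I, R)})
    return state, history
```

Here `edges` maps a pair of individuals to the amount of contact between them, for example `{("a", "b"): 2.0, ("b", "c"): 1.0}`.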

 
In cases where precise contact network data is unavailable, an alternative is to mine data as 

done by EpiSimS [4] which uses United States Department of Transportation information to 

estimate the schedules of the agents in several metropolitan areas. This presumes that the 

choice of places for agents to interact is constrained by the transportation network (model), 

which itself is a complex network. Schedules for the agents are synthesized from census and 

USDOT data. A simulation is then run during which a synthetic contact network is 

constructed from the interactions of the agents and their locations. The resulting dynamic 

bipartite graph [4] is used to simulate disease spread in the manner stated earlier, except on a 

much larger scale. Both EpiSimS and another well‐vetted infectious disease simulator, 
BioWar [5], initially perform validation on model components separately. This is an 

important component of plausibly reasoned argument, supporting the statement that the 

model as a whole functions as specified. 

 
The objective of the present work is to investigate methods to begin validating ABMs in 

varying stages of development by comparing extracted contact networks to known theoretical 

social contact network models. Ideally, networks which embed some notion of space or time 

will be essential drivers of disease spread in the real world. Thus, extracted networks may 

need to be weighted, for example, to associate weight with the time period during which two 

agents were in contact. The first such model is of a rural community in the province of 

Manitoba, Canada.  The emphasis in this work is in integrating data from emerging sources 

that can be used within discrete time and space disease spread ABMs. The contagions of 

interest are influenza like illnesses (ILI) or other respiratory infections that are primarily 

contracted through direct or proximal contact.   

 
Methods 
 

In the first part of the study, we discuss a small scale ABM of two adjacent communities in the 

Rural Municipality of Stanley, Manitoba with a combined population of approximately 16,500 

residents:  Winkler, Manitoba at 10,000 residents and Morden, Manitoba at 6500 residents.  

This is a spatial temporal model with demographic data coming from Statistics Canada [6].  

From this perspective, agents are provided with schedules, and a model of disease spread is 

run. Figure 1 illustrates the topography of the region of interest. 

 

 
Figure 1: The topography of the rural region of interest  

 
The towns of Morden and Winkler are roughly seven miles apart in southwest Manitoba. One 

of the reasons for selecting this area is that it is representative of many North American rural 

municipalities. Figure 1 also illustrates the location of three cellular service towers with MTS 

Allstream as the service provider. The ABM is discussed in terms of model validation using 

data that is mined from anonymized cell phone use records. In addition to cell phone usage, 

the model is also improved using a Smartphone application that provides greater fidelity of 

proximity contacts using Bluetooth enabled devices as proxies for people.  

 
There are two primary obstacles to fusing data to a model. The first is the collection of the 

data, with assurances that the data collected is meaningful and accurate, and mining or 

interpreting the data for parameters or characteristics useful to the model. The second 

difficulty is integrating the data into the model itself, running simulations and attempting to 

qualify (and ideally, quantify) the outputs. In many instances, the results of the simulations 

may be self-fulfilling; for example, overcrowding in isolated and impoverished communities leads to 

increased infection spread. The interventions that one could model may provide guidance for 

policies that may then be considered. For example, an intervention associated with reducing 

infection spread may be a recommendation to stay home while ill; in overcrowded residential 

communities a more effective intervention may be quarantine or a modified quarantine policy 

whereby an infected person may be advised to seek temporary housing in a facility set up 

specifically for that purpose. While somewhat self-evident, modeling with real data may help 

to elucidate these types of options or interventions. 

 
The Model and ABM Simulator 

The model described here is a milestone in the process of designing and implementing an 

ABM simulation framework geared towards high fidelity modeling of human institutions of 

varying scales. The broad design goals of this framework, called Simstitution, are based on the 

collective experience of the authors gained while developing Agent Based Models of human 

institutions. Originally, models of hospital emergency departments [7] and cities [8] were 

implemented upon “one-shot” simulators, that is, a simulator strongly coupled to the specific 

modeling application [9]. A one-shot simulator is comparatively easy to implement, and gives 

the modeler fine control over the simulator processes, enabling them to fulfill their 

requirements. Typically, in order to minimize development effort, the designer will make 

assumptions which ease the implementation of the model at hand, without consideration for 

how these assumptions will constrain or complicate re-purposing the simulator to implement a 

different model. From a software engineering perspective, part of the reason that one-shot 

models are so easy to produce is that little or no effort goes into making the software reusable or 



extendible. The large number of one-shot simulators observed in the literature [9] is 

problematic because by their nature they are difficult to re-use. The reusability of the simulator 

in turn affects the reliability of the simulator; the more researchers that (re)use a particular 

simulator the more chances that bugs will be identified and fixed. Furthermore, when a 

number of models produce reasonable results using a common simulator, confidence in the 

credibility of the simulator is increased. Publishing results from a series of models built upon a 

common simulator framework, combined with verification of model components (or sub-

models), is a common path for building confidence in simulator frameworks for 

epidemiological modeling [4][5]. 

 
Simstitution Design Goals 

Although there are several frameworks [10]-[15] which can be used to develop agent based 

models, these are dwarfed by the number of one-shot or otherwise domain-specific simulators, 

suggesting that no framework has yet hit upon a “sweet-spot” between flexibility, 

extendibility, and specific support classes for human-centric domains [9]. Human-centrism 

includes the notion that Agents are spatially oriented and situated since humans are physical 

entities that occupy and traverse space, rather than existing in some abstract information 

domain. Simulator support for a range of human time steps on the order of seconds to hours or 

days is also desirable. 

 
Other design features include adherence to software engineering principles to improve re-use 

and maintainability of the framework, as well as extendibility especially where machine 

learning can be leveraged for automated generation of agent policy [16][17]. 

 
For rapid model construction, a next generation ABM framework should facilitate the 

incorporation of real-time data, such as from a database, leading to increasingly data-driven 

simulation. A tool for visualization and interacting with the model in a graphical manner 

(GUI) also facilitates model development, validation, and debugging. Visualization is also key 

for communicating results with subject matter experts and stakeholders [18]. Such a 

visualization tool can also be extended to serve as a tool for model construction or editing 

model parameters imported from real data.  

 
The accessibility of agent behavior development to persons with a non-programming 

background can be improved by first providing a scripting layer on top of the compiled code, 

and then perhaps adding a visual or block (e.g. OpenBlocks [19]) programming (drag and 

drop) on top of that. Over time a library of useful scripted behaviors can be built up. 

 
The increasing availability of parallel or distributed computing systems also suggests that 

contemporary or future agent based simulator frameworks should support distributed, 

parallel, or cluster computing. The increasing availability of cluster-based compute resources 

(a consequence of Moore’s Law), sensitivity to real-time computational constraints, and 

medical data privacy issues augur well for cluster-based computing. As a result, the 

Simstitution design emphasizes scalability with respect to multiple processors and discrete 

memory spaces over efficiency in executing one particular type of model.  

 
Currently, there is an emergence of general-purpose computing on graphics processing units 

(GPGPU) as excellent accelerators for data parallel applications with regular data access 

patterns.  This leads to opportunities for accelerating agent based simulation as well. However, 

optimization is still challenging, as the data access patterns are still somewhat irregular for 



most ABMs. Currently, GPUs are very well suited to ABMs that resemble cellular automata, 

percolation, game-of-life, or particle swarm models. Without doubt, higher level ABM (social 

autonomous interacting agents) simulations will also benefit from the compute resources of 

GPUs as the technology evolves (optimizing compilers, etc.). 

 
Naturally, limiting the degree of accessibility of the environment limits what Agents can 

perceive and interact with in the environment (including other Agents).   Localizing agent 

perception not only fits in well with the Agent paradigm, it also limits the extent to which 

information needs to be shared between processes in a distributed model, which should 

facilitate using spatial decomposition as a guide for distributing computational load. 

 
These disparate goals require balance in feature choice and design.  

 
Simstitution design details 

Simulated entities within Simstitution fall into one of two major categories: Agents 

(SimAgent), which are the autonomous entities that make decisions and interact with the 

environment; and instances of the SimRegion class, which represent spatially partitioned 

subdivisions of the environment. Note from Figure 2 that the SimObject is abstract, and exists 

because SimAgent and SimRegion have much of their interfaces in common. 

 
Figure 2: Class diagram for core Simstitution class hierarchy 

 
One of the core design tenets of Simstitution is that the spatial division is closely intertwined 

with the division of computational work across processors and discrete memory boundaries. 

Therefore, SimRegion is a unit of spatial decomposition as well as a convenient unit of 

computation. In the latter role, it can be considered as a container for agents that need to have 

their next state computed. Figure 3 illustrates the details of this relationship. A particular 

instance of SimRegion can be the parent container of SimAgents or SimRegions but not both 

types at the same time. This restriction will in practice result in tree hierarchies of 

SimRegions, with SimAgents contained in the leaf SimRegions, and the “top region” at the 

root of the tree. The SimRegion spatial decomposition granularity becomes increasingly fine 

away from the root and towards the “leaf regions” of the tree. 

 
Time advances in the simulation when the simulator advances the time of the top region (root 

of the tree) by some discrete time step. The top region will then advance the time of its 

children by the same time step in a recursive fashion such that the tree is traversed in a depth 

first manner until all the SimAgents in the leaf regions have been simulated for that time step. 

The simulator will restart this process again, until a certain number of time steps have elapsed. 
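The containment rule and the depth-first time advance described above can be sketched as follows. The class names follow the text; the method names (`add`, `advance`, `run`) and the `time` attribute are assumptions made for illustration and are not taken from the Simstitution source.

```python
class SimObject:
    """Abstract base holding what agents and regions have in common."""

    def __init__(self, name):
        self.name = name
        self.time = 0

    def advance(self, dt):
        raise NotImplementedError


class SimAgent(SimObject):
    def advance(self, dt):
        # A real agent would evaluate its policies here; the sketch only tracks time.
        self.time += dt


class SimRegion(SimObject):
    def __init__(self, name):
        super().__init__(name)
        self.children = []

    def add(self, child):
        # A region may contain SimAgents or SimRegions, but never both at once.
        if self.children and (isinstance(child, SimAgent)
                              != isinstance(self.children[0], SimAgent)):
            raise TypeError("a SimRegion cannot mix SimAgents and SimRegions")
        self.children.append(child)

    def advance(self, dt):
        # Depth-first recursion: advance this region, then each child in turn.
        self.time += dt
        for child in self.children:
            child.advance(dt)


def run(top_region, dt, steps):
    """Advance the top region repeatedly until `steps` time steps have elapsed."""
    for _ in range(steps):
        top_region.advance(dt)
```

Because every leaf region is advanced through a single call on its parent, a subtree is a natural candidate for assignment to one process in a distributed run.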



 
Figure 3: Relationships between core class instances, forming a tree 

 
IndividualPolicy is a modular unit that affects the behavior of the subscribed SimAgent, which 

may also require the IndividualPolicy to store encapsulated SimAgent state data specific to 

that IndividualPolicy. Examples are a schedule policy which causes the SimAgent to observe a 

particular day/night work/home schedule, or in the case of a hospital being modeled, a doctor 

policy which causes the SimAgent to treat patients within a hospital. Within a SimRegion, 

each possible concrete derived IndividualPolicy class has a corresponding GroupPolicy for 

that SimRegion. The GroupPolicy acts as a factory for the corresponding IndividualPolicy 

and, if required, facilitates coordination between one or more derived IndividualPolicy classes 

(e.g., a healthcare worker policy in a hospital that coordinates interaction between nurse and 

doctor IndividualPolicies). Implicit here is the assumption that the properties of the local 

environment constrain the behavior of agents (e.g., airport security lineup, swimming pool, 

hospital, bank, etc.). The associations between SimRegion, SimAgent, GroupPolicy, and 

IndividualPolicy are shown in Figure 4. 
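The factory and coordination roles of GroupPolicy can be illustrated with a short sketch based on the airport security lineup mentioned above. `LineupPolicy`, `LineupGroupPolicy`, and all method names are hypothetical and introduced only for this example; the actual Simstitution interfaces may differ.

```python
class IndividualPolicy:
    """Per-agent behavior module; keeps its own encapsulated agent state."""

    def __init__(self, agent_id, group):
        self.agent_id = agent_id
        self.group = group
        self.state = {}


class GroupPolicy:
    """Per-region counterpart that creates and coordinates individual policies."""

    policy_class = IndividualPolicy

    def __init__(self, region_name):
        self.region_name = region_name
        self.members = {}

    def subscribe(self, agent_id):
        # Factory role: build the matching IndividualPolicy for an arriving agent.
        policy = self.policy_class(agent_id, self)
        self.members[agent_id] = policy
        return policy


class LineupPolicy(IndividualPolicy):
    def position(self):
        # The agent asks the region-level policy where it stands in the queue.
        return self.group.order.index(self.agent_id)


class LineupGroupPolicy(GroupPolicy):
    policy_class = LineupPolicy

    def __init__(self, region_name):
        super().__init__(region_name)
        self.order = []

    def subscribe(self, agent_id):
        policy = super().subscribe(agent_id)
        self.order.append(agent_id)
        return policy

    def serve_next(self):
        # Coordination role: the region decides which agent leaves the lineup.
        agent_id = self.order.pop(0)
        del self.members[agent_id]
        return agent_id
```

In this arrangement the local environment (the lineup) constrains agent behavior, because an agent cannot advance in the queue except through the GroupPolicy of the region it occupies.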

 
Figure 4:  Relationships involving modular agent policies 

 
Communication or interaction between SimAgents exclusively uses messages passed between 

SimAgents. Messages received by a SimAgent are relayed to its IndividualPolicies which can 

lead to an internal change of state, or an action to be taken which could lead to additional 

messages being sent to other IndividualPolicies on the same subscribed SimAgent, or 

messages sent to other SimAgents. Message passing fits well with the agent paradigm, since 

the alternative implies a direct mapping between external events and internal agent state which 

violates the principle of agent autonomy [20]. 
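A minimal sketch of this message-passing scheme is given below, using the schedule and movement behaviors as an example. The `Message` fields, the `handle` and `process` method names, and the 9:00 to 17:00 work hours are assumptions made for illustration only.

```python
from collections import deque


class Message:
    def __init__(self, kind, sender, payload=None):
        self.kind = kind
        self.sender = sender
        self.payload = payload


class IndividualPolicy:
    def handle(self, agent, message):
        """Return (recipient, Message) pairs; a recipient of None means the same agent."""
        return []


class SimAgent:
    def __init__(self, name):
        self.name = name
        self.policies = []
        self.inbox = deque()

    def receive(self, message):
        self.inbox.append(message)

    def process(self):
        """Relay queued messages to every policy; return messages bound for other agents."""
        outgoing = []
        while self.inbox:
            message = self.inbox.popleft()
            for policy in self.policies:
                for recipient, reply in policy.handle(self, message):
                    if recipient is None or recipient is self:
                        self.inbox.append(reply)   # policy-to-policy on the same agent
                    else:
                        outgoing.append((recipient, reply))
        return outgoing


class SchedulePolicy(IndividualPolicy):
    def handle(self, agent, message):
        if message.kind == "tick":
            hour = message.payload % 24
            destination = "work" if 9 <= hour < 17 else "home"  # assumed schedule
            return [(None, Message("goto", agent.name, destination))]
        return []


class MovementPolicy(IndividualPolicy):
    def __init__(self):
        self.location = "home"

    def handle(self, agent, message):
        if message.kind == "goto":
            self.location = message.payload
        return []
```

No policy writes to the state of another policy directly; the only coupling between them is the content of the messages they exchange.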

 

Details of Small Town Model - Morden 

The current work incorporates the framework features mentioned in the previous section, and 

includes visualization capabilities to observe emergent model behavior during execution. The 

model is fairly basic so the SimRegion tree only consists of two layers: the root or top 

SimRegion (Morden) and the leaf SimRegions which represent the home, school, and work 

locations that agents occupy. The leaf SimRegions are arranged in a grid with empty spaces 

between structures to allow for SimAgent travel.  Agents are assigned work, school, and home 

locations based on demographic data [6].  

 
Figure 5: Screenshot of running simulation. Morden (left), close-up of 6 classrooms (right). 

  
Figure 5 shows a screenshot of the Morden simulation at a particular time step. On the left side 

the entire city is shown. On the right is a detailed view of six classrooms in the center of town 

in which individual SimAgent details can be seen. Details include the gender and age of the 

SimAgent, as well as disease status. Disease status is the most interesting, and is indicated by 

the color of the SimAgent icon. The icon changes color, with green indicating a susceptible 

state. Once an agent is infected, it turns yellow, orange, and then red, depending on how long it 

has spent in the infected state. Finally, recovered SimAgents turn blue. The leaf SimRegions 

are depicted as colored squares where the color of the square shows the aggregated disease 

state of the SimAgents within that region. SimRegions with no SimAgents contained inside 

are white. Those with one or more SimAgents display a blended color tile based on the 

aggregated disease state of the SimAgents inside. 

 
Four concrete IndividualPolicy subclasses were used to generate the SimAgent behavior in the 

Morden model. The SchedulePolicy determines whether a particular agent wants to be at its 

assigned work, school, or home, depending on the demographic profile of the particular 

SimAgent, and the current time which advances in increments of one hour. The 

SchedulePolicy sends messages containing the desired destination to the SimAgent’s 

MovementPolicy which handles the actual movement. The InfluenzaPolicy maintains the 

particular SimAgent’s disease state, and if in the Infected state, sends “infection” messages to 

other SimAgents in the same SimRegion, which is how disease spreads between SimAgents.  

Finally, the BluetoothTrackingPolicy emulates the Bluetooth Smartphone contact app, and is 

the source of the synthetic contact data. Currently, the corresponding GroupPolicies are used 

to facilitate aggregation of data in a spatially explicit manner to achieve the tiling effect in 

Figure 5. 
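The disease-state behavior and icon coloring described for the Morden model can be sketched as follows. This is an illustration under stated assumptions, not the InfluenzaPolicy source: the constructor parameters, the per-message transmission probability, and the division of the infected period into thirds for the yellow, orange, and red colors are all hypothetical.

```python
import random

SUSCEPTIBLE, INFECTED, RECOVERED = "S", "I", "R"


class InfluenzaPolicy:
    def __init__(self, infectious_hours, transmit_prob, rng):
        self.status = SUSCEPTIBLE
        self.hours_infected = 0
        self.infectious_hours = infectious_hours
        self.transmit_prob = transmit_prob
        self.rng = rng

    def infect(self):
        if self.status == SUSCEPTIBLE:
            self.status = INFECTED
            self.hours_infected = 0

    def on_infection_message(self):
        # A susceptible agent becomes infected with an assumed fixed probability.
        if self.status == SUSCEPTIBLE and self.rng.random() < self.transmit_prob:
            self.infect()

    def tick(self, others_in_region):
        """One hourly step: message co-located agents, then age the infection."""
        if self.status != INFECTED:
            return
        for other in others_in_region:
            other.on_infection_message()
        self.hours_infected += 1
        if self.hours_infected >= self.infectious_hours:
            self.status = RECOVERED

    def color(self):
        if self.status == SUSCEPTIBLE:
            return "green"
        if self.status == RECOVERED:
            return "blue"
        # Assumed thresholds: thirds of the infected period.
        if 3 * self.hours_infected < self.infectious_hours:
            return "yellow"
        if 3 * self.hours_infected < 2 * self.infectious_hours:
            return "orange"
        return "red"


def simulate_hour(region_members):
    """Advance every policy in one SimRegion by one hour."""
    # Snapshot the infected agents first so new infections start transmitting next hour.
    infectious = [p for p in region_members if p.status == INFECTED]
    for policy in infectious:
        policy.tick([q for q in region_members if q is not policy])
```

Restricting the infection messages to agents in the same SimRegion is what ties disease spread to the schedule-driven movement of agents between home, school, and work.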



 
Framework Roadmap – Next Steps 

The next step or milestone will be to extend the framework by developing modules to simulate 

finer granularity time and space, namely facilities for Agents to find paths and steer through 

complex environments at time steps on the order of seconds. One such prototypical 

institutional environment would be improved hospital models [7]. 

 
Following that, we intend to scale up the number of agents, leveraging parallelism where 

possible to determine whether spatial partitioning will facilitate execution speedup and if so, 

under what conditions.  

 
In a concurrent development process (possible due to the modularity of the design) tools are 

also being created to facilitate the integration of increasingly detailed data such as street maps 

and demographics of places such as Morden. 

 
In order to promote the ideals of software re-use, once the core Simstitution simulator has 

reached a reasonable level of functional maturity, the code will be made available to other 

researchers under a general public license. 

 
Results 

Augmenting Data Sources  

In addition to demographic data, the two sources of augmenting data here are associated with 

coarse grained data from anonymized cellular records and a finer grained Smartphone 

application programmed to log close-proximity Bluetooth devices. Data from cellular records 

typically provide service providers with input for network planning, investments, and 

management of evolving needs.  This type of data also has considerable application to public 

health interests, although at this time it is difficult to derive its direct benefit in contrast to 

more explicit inputs such as those associated with census and demographic data, due to both 

technology and policy issues.  

 
 Cellular Data  

Data from four consecutive weekdays in November 2010 was extracted from the data 

provided by the cellular service provider.  The data includes the cell tower GPS and antenna 

sector (if applicable) that the mobile device is associated with, the AAA record (every time the 

phone accesses the network excluding voice and SMS), and time stamp of the access.    Even 

at four days, this represented just over 14 GB of data. Once processed for the connections with 

the towers of interest (Figure 1), this amounted to just under 500,000 records. Although 

statistical in nature, the data can be further processed to estimate flux of persons between the 

two neighboring towns. Within an infection spread model, this type of information helps in 

estimating patterns of movement that contribute to infection spread. Once stored in a database, 

queries allowed for extracting anonymized device activities. Figure 6 illustrates the breakdown 

of mobile devices accessing the towers in Morden and/or Winkler. For an individual, a duty 

cycle can be estimated, illustrating the percentage of time a person is likely to be in one region 

or another. The timestamp can also be used to infer the primary community of residence. User 

counts here indicate that approximately 2650 users remained in Morden, approximately 485 

users remained in Winkler, while 2285 users spent time in both Morden as well as Winkler 

over the four day data collection period.  
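The grouping of devices by community and the estimate of a duty cycle can be sketched as below. The record layout `(device_id, tower_id, timestamp)`, the tower identifiers, and the use of the share of access records as a proxy for time spent in each town are assumptions for illustration; the provider's actual schema is not described here.

```python
from collections import defaultdict


def summarize_devices(records, tower_town):
    """Group anonymized devices by town(s) accessed and estimate a duty cycle.

    records: iterable of (device_id, tower_id, timestamp) tuples.
    tower_town: mapping from tower_id to town name for the towers of interest.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for device_id, tower_id, _timestamp in records:
        town = tower_town.get(tower_id)
        if town is not None:              # ignore towers outside the region of interest
            counts[device_id][town] += 1

    groups = defaultdict(int)
    duty_cycle = {}
    for device_id, per_town in counts.items():
        total = sum(per_town.values())
        # Assumption: fraction of access records approximates fraction of time.
        duty_cycle[device_id] = {town: n / total for town, n in per_town.items()}
        label = "both" if len(per_town) > 1 else next(iter(per_town)) + " only"
        groups[label] += 1
    return dict(groups), duty_cycle
```

A record-count proxy of this kind is coarse, since devices that access the network more often in one town are over-weighted there; weighting by connection duration would be a natural refinement.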



 
Figure 6: Morden  and/or Winkler Mobile User Aggregates 

 
This data can be refined further based upon those with access records in both Morden and 

Winkler.  Figure 7 illustrates the breakdown of users who access cell towers in both 

communities over the duration of a single connection of their cellular device to the network. 

The actual device accesses between the two communities break down as approximately 65/35, 

reflecting durations more accurately.  

 
Figure 7: Breakdown of users with records in both communities 

 
Bluetooth Smartphone Data  

The second source of data was a Smartphone application designed to poll its local 

environment at regular intervals for close-proximity Bluetooth enabled devices. The 

application is representative of automated and non-intrusive proximity data collection methods 

where it is tacitly assumed that consumer electronics serve as proxies for their users. This 

assumption has limitations, including the disproportionate distribution of cellular devices 

within a given population to certain demographic subsets; yet, arguably these techniques have 

increasing credibility as more and more people carry electronic devices.  To date, a pilot test 

has been undertaken with four Smartphones collecting data on close-proximity Bluetooth-

enabled devices for just over a three month period. During this time approximately 500,000 

records were collected. Platforms to date include Blackberry Storm and HTC Hero devices.  

Data includes the MAC and any assigned meta-identity of both the probe device and the 

polled (probed) device, the timestamp, and a location if the probe device is GPS-enabled. 
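One way such scan records might be reduced to contact durations is sketched below. The tuple layout, the polling interval, and the gap threshold used to split sightings into separate contact episodes are assumptions for illustration; they are not parameters reported for the pilot.

```python
from collections import defaultdict


def contact_durations(scans, poll_interval_s, max_gap_s):
    """Estimate total contact time per (probe, seen) device pair.

    scans: iterable of (probe_mac, seen_mac, timestamp_s) tuples.
    poll_interval_s: assumed time between consecutive polls.
    max_gap_s: sightings further apart than this start a new contact episode.
    """
    sightings = defaultdict(list)
    for probe_mac, seen_mac, timestamp_s in scans:
        sightings[(probe_mac, seen_mac)].append(timestamp_s)

    durations = {}
    for pair, times in sightings.items():
        times.sort()
        # Every episode is credited with one polling interval for its first sighting.
        total = poll_interval_s
        for previous, current in zip(times, times[1:]):
            gap = current - previous
            # Short gaps extend the episode; long gaps start a new one.
            total += gap if gap <= max_gap_s else poll_interval_s
        durations[pair] = total
    return durations
```

The resulting per-device totals are the quantities to which a duration distribution or a rank ordering can then be fitted.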

 

Figure 8 illustrates samples of the data collected and residing on the database. Some records 

provide more information than others and as such, several records are perhaps more interesting 

than others.  The second highlighted row indicates a device called “General Motors”, scanned 

while the Agent 2 probe was on a local highway.  Many other devices are much more easily 

identified and more easily associated with actual persons.  Culling of Bluetooth devices that 

are not obviously a person is possible but not undertaken here at this time.  

 
Figure 8: A sample of data collected 

 
The Bluetooth contact data is conjectured to be a type of data that can be described by 

empirical laws. The distribution used follows the Pareto Law. Pareto's law is given in terms 

of the cumulative distribution function (CDF), i.e. in this case the number of contacts (Nc) 

with duration larger than or equal to a given duration d is an inverse power of that duration, as 

expressed below: 

$$N_c(\geq d) \propto d^{-k}, \qquad k > 0$$

 
From the Pareto distribution, a power law exponent was calculated and varied from 1.4 to 1.75 

for the four probe devices used (R² values were consistently above 0.95). A power law 

exponent less than 2 implies that there is no first moment or mean associated with the 

distribution.  As the data obtained from the probe devices is finite, a mean can be calculated, 

though. An interesting but not surprising parameter that can be extracted from the Pareto 

principle is the 80/20 rule. From the data collected, the 80/20 rule was applied to indicate the 

number of contacts that comprised 80% of the total contact duration.  From this, it was 

estimated that 80% of a person’s time is spent with a number of personal contacts that varied 

between 7 and 20, for the four probe devices. This was extracted from the number and 

duration of contacts with approximately 5,000 unique Bluetooth devices probed. This is 

consistent with intuition that although the total number of daily contacts may be large, the 

majority of one’s time is spent with only a small number of people. 
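The two calculations in this paragraph can be sketched as follows: a straight-line fit on log-log axes to the counts of contacts at or above each duration, and the number of contacts that account for 80% of total contact time. The least-squares fit is an assumed method, suggested by the reported R² values but not confirmed by the text, and the function names are hypothetical.

```python
import math


def ccdf_points(durations):
    """Return (d, N_c) pairs, where N_c counts contacts with duration >= d."""
    ordered = sorted(durations)
    n = len(ordered)
    points = []
    for i, d in enumerate(ordered):
        if i == 0 or d != ordered[i - 1]:
            points.append((d, n - i))
    return points


def fit_power_law(points):
    """Least-squares line through (log d, log N_c); returns (exponent, r_squared).

    Requires positive durations and at least two distinct points.
    """
    xs = [math.log(d) for d, _ in points]
    ys = [math.log(c) for _, c in points]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope, (sxy * sxy) / (sxx * syy)


def contacts_for_share(duration_by_contact, share=0.8):
    """Smallest number of contacts whose durations cover `share` of the total."""
    ordered = sorted(duration_by_contact.values(), reverse=True)
    target = share * sum(ordered)
    running = 0.0
    for count, duration in enumerate(ordered, start=1):
        running += duration
        if running >= target:
            return count
    return len(ordered)
```

Applied to one probe device, `contacts_for_share` corresponds to the figure of 7 to 20 personal contacts quoted above, although the exact procedure used in the pilot may differ.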

 
Evolving the ABM 

This section discusses how models, in this case the ABM, can be improved and validated to 

some degree through inclusion of as many data sources as practical.  The first and most 

obvious would be using as accurate demographic data as possible. The ABM developed here 

is based on data obtained through the federal census by Statistics Canada.  In addition, 

models of schools have been refined to provide for reasonable class sizes, data which are 

estimated here but would benefit from using real data of this type.  With this model, a disease 

spread simulation was run and provided a baseline for modeling the spread of a respiratory 



infection or ILI.  Figure 9 illustrates the spread of a disease among an urban community, 

represented by Morden, in isolation.  

 
Figure 9: SIR Disease Spread Simulation 

 
In the first effort to improve the basic ABM, it was instrumented in terms of agent contacts 

and durations which should reflect the patterns in data extracted from the Bluetooth probe 

devices. The objective was to see how well the model reflected real person-person networks. 

For the baseline simulations of the single town ABM, typical contact patterns for all agents 

were instrumented. The results of this analysis are summarized as follows: 

 
 Figure 10: Rank ordering of all agents (aggregated) 

 
Figure 10 illustrates the rank ordering aggregated over all agents. The rank order exponent 

(Zipf’s law) is approximately 1.9. This yields an estimated power law exponent of 

approximately 1.53. The implication is that an agent’s contact pattern would follow a power 

law distribution (heavy tail) without finite moments. This result is expected from both the 

Bluetooth proximity pilot as well as intuitive perceptions of real face-to-face contact 

patterns. This instrumentation of the ABM helps validate it as approximating real world 

contact patterns. From these ABM simulations and the aggregated rank orderings, an 80/20 

rule can also be estimated. In this case, 80% of the contact durations are spent with 

approximately 4% of a person’s contacts (25/670). This again is consistent with data extracted 

from the Bluetooth data collection pilot. Figure 11 illustrates the rank ordering of contact 

parameterized by demographic. Intuitively these profiles appear reasonable. School age 

children spend considerable time with three groups: household members, school classmates, 

and friends. The knee in the curve of school age children is between 20 and 32. For samples of 

age groups the exponents associated with Zipf’s law are presented in Table I. Perhaps it is also 

intuitive that a 2 year old and a 70 year old have similar contact patterns, presumably though 



the 2 year old eats more dirt. Also the distribution of the adults perhaps reflects the famous 

quote by American philosopher and naturalist Henry David Thoreau who said, “The mass of 

men lead lives of quiet desperation”. This type of parameter extraction is also consistent with 

actual survey results reported in [21]. 
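
The rank-order exponent and the 80/20 estimate discussed above can be computed from a list of contact durations along the following lines. This is a minimal sketch, assuming a least-squares fit of log(duration) against log(rank); the text does not state which fitting procedure was used, and the function names and the synthetic data are hypothetical.

```python
import math

def zipf_exponent(durations):
    """Least-squares slope of log(duration) against log(rank), sign flipped."""
    ranked = sorted((d for d in durations if d > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(d) for d in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx

def contacts_covering(durations, share=0.8):
    """Smallest number of top-ranked contacts reaching `share` of total duration."""
    ranked = sorted(durations, reverse=True)
    target = share * sum(ranked)
    running = 0.0
    for count, d in enumerate(ranked, start=1):
        running += d
        if running >= target:
            return count
    return len(ranked)

# Synthetic durations following an exact rank-size law with exponent 1.9
# (illustrative only; these are not the simulation's contact data).
synthetic = [1000.0 * rank ** -1.9 for rank in range(1, 671)]
fitted = zipf_exponent(synthetic)
top_contacts = contacts_covering(synthetic, share=0.8)
```

Dividing `top_contacts` by the number of contacts gives the fraction of contacts that accounts for 80% of the contact duration, which is the quantity reported above as 25/670 for the simulated agents.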

 
The rank ordering implies that the coefficient associated with the 

corresponding Pareto distribution would be between 0 and 1. The lack of a finite mean in the 

corresponding contact PDF approximation would imply that a few long duration contacts are a 

significant vector of infection spread. In these cases the (heavy) tail wags the dog. 
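
The link between a rank-order exponent of about 1.9, a power law exponent of about 1.53, and a Pareto coefficient between 0 and 1 can be made explicit. The sketch below assumes the standard correspondence between a rank-size law x(r) proportional to r^(-s) and a Pareto tail P(X > x) proportional to x^(-1/s), so that the density falls off as x^(-(1 + 1/s)). The text does not spell this relation out, so it is stated here as an assumption, and the function names are hypothetical.

```python
def pareto_tail_index(zipf_exponent):
    """Tail index of P(X > x) ~ x^(-index) implied by a rank-size exponent s."""
    return 1.0 / zipf_exponent

def power_law_pdf_exponent(zipf_exponent):
    """Exponent of the density p(x) ~ x^(-exponent) implied by s."""
    return 1.0 + 1.0 / zipf_exponent

def has_finite_mean(zipf_exponent):
    """A Pareto tail has a finite mean only when its tail index exceeds 1."""
    return pareto_tail_index(zipf_exponent) > 1.0

s = 1.9                               # aggregated rank-order exponent from the text
tail = pareto_tail_index(s)           # roughly 0.53, i.e. between 0 and 1
pdf_exp = power_law_pdf_exponent(s)   # roughly 1.53
finite_mean = has_finite_mean(s)      # False: the mean does not exist
```

Under the same assumption, any rank-order exponent greater than 1 gives a tail index below 1 and therefore no finite mean.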

 
Figure 11: Rank ordering of agents of different demographics 

 
TABLE I.  ZIPF EXPONENTS FOR VARIOUS DEMOGRAPHICS 

Age    Zipf Exponent    R²
2      -1.86            0.76
6      -1.51            0.80
12     -1.85            0.78
16     -1.66            0.80
20     -2.0             0.94
30     -1.87            0.96
40     -1.95            0.95
50     -1.95            0.97
70     -1.50            0.85

 
Discussion 

Other means of validating the data from a simulation like this ABM include its relation to 

other types of published data. For example, in [21] contact patterns are analyzed as derived 

from a large population survey that indicated that for their preliminary modeling “5- to 19-year-olds 

are expected to suffer the highest incidence during the initial epidemic phase of an 

emerging infection transmitted through social contacts measured here when the population is 

completely susceptible”. These expectations are consistent with the contact patterns generated 

by our ABM. 

 

In the second instance of enhancing the ABM, it was recognized that Morden does not exist in 

isolation and as such, flux of persons into and out of the area is required. This is not unlike 

large scale efforts where simulations are based upon data extracted from airline travel, for 

example. In this case the data - albeit voluminous - is reasonably extractable. It is more 

difficult to obtain inter-community travel in rural settings. In this environment, there are few if 

any directly available data sets but rather opportunities for inferencing from more disparate 

sources.  

 
Although an ABM running on a bounded topography may be applicable to geographically 

isolated communities, in semi-rural settings there is considerable interaction with surrounding 

towns that needs to be accounted for. From Figure 6, an indication of interactions between 

Morden and Winkler can potentially be inferred from cellular tower access. The data suggests 

that of the cell phone carrying persons (approximately 4000) with primary residence in 

Morden, approximately 34% are seen to have records in both Winkler and Morden, with that 

person spending on average 65% of their time in Morden and 35% in Winkler. Similarly of the 

approximately 1400 phone carrying persons with primary residence in Winkler, approximately 

65% are seen to have records in both Winkler and Morden, with that person spending on 

average 65% of their time in Winkler and 35% in Morden.  
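
Estimates of this kind can be derived from tower access records roughly as follows. This is a minimal sketch under stated assumptions: the record layout `(user_id, home_town, tower_town, minutes)` and the function name `town_presence` are hypothetical, and the actual format of the service provider's records is not described here.

```python
from collections import defaultdict

def town_presence(records, home_town, towns=("Morden", "Winkler")):
    """Summarize inter-town presence for users whose residence is `home_town`.

    records: iterable of (user_id, home_town, tower_town, minutes) tuples
             (a hypothetical schema used only for this sketch).
    Returns (share of residents with records in every town,
             mean share of time those users spend in their home town).
    """
    minutes = defaultdict(lambda: defaultdict(float))
    for user, home, tower_town, mins in records:
        if home == home_town and tower_town in towns:
            minutes[user][tower_town] += mins
    if not minutes:
        return 0.0, 0.0
    # Keep only users observed at towers in all of the listed towns.
    both = [m for m in minutes.values()
            if all(m.get(t, 0.0) > 0 for t in towns)]
    share_in_both = len(both) / len(minutes)
    if not both:
        return share_in_both, 0.0
    home_shares = [m[home_town] / sum(m.values()) for m in both]
    return share_in_both, sum(home_shares) / len(home_shares)

# Toy records, not real data.
toy_records = [
    ("a", "Morden", "Morden", 65), ("a", "Morden", "Winkler", 35),
    ("b", "Morden", "Morden", 100),
    ("c", "Winkler", "Winkler", 50), ("c", "Winkler", "Morden", 50),
]
morden_summary = town_presence(toy_records, "Morden")
```

Applied to each town in turn, the two returned values correspond to the two kinds of percentage quoted above: the share of residents seen in both towns and their average split of time.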

 
These very coarse estimates nonetheless allow one to begin modeling multiple communities 

and their interactions. One can burrow deeper into the data and determine periods of time a 

representative individual would spend in each community. Further simulations will include 

representative agent movement trajectories extracted from the cell records integrated into the 

simulator. Figure 12 illustrates a typical duty cycle associated with randomly selected users 

and their access to cellular towers in Morden and Winkler. The duty cycle plots for the first two users 

reinforce routine activity theory, as users are primarily seen in Morden during the night 

with intertown tower records primarily during the day. The third user’s behavior is 

considerably more erratic. In either case these types of trajectories are required in improving 

interacting ABMs.  
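
One simple way to turn such tower records into a trajectory that an agent could follow is to reduce each user's observations to a 24-hour duty cycle. The sketch below is an assumption about how this might be done, not the authors' procedure; the observation format `(hour, town)` and the function name `hourly_duty_cycle` are hypothetical.

```python
from collections import Counter

def hourly_duty_cycle(observations, default_town):
    """Return a list of 24 towns, one per hour of the day.

    observations: iterable of (hour, town) pairs, hour given as an integer
                  (a hypothetical format used only for this sketch).
    Each hour is assigned the town seen most often in that hour; hours with
    no observations fall back to `default_town` (for example, the residence).
    """
    counts = [Counter() for _ in range(24)]
    for hour, town in observations:
        counts[hour % 24][town] += 1
    return [c.most_common(1)[0][0] if c else default_town for c in counts]

# Toy observations for one user, not real data.
toy_observations = [
    (2, "Morden"), (2, "Morden"),
    (14, "Winkler"), (14, "Winkler"), (14, "Morden"),
]
cycle = hourly_duty_cycle(toy_observations, default_town="Morden")
```

A regular user yields a cycle with a clear night-time town and day-time town, while an erratic user yields a cycle that changes from day to day, which is one reason a single averaged cycle may be too coarse for some agents.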

 
Figure 12: Temporal sequence diagram of a user accessing towers in Morden and 

Winkler 

 
Model evolution is depicted in Figure 13 where external sources are integrated as they become 

available.  At present, these are done in a manual fashion but are amenable to automation 



and/or machine learning further adapting the model to the real world. In general the ABM for 

Winkler would follow a similar process of development.  A benefit of developing ABMs in this 

fashion is that they provide opportunities for increasing levels of computational efficiencies by 

exploiting parallel compute paradigms.  

 
Figure 13: SEIR Disease Spread Simulation 

 
Conclusion 
 

This work has demonstrated the potential of incorporating disparate data sources within an 

infection spread ABM with the objective to improve the credibility and validity of the model. 

The data sources included a Smartphone application that estimated proximate contacts and 

durations to similar devices, serving as proxies for collection of face to face data. The second 

source of data that is underexploited is associated with cellular phone logs in helping to 

estimate a person’s trajectory. 

 
There are a number of limitations in attempting to incorporate real data from somewhat 

disparate sources. Ideally one would like to compare the output of a disease spread model 

with major outbreaks. For a number of reasons this is not always possible. The purpose of 

such models is to aid in understanding how effective planned interventions will be in the event of 

future outbreaks. As such, when using ABMs, an objective is to make the models as accurate 

as possible using real data to the greatest degree possible. This is one of the major advantages 

of using ABMs, in that they lend themselves to the inclusion of real data, which is 

becoming increasingly available. Although not modeled here, there is also a significant 

medical facility intermediate between Morden and Winkler providing an effective vector for 

infection spread as both patients and health care workers largely come from both Morden and 

Winkler. 

 
Corresponding Author 
 

Robert McLeod 

mcleod@ee.umanitoba.ca 




 
References 
 

[1] Newman M. Spread of epidemic disease on networks. Physical Review E. 2002 

Jul;66(1):016128. 

[2] Demianyk BCP, Sandison D, Libbey B, McLeod RD, Eskicioglu R, Guderian R, Friesen 

MR, Ferens K, Mukhi SN. Technologies for generating personal social network contact 

graphs. 2010.  IEEE HealthCom 2010; 2002; Lyon, France. 

[3] Salathé M, Kazandjieva M, Lee JW, Levis P, Feldman MW, Jones JH. A high-resolution 

human contact network for infectious disease transmission.  PNAS, in press 2010. 

[4] Stroud P, Del Valle S, Sydoriak S, Riese J, Mniszewski S.  Spatial dynamics of 

pandemic influenza in a massive artificial society.  J Artificial Societies and Social 

Simulation. 2007; 10(4)9, http://jasss.soc.surrey.ac.uk/10/4/9.html. 

[5] Carley K, Altman N, Casman E, Fridsma D, Yahja A, Chen L, Kaminsky B, Nave D. 

BioWar: Scalable agent-based model of bioattacks.  IEEE Trans on Systems, Man, and 

Cybernetics. 2006;36:252-265. 

[6] Statistics Canada [Internet]. http://www.statcan.gc.ca/ 

[7] Laskowski M, McLeod RD, Friesen MR, Podaima BW, Alfa AS.  Models of emergency 

departments for reducing patient waiting times.  PLoS ONE. 2009;4(7):e6127. 

[8] Borkowski M, Podaima BW, McLeod RD. Epidemic modeling with discrete space 

scheduled walkers: Possible extensions to HIV/AIDS.  BMC Public Health. 2009; 

9(Suppl 1): S14, doi:10.1186/1471-2458-9-S1-S14. 

[9] Uhrmacher A, Weyns D, editors. Multi-Agent systems: Simulation and applications. 

New York: CRC Press; 2009. 

[10] Multi-agent Simulation Environment [Internet].  www.simsesam.de 

[11] Modeling & Simulation – Subproject James II [Internet].  

http://wwwmosi.informatik.uni-rostock.de/mosi/projects/cosa/james-ii/ 

[12] Luke S, Cioffi-Revilla C, Panait L, Sullivan K, Balan G. MASON: A Multi-Agent 

Simulation Environment.  Simulation: Trans of the Society for Modeling and Simulation 

International. 2005;82(7):517-527.  

[13] SPADES: System for parallel agent discrete event simulation [Internet]. 

http://spades-sim.sourceforge.net/ 

[14] XJ Technologies simulation software and services [Internet]. http://www.xjtek.com/ 

[15] SWARM development group wiki [Internet]. http://www.swarm.org/  

[16] Miller J. Active nonlinear tests (ANTs) of complex simulation models. Manage Sci. 

1998;44(6):820-30. 

[17] Laskowski M. An agent based decision support framework for healthcare policy, 

augmented with stateful genetic programming. 2010. Ph.D. Thesis, U of Manitoba. 

[18] Bonabeau E. Agent-based modeling: Methods and techniques for simulating human 

systems.  Proceedings of the National Academy of Sciences. 2002;99(Suppl 3):7280-7287, 

http://www.pnas.org/content/99/suppl.3/7280.full#xref-ref-3-1 

[19] MIT open blocks download page [Internet].  http://education.mit.edu/openblocks 

[20] Parunak HVD. Go to the ant: Engineering principles from natural multi-agent systems.  

Annals of Operations Research. 1997;75(0):69-101. 

[21] Mossong J, Hens N, Jit M, Beutels P, Auranen K, Mikolajczyk R, et al. Social Contacts and 

Mixing Patterns Relevant to the Spread of Infectious Diseases. PLoS Med. 

2008;5(3):e74. doi:10.1371/journal.pmed.0050074