

 CHEMICAL ENGINEERING TRANSACTIONS  
 

VOL. 77, 2019 

A publication of 

 
The Italian Association 

of Chemical Engineering 
Online at www.cetjournal.it 

Guest Editors: Genserik Reniers, Bruno Fabiano 
Copyright © 2019, AIDIC Servizi S.r.l. 
ISBN 978-88-95608-74-7; ISSN 2283-9216 

No Good Data Goes Unpunished 
Nicole Murchison  
Sandia National Laboratories, P.O. Box 5800, Albuquerque, NM 87185  
nmurchi@sandia.gov 

Data science includes a variety of scientific methods and processes to extract knowledge from data drawn from various sources. The
integration of interdisciplinary fields such as mathematics, statistics, information science, and computer 
science affords techniques to analyze large volumes of data to arrive at unique insights and make data-driven 
decisions (Sinelnikov et al., 2015) in real time. The technique lends itself to other applications across many 
domains including hazard assessments, analysis of near-miss data, identification of leading and lagging 
indicators from past accidents, and others. Benefits of this technique include efficiency due to improved data 
acquisition. Near-miss data represents an important source to identify conditions that lead to accidents to 
develop strategies to prevent them. Analysis of near-miss data sets can involve various techniques. This 
paper will explore the use of data science to mine accident reports, with a special emphasis on near misses to 
uncover occurrences that were not initially identified in the documentation.  Data-science techniques such as 
text analyses facilitate searching large volumes of data to uncover patterns for more informed decisions. 
Regarding near-miss data, data science techniques can be used to test the ability to uncover new hazards/ 
hazardous preconditions and the accuracy of those findings. Alongside the benefits of processing large data sets
and uncovering new hazards, this paper also considers how such analysis might
influence safety culture.

1. Introduction 

Failure can come about by technical, human, or organizational means which lead to dangerous 
circumstances. In the event that some defense adequately prevents any consequence from occurring, there is 
no accident. However, if that is not the case, the conditions for an accident exist. Whether there is an accident 
depends upon human action or some other preventative/mitigative strategy. If a strategy or mitigation is 
employed, then the event is characterized as a near-miss; should there be no intervention for prevention, the 
result is an accident with varying levels of consequence severity. In other words, a near miss is an event
that could have resulted in mild to severe consequences but, through some
intervention or by chance, did not (Aspden et al., 2004). Figure 1 demonstrates how near-misses occur, as well as
how they are related to adverse events. Near-miss data enables the identification of hazards or circumstances 
by which accidents occur and is invaluable because, by analyzing such data, organizations can reduce risk by 
understanding (1) the circumstances and hazards related to that risk and (2) those interventions or mitigations 
that prevented an accident from occurring.  
Across industries including aviation, nuclear power, healthcare, military operations, and others, the
importance of near-miss data analysis has been demonstrated (Barach & Small, 2000). The Three Mile Island disaster
resulted in a reporting system for near-misses in the nuclear power industry and intensified the focus on safety 
and safety outcomes. In aviation, near-miss data has been analyzed to improve air safety through system 
monitoring and a reporting system for individuals to contribute specific information. Analogous to nuclear 
power, near-miss data can augment accident data in aviation, where accidents are exceedingly infrequent, 
and provide contextual information about accidents. Leading indicators represent conditions or events that 
precede some undesirable consequence. They can be used as predictors of specific events such as 
accidents, or near misses. When used correctly, leading indicators can proactively identify hazards to 
minimize or mitigate risk.  

                                
DOI: 10.3303/CET1977047 

 
Paper Received: 15 October 2018; Revised: 18 May 2019; Accepted: 28  June  2019 

Please cite this article as: Murchison N., 2019, No Good Data Goes Unpunished, Chemical Engineering Transactions, 77, 277-282  
DOI:10.3303/CET1977047  



Together, a sustainable process for identifying otherwise unknown leading indicators related to hazards 
associated with near-miss accidents can avert accidents by understanding and mitigating risks (Franca & 
Duarte, 2018). Prevention of future unsafe scenarios and associated consequences can occur by 
understanding and learning from near misses.  This focus on a data-driven preventative culture benefits from 
near-miss data (Korkmaz & Donnelly, 2018) as a leading metric, and results in a robust safety management 
approach. Analyzing data for near misses can improve safety over and above a focus on adverse events
alone, provided that near misses and accidents share relative causal patterns and direct causal predictors; the
latter has been demonstrated in transportation safety (Wright & Van Der Schaaf, 2004), the chemical industry
(Van Der Schaaf, 1995), and healthcare research (Aspden et al., 2004).

 
Figure 1: Accident causation model (from Van Der Schaaf, 1992, p. 9) 

1.1 Near-miss data and reporting 

‘Near-miss’ is a term borrowed from aviation that denotes an adverse event that did not occur because of the 
intervention of an individual or by a fortunate evolution of the circumstances. It refers to accidents that did not 
result in harm. They are believed to be precursors of adverse events (Bier, Kunreuther, & Phimister, 2004) 
and thus share the same causal continuum (Myers et al., 2008) … They allow for analysis of risks before
consequences are observed and are more frequent than adverse events (7–100 times more frequent; Bird &
Loftus, 1976) (Cure et al., 2011, p. 738).
To supplement the low rates of accidents and their consequences, near misses and their causal factors
can be used to identify shortcomings in safety management practices (Sinelnikov et al., 2015). Combined with
an analysis of leading factors that cause accidents, analysis of near-miss data can provide opportunities to
recommend targeted mitigative strategies. Rather than adopting strategies that do not directly relate to identified
inadequacies, one should consider near misses in the context of potential related accidents when proposing
changes to current processes (Manuele, 2009). This data-driven approach results in more targeted strategies.
A robust approach to safety management using near-miss data would also include analysis of data after 
mitigative strategies are introduced to demonstrate success of those new practices.  
Prior studies have demonstrated an underlying relationship among near misses, minor accidents, and major
accidents (the majority of investigations focus on major accidents): there are considerably fewer major
accidents than minor accidents, and more near misses than either (Heinrich, 1931). Underlying all of these are
the unsafe actions and conditions which lead to accidents. Heinrich introduced the safety triangle (see
Figure 2), which represents near misses as an indicator of risk.
 



Figure 2: Safety triangle: Adapted from Heinrich’s 300-29-1 Model (Heinrich, 1931) 
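
The 300-29-1 ratio named in the caption can be read as simple proportions. The sketch below is only an illustration of that arithmetic; the function names are hypothetical, and the ratio is a historical reference point rather than a prediction for any particular site.

```python
# Heinrich's 300-29-1 ratio: near misses, minor accidents, major accidents.
HEINRICH_RATIO = {"near_miss": 300, "minor": 29, "major": 1}

def expected_counts(major_accidents):
    """Scale the 300-29-1 ratio to a given number of major accidents."""
    return {level: weight * major_accidents
            for level, weight in HEINRICH_RATIO.items()}

def share_of_near_misses():
    """Fraction of all events in the triangle that are near misses."""
    return HEINRICH_RATIO["near_miss"] / sum(HEINRICH_RATIO.values())

counts = expected_counts(2)      # ratio scaled to two major accidents
share = share_of_near_misses()   # 300 out of 330 events
```

Under this ratio, roughly nine out of ten recorded events are near misses, which is why near-miss reports offer far more data than accident reports alone.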

1.2 Data science techniques 

Learning from near-miss data requires analysis of (1) the context and circumstances under which near-miss
accidents occurred, (2) the related hazards, and (3) if mitigative strategies were employed, a description
of those strategies and their effectiveness (National Academies of Sciences, Engineering, and
Medicine, 2017). Different data science techniques can support two distinct goals. Modeling allows one to gain
insight into how and which failures lead to near-misses or accidents, with a long-term goal of identifying those 
factors which lead to adverse events and those which allow recovery from such events. Trending is used to 
identify root causes in large data sets. Over time, trending can identify the frequency of specific factors to 
promote a data-driven approach for incorporating specific interventions (Institute of Medicine, 2004).  

1.2.1 Data Mining

Data mining is used to identify otherwise unknown patterns in a provided data set. Data mining can identify 
underlying groups in data that fit naturally together or can be used to identify anomalous instances in a data 
set. Importantly, it can also be used to discover relationships, which could support the analysis of near-miss 
data, specifically when determining those circumstances and hazards that lead to accidents (or near 
accidents) and those actions that prevented an accident from occurring. For example, clustering, in which like 
accidents are grouped into like events, can be used to identify previously unknown risks in large data sets. 
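
As a minimal sketch of the clustering idea, the example below groups short report texts by word overlap (Jaccard similarity) using a greedy "leader" scheme. The reports, the 0.3 threshold, and the function names are all hypothetical, and this simple scheme stands in for whatever clustering algorithm a real study would select.

```python
import re

def tokens(report):
    """Lower-case word set of a report, ignoring words of three letters or fewer."""
    return {w for w in re.findall(r"[a-z]+", report.lower()) if len(w) > 3}

def jaccard(a, b):
    """Share of words two reports have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_reports(reports, threshold=0.3):
    """Greedy leader clustering: join the first cluster whose leader is similar enough."""
    token_sets = [tokens(r) for r in reports]
    clusters = []  # each cluster is a list of report indices; the first is the leader
    for i, ts in enumerate(token_sets):
        for cluster in clusters:
            if jaccard(ts, token_sets[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar leader found: start a new cluster
    return clusters

# Hypothetical near-miss report texts.
reports = [
    "Worker slipped on wet floor near pump station",
    "Operator slipped on wet floor by pump skid",
    "Pressure relief valve lifted during startup",
    "Relief valve lifted at high pressure during startup",
]
clusters = cluster_reports(reports)
```

Here the two slip reports and the two relief-valve reports end up in separate groups. The result depends on report order and on the threshold, so a production analysis would normally use a more robust method.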

1.2.1.1 Text Analytics

Text analysis, including natural language processing, is used to extract information from textual sources, such 
as data from near-miss reporting systems. Data could be analyzed to complete a root cause analysis or 
determine leading indicators such as circumstances or hazards which precede near misses in text data.   
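
One very simple form of text analysis is counting how many reports mention each hazard term. The sketch below assumes a small, hypothetical hazard vocabulary and invented report texts; a real study would derive the vocabulary from coded reports and would likely use fuller natural language processing.

```python
import re
from collections import Counter

# Hypothetical hazard vocabulary, for illustration only.
HAZARD_TERMS = {"leak", "spill", "slip", "overpressure", "fire"}

def hazard_frequencies(reports):
    """Count in how many reports each hazard term appears at least once."""
    counts = Counter()
    for report in reports:
        words = set(re.findall(r"[a-z]+", report.lower()))
        counts.update(words & HAZARD_TERMS)  # each term counted once per report
    return counts

# Hypothetical near-miss report texts.
reports = [
    "Small leak found at flange before startup; no spill.",
    "Leak at pump seal contained by operator.",
    "Near slip on oily walkway.",
]
freq = hazard_frequencies(reports)
```

Note that plain keyword matching counts "no spill" as a mention of a spill. Handling negation and word variants is one reason careful coding of the text matters.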

1.2.2 Predictive Analysis

The goal of predictive analysis is to determine patterns in a given data set to predict future outcomes. 
Regarding near-miss data analysis, predictive analytics can be used in one of two ways. The first is to identify
those circumstances that lead to near-miss accidents in order to predict when future accidents may occur; similarly,
predictive analytics can support the identification of mitigative actions that prevented an accident from 
occurring. The second is to test hypotheses about specific mitigations and explore the effectiveness of such.  

1.2.2.1 Machine Learning

Machine learning techniques are also used for their predictive power. These techniques rely on a machine
learning the patterns of interest and identifying them when new data come in. Machine learning models
can be developed in different ways: (1) a supervised model is trained on labeled examples and is refined over
time as more data are analyzed, with the accuracy of its predictions used to update the model; (2) an
unsupervised model searches for hidden patterns in data in the absence of labeled cases; (3) a reinforcement
model is optimized based upon a reward signal, but does not have explicitly stated examples. Machine learning
can be used to categorize data into distinct groups, such as separating accident from non-accident data, to
identify otherwise unknown preconditions of near-misses in large data sets.
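
To make the supervised case concrete, the sketch below trains a small multinomial naive Bayes classifier to separate near-miss from accident reports. The training texts, labels, and class name are hypothetical, and naive Bayes is chosen here only because it is compact; it is one of many possible supervised models.

```python
import math
import re
from collections import Counter, defaultdict

def words(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(words(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior of the label
            for w in words(text):
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical labeled reports.
train_texts = [
    "valve leak stopped by operator before release",
    "alarm acknowledged and pump isolated in time",
    "leak ignited causing fire and injury",
    "worker injured after release and fire",
]
train_labels = ["near_miss", "near_miss", "accident", "accident"]
model = NaiveBayes().fit(train_texts, train_labels)
```

With four training reports this is a toy. In practice the training set would need to be far larger, and accuracy would be checked on held-out reports.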



1.2.2.2 Bayesian Methods

Bayesian analysis can be used to perform both inferential and predictive analyses, which can be leveraged 
when multiple data sources are combined. It is used to answer questions about unidentified parameters using 
probability statements. Bayesian methods indicate a state of belief via an underlying probability distribution;
the analysis relies on a likelihood function, which captures the information the data provide about the parameters,
and a prior distribution, which represents what is known about those parameters before the data are observed.
From these, a posterior distribution can be determined which combines the prior with the observed data and
characterizes the total information that is known.
Advantages and disadvantages of each of the aforementioned analytical techniques can be found in Table 1.
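
The prior-to-posterior step described above can be illustrated with the simplest conjugate case, a Beta prior on a near-miss probability updated with binomial data. The numbers and function names below are hypothetical, and this is a sketch of the general idea rather than a model taken from any cited study.

```python
def beta_binomial_update(prior_a, prior_b, events, trials):
    """Conjugate update: Beta(a, b) prior plus binomial data gives a Beta posterior."""
    post_a = prior_a + events
    post_b = prior_b + (trials - events)
    return post_a, post_b

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical numbers: the prior encodes a belief of about 1 near miss
# per 10 operations; the data are 3 near misses observed in 50 operations.
a, b = beta_binomial_update(prior_a=1, prior_b=9, events=3, trials=50)
posterior_mean = beta_mean(a, b)
```

The posterior mean (4/60) falls between the prior mean (0.10) and the observed rate (0.06), which shows how the choice of prior pulls the result when data are sparse.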

Table 1: Advantages and disadvantages of analytical techniques 

Analytical approach: Data Mining
Advantages:
• Output represents applicable knowledge for informed decisions
• Output includes patterns of important factors that can be used to make necessary adjustments
• Data mining allows detection of otherwise unknown risks
Disadvantages:
• There are scalability issues
• Data can be misinterpreted and, hence, misused

Analytical approach: Text Analytics (Data Mining)
Advantages:
• Output can quantify frequency of specific types of occurrences
• Systematic procedures promote high reliability and replicability
• Suitable for large datasets of text
Disadvantages:
• Careful coding of initial data is crucial for categories to be analyzed correctly
• Inaccurate coding nullifies the results

Analytical approach: Predictive Analysis
Advantages:
• Analysis uses past data to estimate potential risk
• Pattern recognition supports prevention of accidents from re-occurring
• Results in a more reliable interpretation of conditions leading to accidents
Disadvantages:
• Biases in how data are interpreted can be an issue
• Spurious correlations can occur with large data sets

Analytical approach: Machine Learning (Predictive Analysis)
Advantages:
• Model can learn relevant features with time
• Can discover features in large data sets rather than requiring hand-coding of such features
• Can find optimal settings for numerous parameters in a large data set
Disadvantages:
• There are instances where machine learning can fail and a human analyst must have some understanding of the issue(s)
• Machine learning requires a large dataset so is not ideal for industries with sparse data
• Training datasets must be large

Analytical approach: Bayesian Methods (Predictive Analysis)
Advantages:
• Naturally combines prior knowledge with new data
• Inferences logically follow Bayes’ theorem
• Can handle large and small sample sizes
Disadvantages:
• There is no absolute way to select the prior distribution
• Prior beliefs can be subjective and lead to deceptive output
• Posterior distributions can be influenced by prior distributions, which can lead to questions regarding validity

2. Analysis of near-miss data using data science techniques 

Much research has been done to develop methodologies to analyze near-miss reports. Three examples of 
how data science techniques have been used to analyze near-miss data are discussed below.   

2.1 Analysis using clustering 

Cure et al. (2011) proposed a methodology which involved clustering near-miss reports to support risk
analysis. The methodology results in ranked clusters of near-miss accidents, which support the identification
of obvious and, more importantly, harder-to-identify risks. The technique also leverages prior knowledge of
risks that exist in the system. Clustering involves grouping near-miss report data such that each report is
more closely related to the other reports in its cluster than to reports outside the cluster.



Importantly, the sources of risk related to each cluster were also characterized. The ranks then represented
the frequency of risk, which supported the implementation of strategies to reduce those risks, prioritized
according to that frequency.
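
Ranking clusters by how often they occur can be sketched in a few lines. The cluster labels below are hypothetical, and the sketch shows only the frequency-ranking step; it does not reproduce the methodology of Cure et al. (2011).

```python
from collections import Counter

def rank_clusters(cluster_labels):
    """Return (cluster, count, share) tuples, most frequent cluster first."""
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    return [(c, n, n / total) for c, n in counts.most_common()]

# Hypothetical cluster assignments for eight near-miss reports.
labels = ["medication", "fall", "medication", "equipment",
          "medication", "fall", "medication", "equipment"]
ranking = rank_clusters(labels)
```

The top-ranked cluster would then be the first candidate for a targeted mitigation, with ties resolved by a separate criterion such as potential severity.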

2.2 Analysis using text analysis coupled with Bayesian modeling 

Walter and Bowers (2018) coupled multiple techniques, including text analysis and Bayesian modeling to 
analyze near-miss reports. Figure 3 represents their methodology, which provided a more holistic view to 
promote a superior safety culture (Walter and Bowers, 2018). Near-miss data is often captured as text, 
possibly supplemented by accident rates. Natural language processing supports the construction of an accident
profile in which different accident types can be prioritized according to their frequency. Additionally, the data
can be analyzed using a Bayesian multilevel model to rank them more accurately and direct resources appropriately. Taken
together, a data-driven safety strategy is realized such that accident type is related to riskiest 
operations/locations/job types/etc. to facilitate the design of specific strategies for risk prevention.  
 

Figure 3: Big data safety strategy (Walter & Bowers, 2018, p. 9) 
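
One reason a multilevel model can rank more accurately is that it pulls noisy rates from small units toward the overall rate. The sketch below is a simplified stand-in for that effect, using a fixed, hypothetical prior strength; it is not a full Bayesian multilevel model and not the model used by Walter and Bowers (2018). The site names and counts are invented.

```python
def shrunk_rates(sites, prior_strength=100.0):
    """Pull each site's raw incident rate toward the pooled rate.

    sites: {name: (incidents, exposure_hours)}
    prior_strength: hypothetical pseudo-hours of exposure assigned to the pooled rate.
    """
    total_incidents = sum(i for i, _ in sites.values())
    total_hours = sum(h for _, h in sites.values())
    pooled = total_incidents / total_hours
    return {
        name: (i + pooled * prior_strength) / (h + prior_strength)
        for name, (i, h) in sites.items()
    }

# Hypothetical sites: A has very little exposure, B and C have a lot.
sites = {"A": (1, 10), "B": (30, 1000), "C": (2, 990)}
raw = {name: i / h for name, (i, h) in sites.items()}
rates = shrunk_rates(sites)
raw_ranking = sorted(raw, key=raw.get, reverse=True)
ranking = sorted(rates, key=rates.get, reverse=True)
```

On raw rates, site A ranks first on the strength of a single incident in ten hours. After shrinkage, site B ranks first because its rate is supported by much more exposure.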

2.3 Predictive Analysis using Bayesian modeling 

Myers et al. (2008) developed a procedure to assess the similarity between causes and contributing factors in 
near-miss data. The approach correlated the causes of error reported when an accident occurred with those
reported when a near-miss transpired. From there, a Bayesian hierarchical model supported learning by quantifying the
evidence in a given near-miss report related to contributing factors for a particular near-miss event. For each
near-miss report, the model relied on knowledge of the cause of the event and the consequence. The
technique depended upon the correlation as a measure of evidence for the causal continuum hypothesis
(underlying causes of harm are shared between near-miss and accident events). From this analysis, Myers et
al. (2008) crafted an evidence-based approach which identified the causes most associated with specific 
consequences within the near-miss datasets, thus allowing for the design of preventative strategies.  
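
The correlation step can be illustrated by comparing how often each cause appears in near-miss reports and in accident reports. The cause names and counts below are hypothetical, and a plain Pearson correlation is used as an assumption; it does not reproduce the Bayesian hierarchical model of Myers et al. (2008).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical counts of each reported cause in near-miss and accident reports.
causes = ["distraction", "labeling", "workload", "handoff"]
near_miss_counts = [40, 25, 20, 15]
accident_counts = [8, 6, 3, 3]
r = pearson(near_miss_counts, accident_counts)
```

A coefficient close to 1 would be consistent with the causal continuum hypothesis for these invented counts, while a value near 0 would suggest that near misses and accidents arise from different causes.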

3. Future work 

The techniques reviewed in this paper warrant further study. Specifically, the next step is to identify a proper 
dataset and implement the approaches covered in this review. The results obtained would then be compared 
across the different methods to capture the effectiveness of each in terms of analyzing near-miss data.   
Additionally, future work should evaluate the effectiveness of mitigative strategies according to some pre-
determined standard. One way to accomplish this is to establish baseline rates for specific types of accidents 
or near-miss accidents and compare the reduction after interventions have been put into place. Potential 
drawbacks to this may include very low accident rates (though grave consequences should one occur) or 
near-misses not identified in the data (thus representing a partial view of near-misses and preconditions of 
such).  
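
The baseline comparison described above amounts to computing event rates before and after an intervention and the relative change between them. The sketch below uses hypothetical counts and exposure hours, and the per-1000-hours normalization is an assumption chosen for illustration.

```python
def rate_per_1000_hours(events, hours):
    """Event rate normalized to 1000 exposure hours."""
    return 1000.0 * events / hours

def relative_reduction(baseline_rate, post_rate):
    """Fractional drop from baseline; negative values mean the rate went up."""
    return (baseline_rate - post_rate) / baseline_rate

# Hypothetical near-miss counts before and after an intervention.
baseline = rate_per_1000_hours(events=12, hours=40_000)
post = rate_per_1000_hours(events=6, hours=50_000)
reduction = relative_reduction(baseline, post)
```

With counts this small, an apparent reduction can arise by chance, and a change in reporting behavior can also shift the rate. Both are instances of the drawbacks noted above.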



4. Summary and conclusions 

Analysis of near-miss data is an established practice across various industries. Techniques used in support of 
that analysis dictate the type of information that will be revealed. Typical analyses characterize the types of risk and
the causal factors associated with those risks. Techniques may also include approaches to better assign mitigative
techniques to circumvent accidents based upon the results of near-miss analyses.
As in industry, techniques such as a Bayesian model and text analysis can be coupled to complement each
other and support the identification of otherwise unknown near-miss events. Accompanying this effort with an
assessment of success metrics allows a comprehensive, data-driven approach to promoting safety culture. 

Acknowledgments 

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and 
Engineering Solutions of Sandia LLC, a wholly owned subsidiary of Honeywell International Inc. for the U.S. 
Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. SAND2018-
11105 C. This paper describes objective technical results and analysis. Any subjective views or opinions that 
might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or 
the United States Government. 

References 

Aspden, P., Corrigan, J. M., Wolcott, J., & Erickson, S. M., 2004, Patient safety: Achieving a new standard for care,
National Academies Press, Washington, DC, USA.

Barach, P., & Small, S. D., 2000, Reporting and preventing medical mishaps: lessons from non-medical near
miss reporting systems, BMJ: British Medical Journal, 320, 759.

Bier, V. M., Kunreuther, H., & Phimister, J. R., 2004, Accident precursor analysis and management: reducing
technological risk through diligence, National Academies Press, Washington, DC, USA.

Bird, F. E., & Loftus, R. G., 1976, Loss control management, Institute Press.

Cure, L., Zayas-Castro, J., & Fabri, P., 2011, Clustering-based methodology for analyzing near-miss reports
and identifying risks in healthcare delivery, Journal of Biomedical Informatics, 44, 738-748.

Erickson, S. M., Wolcott, J., Corrigan, J. M., & Aspden, P. (Eds.), 2003, Patient safety: Achieving a new
standard for care, National Academies Press, Washington, DC, USA.

França, S., & Duarte, T., 2018, Learning from experience: Preventive potential of near miss deep
understanding and technical analysis - Experiences and concepts from Braskem (Poster 54at), Loss
Prevention Conference, Konzerthaus Freiburg, Germany.

Heinrich, H. W., 1931, Industrial accident prevention: A scientific approach, McGraw-Hill, New York, NY.

Institute of Medicine, 2004, Patient Safety: Achieving a New Standard for Care, National Academies Press,
Washington, DC, USA.

Korkmaz, S., & Donnelly, J., 2018, Do not investigate. Learn. Ask how! (Poster 43b), Loss Prevention
Conference, Konzerthaus Freiburg, Germany.

National Academies of Sciences, Engineering, and Medicine, 2017, Strengthening data science methods for
Department of Defense personnel and readiness missions, National Academies Press, Washington, DC,
USA.

Manuele, F. A., 2009, Leading & lagging indicators, Professional Safety, 54, 28.

Myers, J. A., Dominici, F., & Morlock, L., 2008, Learning from near misses in medication errors: A Bayesian
approach, Johns Hopkins University, Dept of Biostatistics Working Papers, p. 178.

Sinelnikov, S., Inouye, J., & Kerper, S., 2015, Using leading indicators to measure occupational health and
safety performance, Safety Science, 72, 240-248.

Van Der Schaaf, T. W., 1995, Near miss reporting in the chemical process industry: an overview, Microelectron
Reliab, 35, 1233–1243.

Walter, G., & Bowers, K., 2018, New Concepts for a Big Data Safety Strategy: A Big Data case study from
Owens Corning, Campbell Institute, National Safety Council
<www.thecampbellinstitute.org/wp-content/uploads/2018/02/Campbell-Institute_Executive-Insights_Big-Data-for-Safety-Strategy.pdf>
accessed 21.09.2018.

Wright, L., & Van Der Schaaf, T. W., 2004, Accident versus near miss causation: A critical review of the
literature, an empirical test in the UK railway domain, and their implications for other sectors, J Hazard
Mater, 111, 105–110.
