CHEMICAL ENGINEERING TRANSACTIONS VOL. 77, 2019
A publication of The Italian Association of Chemical Engineering
Online at www.cetjournal.it
Guest Editors: Genserik Reniers, Bruno Fabiano
Copyright © 2019, AIDIC Servizi S.r.l.
ISBN 978-88-95608-74-7; ISSN 2283-9216

No Good Data Goes Unpunished

Nicole Murchison
Sandia National Laboratories, P.O. Box 5800, Albuquerque, NM 87185
nmurchi@sandia.gov

Data science includes a variety of scientific methods and processes to extract data from various sources. The integration of fields such as mathematics, statistics, information science, and computer science affords techniques to analyze large volumes of data, arrive at unique insights, and make data-driven decisions (Sinelnikov et al., 2015) in real time. These techniques lend themselves to applications across many domains, including hazard assessments, analysis of near-miss data, identification of leading and lagging indicators from past accidents, and others. Benefits include efficiency due to improved data acquisition. Near-miss data represents an important source for identifying the conditions that lead to accidents and for developing strategies to prevent them. Analysis of near-miss data sets can involve various techniques. This paper will explore the use of data science to mine accident reports, with a special emphasis on near misses, to uncover occurrences that were not initially identified in the documentation. Data science techniques such as text analysis facilitate searching large volumes of data to uncover patterns for more informed decisions. Regarding near-miss data, data science techniques can be used to test the ability to uncover new hazards or hazardous preconditions and the accuracy of those findings. Alongside the benefits of processing large data sets and uncovering new hazards, the paper also considers how these techniques might influence safety culture.

1. Introduction

Failure can come about by technical, human, or organizational means which lead to dangerous circumstances. If some defense adequately prevents any consequence from occurring, there is no accident. If that is not the case, the conditions for an accident exist. Whether there is an accident then depends upon human action or some other preventative or mitigative strategy. If a strategy or mitigation is employed, the event is characterized as a near miss; should there be no intervention, the result is an accident with varying levels of consequence severity. In other words, a near miss is an event that could have resulted in mild to severe consequences but did not, owing to some intervention or to chance (Aspden et al., 2004). Figure 1 demonstrates how near misses occur, as well as how they are related to adverse events.

Near-miss data enables the identification of hazards or circumstances by which accidents occur and is invaluable because, by analyzing such data, organizations can reduce risk by understanding (1) the circumstances and hazards related to that risk and (2) those interventions or mitigations that prevented an accident from occurring.

Across industries including aviation, nuclear power, healthcare, military operations, and others, the importance of near-miss data analysis has been demonstrated (Barach & Small, 2000). The Three Mile Island disaster resulted in a reporting system for near misses in the nuclear power industry and intensified the focus on safety and safety outcomes. In aviation, near-miss data has been analyzed to improve air safety through system monitoring and a reporting system to which individuals contribute specific information. As in nuclear power, near-miss data can augment accident data in aviation, where accidents are exceedingly infrequent, and provide contextual information about accidents.
Leading indicators represent conditions or events that precede some undesirable consequence. They can be used as predictors of specific events such as accidents or near misses. When used correctly, leading indicators can proactively identify hazards to minimize or mitigate risk. Taken together, a sustainable process for identifying otherwise unknown leading indicators of the hazards associated with near misses can help avert accidents by supporting the understanding and mitigation of risk (França & Duarte, 2018).

Prevention of future unsafe scenarios and associated consequences can occur by understanding and learning from near misses. This focus on a data-driven preventative culture benefits from near-miss data (Korkmaz & Donnelly, 2018) as a leading metric, and results in a robust safety management approach. Analyzing data for near misses can improve safety over and above a focus on adverse events alone, provided that near misses and accidents share direct causal predictors (demonstrated in transportation safety, Wright & Van Der Schaaf, 2004; the chemical industry, Van Der Schaaf, 1995; and healthcare research, Aspden et al., 2004) and relative causal patterns.

DOI: 10.3303/CET1977047
Paper Received: 15 October 2018; Revised: 18 May 2019; Accepted: 28 June 2019
Please cite this article as: Murchison N., 2019, No Good Data Goes Unpunished, Chemical Engineering Transactions, 77, 277-282 DOI:10.3303/CET1977047

Figure 1: Accident causation model (from Van Der Schaaf, 1992, p. 9)

1.1 Near-miss data and reporting

‘Near-miss’ is a term borrowed from aviation that denotes an adverse event that did not occur because of the intervention of an individual or by a fortunate evolution of the circumstances. It refers to accidents that did not result in harm.
They are believed to be precursors of adverse events (Bier, Kunreuther, & Phimister, 2004) and thus share the same causal continuum (Myers et al., 2008) …They allow for analysis of risks before consequences are observed and are more frequent than adverse events (7–100 times more frequent (Bird & Loftus, 1976)) (Cure et al., 2011, p. 738).

Because accidents and their consequences occur at low rates, near misses and their causal factors can supplement accident data to identify shortcomings in safety management practices (Sinelnikov et al., 2015). Combined with an analysis of leading factors that cause accidents, analysis of near-miss data can provide opportunities to recommend targeted mitigative strategies. Rather than adopting strategies which do not directly relate to identified inadequacies, one should consider near misses in the context of potential related accidents to propose changes to current processes (Manuele, 2009). This data-driven approach results in more targeted strategies. A robust approach to safety management using near-miss data would also include analysis of data after mitigative strategies are introduced, to demonstrate the success of those new practices.

Prior studies have demonstrated an underlying relationship between near misses and accidents with minor and major consequences (the majority of investigations focus on major accidents): there are considerably fewer major accidents than minor accidents, and more near misses than either (Heinrich, 1931). Underlying all of these are the unsafe actions and conditions which lead to accidents. Heinrich introduced the safety triangle (see Figure 2), which is a representation of near misses as an indicator of risk.
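The proportions implied by the safety triangle can be made concrete with a short calculation. The following is a minimal sketch, assuming the 300-29-1 ratio of Heinrich (1931) with the 300 tier read as near misses, as in the text; the helper names are hypothetical, and the ratio is used only as a rough heuristic, not as a validated predictive model.

```python
# Heinrich's 300-29-1 ratio: near misses, minor-consequence accidents, and
# major-consequence accidents. Treated here as a rough heuristic only.
HEINRICH_RATIO = {"near_miss": 300, "minor": 29, "major": 1}


def expected_counts(major_accidents):
    """Scale the ratio to a given number of major accidents (hypothetical helper)."""
    return {kind: weight * major_accidents for kind, weight in HEINRICH_RATIO.items()}


def share_of_events(kind):
    """Fraction of all events in the triangle that fall into one tier."""
    return HEINRICH_RATIO[kind] / sum(HEINRICH_RATIO.values())


if __name__ == "__main__":
    print(expected_counts(2))  # {'near_miss': 600, 'minor': 58, 'major': 2}
    print(round(share_of_events("near_miss"), 3))  # 0.909
```

Under this ratio roughly nine in ten recorded events are near misses, which is why they offer far more data to analyze than accidents alone.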
Figure 2: Safety triangle, adapted from Heinrich’s 300-29-1 model (Heinrich, 1931)

1.2 Data science techniques

Learning from near-miss data requires analysis related to (1) the context and circumstances under which near misses occurred, (2) the related hazards, and (3) if mitigative strategies were employed, a description of those strategies and their effectiveness (National Academies of Sciences, Engineering, and Medicine, 2017). Different data science techniques can support two distinct goals: modeling and trending. Modeling allows one to gain insight into how and which failures lead to near misses or accidents, with a long-term goal of identifying those factors which lead to adverse events and those which allow recovery from such events. Trending is used to identify root causes in large data sets. Over time, trending can identify the frequency of specific factors to promote a data-driven approach for incorporating specific interventions (Institute of Medicine, 2004).

1.1.1 Data Mining

Data mining is used to identify otherwise unknown patterns in a provided data set. It can identify underlying groups in data that fit naturally together, or it can be used to identify anomalous instances in a data set. Importantly, it can also be used to discover relationships, which could support the analysis of near-miss data, specifically when determining the circumstances and hazards that lead to accidents (or near accidents) and the actions that prevented an accident from occurring. For example, clustering, in which similar accidents are grouped together, can be used to identify previously unknown risks in large data sets.

1.1.1.1 Text Analytics

Text analysis, including natural language processing, is used to extract information from textual sources, such as data from near-miss reporting systems. Data could be analyzed to complete a root cause analysis or to determine leading indicators, such as circumstances or hazards which precede near misses, in text data.
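As an illustration of how text analysis might surface candidate leading indicators, the following is a minimal sketch using only the Python standard library. The report texts, the hazard vocabulary, and the function names are all hypothetical; a real study would draw reports from a near-miss reporting system and would likely use a fuller natural language processing pipeline.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical free-text near-miss reports (illustrative only).
REPORTS = [
    "Operator noticed a leaking valve before startup and isolated the line.",
    "Valve left open after maintenance; alarm alerted the operator in time.",
    "Forklift nearly struck a pedestrian in the warehouse aisle.",
    "Leaking flange found during inspection, no release occurred.",
]

# Hypothetical vocabulary of hazard-related terms chosen by an analyst.
HAZARD_TERMS = {"valve", "leaking", "forklift", "alarm", "flange"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_report_counts(reports, terms):
    """Count the number of reports in which each hazard term appears."""
    counts = Counter()
    for report in reports:
        counts.update(set(tokenize(report)) & terms)
    return counts


def term_cooccurrence(reports, terms):
    """Count how often two hazard terms appear in the same report."""
    pairs = Counter()
    for report in reports:
        present = sorted(set(tokenize(report)) & terms)
        pairs.update(combinations(present, 2))
    return pairs


if __name__ == "__main__":
    print(term_report_counts(REPORTS, HAZARD_TERMS))
    print(term_cooccurrence(REPORTS, HAZARD_TERMS))
```

Counting per report rather than per word keeps one long report from dominating the frequencies, and the co-occurrence pairs hint at circumstances that tend to appear together.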
1.1.2 Predictive Analysis

The goal of predictive analysis is to determine patterns in a given data set to predict future outcomes. Regarding near-miss data analysis, predictive analytics can be used in one of two ways. The first is to identify the circumstances that lead to near misses in order to predict when future accidents may occur; similarly, predictive analytics can support the identification of mitigative actions that prevented an accident from occurring. The second is to test hypotheses about specific mitigations and explore their effectiveness.

1.1.2.1 Machine Learning

Machine learning techniques are also used for their predictive power. These techniques rely on a machine learning the patterns of interest and identifying them when new data arrive. Machine learning models can be developed in different ways: (1) a supervised model learns from labeled cases and is refined over time as more data are analyzed, with its accuracy on those data used to update the predictive model; (2) an unsupervised model looks for hidden patterns in data in the absence of labeled cases; (3) a reinforcement model is optimized based upon a reward signal rather than explicitly stated examples. Machine learning can be used to categorize data into distinct groups, such as separating accident from non-accident data, to identify otherwise unknown preconditions of near misses in large data sets.

1.1.2.2 Bayesian Methods

Bayesian analysis can be used to perform both inferential and predictive analyses, which can be leveraged when multiple data sources are combined. It is used to answer questions about unknown parameters using probability statements. Bayesian methods express a state of belief through a probability distribution; the analysis relies on a likelihood function, which captures the information the data provide about the parameters, and a prior distribution, which represents what is already known about those parameters.
From this, a posterior distribution can be determined which relies on previously observed data and characterizes the total information that is known. Advantages and disadvantages of each of the aforementioned analytical techniques can be found in Table 1.

Table 1: Advantages and disadvantages of analytical techniques

Data Mining
Advantages:
• Output represents applicable knowledge for informed decisions
• Output includes patterns of important factors that can be used to make necessary adjustments
• Data mining allows detection of otherwise unknown risks
Disadvantages:
• There are scalability issues
• Data can be misinterpreted and, hence, misused

Text Analytics
Advantages:
• Output can quantify frequency of specific types of occurrences
• Systematic procedures promote high reliability and replicability
• Suitable for large datasets of text
Disadvantages:
• Careful coding of initial data is crucial for categories to be analyzed correctly
• Inaccurate coding nullifies the results

Predictive Analysis
Advantages:
• Analysis uses past data to estimate potential risk
• Pattern recognition supports prevention of accidents from reoccurring
• Results in a more reliable interpretation of conditions leading to accidents
Disadvantages:
• Biases in how data are interpreted can be an issue
• Spurious correlations can occur with large data sets

Machine Learning
Advantages:
• Model can learn relevant features with time
• Can discover features in large data sets rather than requiring the hand-coding of such features
• Can find optimal settings for numerous parameters in a large data set
Disadvantages:
• There are instances where machine learning can fail and a human analyst must have some understanding of the issue(s)
• Machine learning requires a large dataset so is not ideal for industries with sparse data
• Training datasets must be large

Bayesian Methods
Advantages:
• Naturally combines prior knowledge with new data
• Inferences logically follow Bayes’ theorem
• Can handle large and small sample sizes
Disadvantages:
• There is no absolute way to select the prior distribution
• Prior beliefs can be subjective and lead to deceptive output
• Posterior distributions can be influenced by prior distributions, which can lead to questions regarding validity

2. Analysis of near-miss data using data science techniques

Much research has been done to develop methodologies to analyze near-miss reports. Three examples of how data science techniques have been used to analyze near-miss data are discussed below.

2.1 Analysis using clustering

Cure et al. (2011) proposed a methodology which involved clustering near-miss reports to support risk analysis. The methodology resulted in ranked clusters of near misses, which supported the identification of obvious and, more importantly, harder-to-identify risks. The technique also leveraged prior knowledge of risks that existed in the system. Clustering involves grouping near-miss report data such that each report is more closely related to the other reports in its cluster than to reports outside it. Importantly, the sources of risk related to each cluster were also characterized. The ranks then represented the frequency of each risk, which supported prioritizing risk-reduction strategies according to that frequency.

2.2 Analysis using text analysis coupled with Bayesian modeling

Walter and Bowers (2018) coupled multiple techniques, including text analysis and Bayesian modeling, to analyze near-miss reports. Figure 3 represents their methodology, which provided a more holistic view to promote a superior safety culture (Walter & Bowers, 2018). Near-miss data is often captured as text, possibly supplemented by accident rates. Natural language processing supports building an accident profile so that different accident types can be prioritized according to their frequency. Additionally, the data can be analyzed using a Bayesian multilevel model to rank them more accurately and direct resources appropriately.
Taken together, these steps realize a data-driven safety strategy in which accident types are related to the riskiest operations, locations, job types, and so on, to facilitate the design of specific strategies for risk prevention.

Figure 3: Big data safety strategy (Walter & Bowers, 2018, p. 9)

2.3 Predictive Analysis using Bayesian modeling

Myers et al. (2008) developed a procedure to assess the similarity between causes and contributing factors in near-miss data. The approach correlated the causes of error reported when an accident occurred with those reported when a near miss transpired. From there, a Bayesian hierarchical model supported learning by quantifying the evidence in a given near-miss report related to contributing factors for a particular near-miss event. For each near-miss report, the model relied on knowledge of the cause of the event and the consequence. The technique depended on the correlation as a measure of evidence for the causal continuum hypothesis (that the underlying causes of harm are shared between near-miss and accident events). From this analysis, Myers et al. (2008) crafted an evidence-based approach which identified the causes most associated with specific consequences within the near-miss datasets, thus allowing for the design of preventative strategies.

3. Future work

The techniques reviewed in this paper warrant further study. Specifically, the next step is to identify a proper dataset and implement the approaches covered in this review. The results obtained would then be compared across the different methods to capture the effectiveness of each in analyzing near-miss data. Additionally, future work should evaluate the effectiveness of mitigative strategies according to some predetermined standard. One way to accomplish this is to establish baseline rates for specific types of accidents or near misses and compare the reduction after interventions have been put into place.
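The baseline comparison described above could be carried out in a Bayesian way, for example with a conjugate Gamma-Poisson model. The following is a minimal sketch and not a method taken from the reviewed studies: it assumes that monthly near-miss counts are Poisson distributed with a Gamma prior on the rate, and the prior parameters, the counts, and the function names are all hypothetical.

```python
import random


def gamma_posterior(prior_shape, prior_rate, events, exposure):
    """Conjugate update: Gamma(shape, rate) prior plus Poisson counts.

    `events` is the number of near misses observed over `exposure` months.
    """
    return prior_shape + events, prior_rate + exposure


def prob_rate_reduced(before, after, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_after < rate_before).

    `before` and `after` are (shape, rate) pairs of Gamma posteriors.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        # random.gammavariate takes (shape, scale), where scale = 1 / rate.
        rate_before = rng.gammavariate(before[0], 1.0 / before[1])
        rate_after = rng.gammavariate(after[0], 1.0 / after[1])
        hits += rate_after < rate_before
    return hits / draws


if __name__ == "__main__":
    # Hypothetical data: 40 near misses in the 12 months before an
    # intervention and 18 in the 12 months after it.
    prior = (1.0, 1.0)  # assumed weak prior: about one event per month
    before = gamma_posterior(*prior, events=40, exposure=12)
    after = gamma_posterior(*prior, events=18, exposure=12)
    print("posterior mean before:", before[0] / before[1])
    print("posterior mean after:", after[0] / after[1])
    print("P(rate reduced):", prob_rate_reduced(before, after))
```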
Potential drawbacks to this approach include very low accident rates (though with grave consequences should an accident occur) and near misses that are not captured in the data (thus providing only a partial view of near misses and their preconditions).

4. Summary and conclusions

Analysis of near-miss data is an established practice across various industries. The techniques used in support of that analysis dictate the type of information that will be revealed. Typical analyses characterize the types of risk and the causal factors associated with those risks. Techniques may also include approaches to better target mitigative strategies that avert accidents, based upon the results of near-miss analyses. As in industry, many techniques, such as a Bayesian model coupled with text analysis, can complement each other to support the identification of otherwise unknown near-miss events. Accompanying this effort with an assessment of success metrics allows a comprehensive, data-driven approach to promoting safety culture.

Acknowledgments

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia LLC, a wholly owned subsidiary of Honeywell International Inc. for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. SAND2018-11105 C. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

References

Aspden, P., Corrigan, J. M., Wolcott, J., & Erickson, S. M., 2004, Patient safety: Achieving a new standard for care, National Academies Press, Washington, DC, USA.
Barach, P., & Small, S. D., 2000, Reporting and preventing medical mishaps: Lessons from non-medical near miss reporting systems, BMJ: British Medical Journal, 320, 759.
Bier, V. M., Kunreuther, H., & Phimister, J. R., 2004, Accident precursor analysis and management: Reducing technological risk through diligence, National Academies Press, Washington, DC, USA.
Bird, F. E., & Loftus, R. G., 1976, Loss control management, Institute Press.
Cure, L., Zayas-Castro, J., & Fabri, P., 2011, Clustering-based methodology for analyzing near-miss reports and identifying risks in healthcare delivery, Journal of Biomedical Informatics, 44, 738-748.
Erickson, S. M., Wolcott, J., Corrigan, J. M., & Aspden, P. (Eds.), 2003, Patient safety: Achieving a new standard for care, National Academies Press, Washington, DC, USA.
França, S., & Duarte, T., 2018, Learning from experience: Preventive potential of near miss deep understanding and technical analysis - Experiences and concepts from Braskem (Poster 54at), Loss Prevention Conference, Konzerthaus Freiburg, Germany.
Heinrich, H. W., 1931, Industrial accident prevention: A scientific approach, McGraw-Hill, New York, NY.
Institute of Medicine, 2004, Patient safety: Achieving a new standard for care, National Academies Press, Washington, DC, USA.
Korkmaz, S., & Donnelly, J., 2018, Do not investigate. Learn. Ask how! (Poster 43b), Loss Prevention Conference, Konzerthaus Freiburg, Germany.
Manuele, F. A., 2009, Leading & lagging indicators, Professional Safety, 54, 28.
Myers, J. A., Dominici, F., & Morlock, L., 2008, Learning from near misses in medication errors: A Bayesian approach, Johns Hopkins University, Dept of Biostatistics Working Papers, p. 178.
National Academies of Sciences, Engineering, and Medicine, 2017, Strengthening data science methods for Department of Defense personnel and readiness missions, National Academies Press, Washington, DC, USA.
Sinelnikov, S., Inouye, J., & Kerper, S., 2015, Using leading indicators to measure occupational health and safety performance, Safety Science, 72, 240-248.
Van Der Schaaf, T. W., 1995, Near miss reporting in the chemical process industry: An overview, Microelectron Reliab, 35, 1233–1243.
Walter, G., & Bowers, K., 2018, New concepts for a Big Data safety strategy: A Big Data case study from Owens Corning, Campbell Institute, National Safety Council, accessed 21.09.2018.
Wright, L., & Van Der Schaaf, T. W., 2004, Accident versus near miss causation: A critical review of the literature, an empirical test in the UK railway domain, and their implications for other sectors, J Hazard Mater, 111, 105–110.