DOI: 10.3303/CET2291067 Paper Received: 1 March 2022; Revised: 24 March 2022; Accepted: 11 May 2022 Please cite this article as: Nakhal Akel A.J., Hovstad J., Ruth M., Parmeggiani S., Patriarca R., Paltrinieri N., 2022, A Machine Learning Approach to Analyze Natural Hazards Accidents Scenarios, Chemical Engineering Transactions, 91, 397-402 DOI:10.3303/CET2291067 CHEMICAL ENGINEERING TRANSACTIONS VOL. 91, 2022 A publication of The Italian Association of Chemical Engineering Online at www.cetjournal.it Guest Editors: Valerio Cozzani, Bruno Fabiano, Genserik Reniers Copyright © 2022, AIDIC Servizi S.r.l. ISBN 978-88-95608-89-1; ISSN 2283-9216 A Machine Learning Approach to Analyze Natural Hazards Accidents Scenarios Antonio Javier Nakhal Akel*a, Janna S. Hovstadb, Mathilde S. Ruthb, Stefano Parmeggianic, Riccardo Patriarcaa, Nicola Paltrinierib a Department of Mechanical and Aerospace Engineering, Sapienza University, Rome, Italy b Department of Mechanical and Industrial Engineering, Norwegian University of Science and Technology, Trondheim, Norway c ISPIRA ETS, Rimini, Italy antonio.nakhal@uniroma1.it Climate change has contributed to an increasing frequency and severity of natural hazards accidents over recent years, and the increasing trend is expected to continue and escalate. Globally, demographics are changing and urbanization, population growth and increasing coastal populations make societies more exposed and vulnerable to extreme weather events. As a consequence, attention towards natural disasters is increasing along with the interest in approaches to manage emerging risks. Some industries have been experiencing major losses to hazards, while others might be hit harder in the future. Current research shows that there is a need to further investigate underlying reasons for variations in disaster timing, impacts, and outcomes, as well as mitigation strategies. The purpose of this research is to enhance the understanding of natural disaster mortality and unravel underlying causes and influential factors that can inform decision-making and be relevant for risk reduction efforts. This is achieved by analyzing natural hazards accidents data and using data science techniques to define data clusters and delve into the related factors affecting mortality. The climate-driven, natural disaster events from the International Disaster (EM-DAT) database have been thoroughly explored and visualized to obtain an overview of the current natural disaster situation. More specifically, this manuscript concerns the development of clustering algorithms and analytics to map fatalities and economic damage. The results of the analysis showed the extent to which climate change has a significant effect on resulting fatalities and economic losses from natural hazards accident scenarios. Besides the achieved results of this work, it is acknowledged how further studies should try to dynamically represent vulnerability as well as improve the quality and selection of integrated features to improve the representation of industrial aspects. 1. Introduction Data analysis of natural hazards accidents can aid risk management by shedding light on disaster characteristics, challenges, differences amongst regions, and similar events. Natural hazard management denotes the systematic actions focused on reducing the negative effects of disasters (Department of Regional Development and Environment Executive Secretariat for Economic and Social Affairs, 1991). Mitigation measures contribute to natural hazard management by minimizing, monitoring, and reducing the probability of severe consequences, the corresponding avoidable impacts, and the unfortunate outcomes of natural hazards (Sarkar and Maiti, 2020). The risk for individuals inflicted by natural hazard disasters differs based on societal vulnerability and exposure, and environmental conditions (ISDR, 2009). Climate change has forced more than 20 million people to move from their homes each year (Masika, 2013). The development level of a country might affect the consequences of a natural disaster. It is often remarked how those living in poverty are hardest hit despite being the least responsible for climate change. The increasing frequency of natural hazards led to greater attention worldwide devoted to mapping and reducing natural risks (Cruz et al., 2006), unraveling and explaining potential impacts on societies. Vulnerability in this 397 context can be a risk factor, but also an outcome: disaster exposure may lead to poverty causing damage to assets and livelihoods (Suarez-Paba and Cruz, 2022). Besides, larger natural disasters often cause extensive property damages and a high number of fatalities. Research has shown that natural disaster-related damages and mortality have increased in the past decades (Jacobsson et al., 2009; Masika, 2013). Research is needed to develop systematic approaches on disaster causes and impacts to improve responses, and anticipation capacity and design risk prevention and mitigating interventions prior to, or following major natural hazards verification. The International Disaster Database (EM-DAT) developed by the Centre for Research on the Epidemiology of Disasters (CRED) gathers data on natural disasters and maps them into different classification categories, impacts, and causes. This paper aims to study these climate-driven accidents in terms of societal impact, both on populations and properties, as they can be of relevance for industrial systems as well. The manuscript fully relies on EM-DAT and proposes a Machine Learning (ML) algorithm to investigate potential clusters of countries that show commonalities and subsequently can drive to common natural risk management mitigations. The focus of this manuscript spans from natural hazards accidents to technological accidents in order to ensure a wider perspective on all societal impacts. 2. Materials and Methods 2.1 Exploring the database The EM-DAT database was created following the 1980's investigation by the Centre for Research on the Epidemiology of Disasters (CRED). The study was carried out to serve the purposes of humanitarian action at national and international levels. The initiative aimed to rationalize decision-making for disaster preparedness, as well as provide an objective base to assess vulnerability and set priorities. The database is compiled from various sources, including United Nations agencies, non-governmental organizations, insurance companies, research institutes, and press agencies (e.g.), United Nations Department of Humanitarian Affairs (UN-DHA), European Union Humanitarian Office (ECHO), International Federation of the Red Cross and Red Crescent, the Office of Foreign Disaster Assistance (OFDA-USAID), International Committee of the Red Cross and Red Croissant (ICRCRC, Switzerland), International Decade for Natural Disaster Reduction (IDNDR) (Center for research on the Epidemiology of Disasters, 2021). Currently, EM-DAT collects more than 25000 disasters between 1900 - 2020. All the events in the EM-DAT database fulfill one or more of these entry criteria (Center for research on the Epidemiology of Disasters, 2021): - Kill (10 or more deaths) - Affect (100 or more people affected, injuries or homeless) - Declaration/Appeal (declaration by the country of a state of emergency and/or appeal for international assistance) The 25000 incidents worldwide involve 189 countries, distributed as follows: - About 15000 accidents are related to natural impacts (e.g., drought, extreme temperature, flood, landslide, storm, wildfire, etc.), - About 10000 accidents refer to technological impacts (i.e., industrial, transport, and miscellaneous impacts). The database incorporates 43 parameters (e.g., location, date, damage, fatalities, disaster type, origin, reconstruction cost, insured damage, appeal, impacts) to fully details the characteristics of the accident and allow the accident identification and analysis (Center for research on the Epidemiology of Disasters, 2021). 2.2 Data clustering through Machine Learning Machine learning (ML) is known for providing meaning to raw data and solving practical problems in a reliable and efficient way. These problems require machine assistance since the amount of data and the complexity of the statistical patterns imply that humans would not be able to solve them via traditional techniques (Burkov, 2019). ML algorithms learn from examples and are thereby trained to find patterns that can help make decisions and predictions based on new, unseen information (Sharda et al., 2019). A ML pipeline includes training, test, and validation processes. One example of ML refers to clustering. This latter is used to uncover meaningful groups within a dataset based on underlying patterns or structures. Clustering is commonly used for dimensionality reduction and the most common methods are density-based, hierarchical, partition-based, and grid-based methods. This descriptive data mining technique is unsupervised since there are no target values to predict (Murtagh and Contreras, 2012). The clustering algorithm relies on a distance matrix that is created by computing the distance between every pair of data points. For this reason, a clustering algorithm requires standardized, numerical input. 398 K-Means Clustering K-Means is one of the most frequently used and effective clustering algorithms, as proved by results obtained in several diverse application contexts (Zhang et al., 2017). K-Means is a general-purpose clustering method preferred for data where a flat geometry. The algorithm tries to group data by minimizing the within-cluster-sum- of-squares which represents the distance between each data point and the cluster centroid (Chen et al., 2005). The most common metric to compute distances in K-Means is the Euclidean distance, as it is flexible to accommodate different operational situations. Another characteristic of the algorithm is that it requires an explicit specification of the resulting number of clusters. The algorithm will always converge, but it is vulnerable to local minima. This will depend on how centroids are initialized. By running the algorithm with a specified number of clusters 𝑘, 𝑘 random samples from the dataset are allocated as cluster centroids. The main steps of the K- Means clustering algorithm are: - Initialization: the step to choose k initial centroids - Looping: the iterative step to stabilize centroids, until a certain threshold is reached, or a certain number of iterations has been run. This loop requires two sub-steps: o Assigning samples to their nearest centroid based on a selected distance measure. o Compute the mean of the assigned samples and create a new centroid. K-Means with Euclidean distance has been used to map countries’ clusters as they appear in the EM-DAT database. 3. Results The clustering algorithm allowed splitting the 189 countries involved in natural hazard accidents into 40 clusters of varied sizes. The algorithm runs on a set of selected features presented considered relevant for the scope of the analysis: Disaster ID; Country; Location; Year; Disaster group; Disaster subgroup; Disaster type; Event name; Total death; Total damages. The chosen algorithm relies on a distance matrix created computing the distance between every pair of data points. The algorithm has been performed to group data minimizing the within-cluster sum of squares, which represents the distance between each data point and the cluster centroid. Clusters must be validated to check the logical cohesion between the clustered items and to compare the separation among them. A useful metric for validating the significance of clusters is the silhouette, whose scores represent the distance from one sample to the samples in the neighboring clusters (Kingrani et al., 2017). Silhouette coefficients range between -1 and 1 where values close to 1 indicate high compactness within the cluster, which in turn implies longer distances among the sample and the neighboring clusters. Silhouette scores close to 0 indicate overlapping clusters, while negative values indicate a possible misplacement of the sample (Milligan and Cooper, 1985). When examining the obtained results, for demonstration purposes, this manuscript details only the two clusters presenting the higher cumulative number of fatalities. On this basis, cluster 12 (Poland, Germany, Japan, Vietnam, Bangladesh, and South Korea) and cluster 32 (Pakistan, Afghanistan, Iran, Nepal, Sri Lanka, Turkey, Romania, Algeria, and Yemen) being identified have been selected for further explorative statistics. Their average silhouette score was respectively 0,21 and 0,31. Only one element in cluster 12 showed a negative silhouette score (i.e., Poland, -0,04), and it has been manually removed from the following analysis. Details on individual silhouette scores can be retrieved in Table 1. Table 1. Items in the two clusters were selected for demonstrative purposes, ordered by silhouette score. Country Silhouette Score Cluster Inclusion Japan 0,3720 Cluster 12 Included France 0,2832 Cluster 12 Included South Korea 0,2367 Cluster 12 Included Bangladesh 0,2164 Cluster 12 Included Germany 0,1039 Cluster 12 Included Vietnam 0,0588 Cluster 12 Included Poland - 0,0404 Cluster 12 Excluded Afghanistan 0,4990 Cluster 32 Included Nepal 0,4953 Cluster 32 Included Turkey 0,4266 Cluster 32 Included Iran 0,4196 Cluster 32 Included Pakistan 0,4061 Cluster 32 Included Sri Lanka 0,1781 Cluster 32 Included Alegria 0,1735 Cluster 32 Included Romania 0,1510 Cluster 32 Included Yemen 0,0346 Cluster 32 Included 399 Table 2 proposes a country classification by a number of deaths and economic damage for the countries being previously selected. It is possible to observe that Bangladesh, Japan, and France account for 98,32% of total deaths count in their cluster, with Bangladesh presenting 89,80%. On the other hand, Japan, Germany, and France represent the 90,93% (75,57%; 8,85% and; 6,50% respectively) of the economic damage in their cluster. Similarly, for cluster 32 the countries: Pakistan, Iran, and Turkey account for 79,06% of total deaths, with however a flatter distribution than the one in cluster 12, i.e. Pakistan 32,14%; Iran 29,04%; and Turkey 17,87%. The same three countries, Pakistan (24,16%), Iran (24,07%), Turkey (23,22%), plus Algeria (10,10%) account for 81,56% of the economic damage in their cluster. Overall, cluster 12 involves 2.769.968 reported deaths, 82,64% more than cluster 32. Likewise, cluster 12 has 82,33% economic damage losses reported more than cluster 32. Table 2. Country classification by death and economic damage (ordered by number of deaths). Country Total deaths Total economic damage [$] Bangladesh 3.010.075 21.893.565 Japan 250.305 534.091.500 Pakistan 186.943 30.157.109 Iran 168.942 30.049.696 Turkey 103.996 28.986.670 Sri Lanka 41.046 4.475.364 France 35.177 45.956.100 Vietnam 29.545 23.404.066 Nepal 27.224 6.836.415 Afghanistan 26.890 603.320 Germany 13.775 62.575.505 Algeria 13.382 12.614.846 South Korea 12.734 18.748.034 Yemen 7.539 4.894.400 Romania 5.681 6.199.920 Table 3 proposes a country classification in terms of deaths and economic damage by impact types and technological impacts, this latter divided into sub-types (industrial, transport, miscellaneous). Besides, the analysis has been separated into two parts. In terms of Natural impacts, Bangladesh has the higher number of fatalities in the reports with 2.993.988 deaths, followed by Japan with 239.374 deaths. Moreover, Japan has the higher economic damage 533.908.500 $, followed by Germany with 61.978.605 $. Similarly, about Technological impacts: for accidents related to industries and transports, Bangladesh has the higher number of reported fatalities (1.809, and 13.261 deaths respectively), followed by Germany with 1.650 deaths (industrial impacts) and Pakistan 6.184 deaths (transport impacts). Likewise, Algeria has the higher economic damage reported in the industrial impacts (800.000 $). Nevertheless, South Korea presents 38.400 $ related to transport issues. Figure 1. Trend over time of the number of natural hazards accidents (blue line) and deaths, the sum of the values for the two clusters being analyzed (cluster 12, cluster 32). Errore. L'origine riferimento non è stata trovata. represents a combined line chart describing the count of natural hazard accidents reported and the death losses (logarithmic scale to facilitate comparison) over time. The figure shows a peak in the decades of 1940s, mainly due to a Drought disaster that occurred in Bangladesh accounting for 1.900.000 deaths approximately. Besides, it is possible to observe how the behavior over the 400 decades in terms of count natural hazard accidents is increasing, and the death losses in the last two decades are decreasing. Table 3. Country classification for detailed impact, with a focus on technological aspects (industrial, transport, and miscellaneous). Background highlights maximum value per category. Country Natural Impact Technological Impact Industrial Transport Miscellaneous Deaths E. Dmg. [$] Deaths E. Dmg. [$] Deaths E. Dmg. [$] Deaths E. Dmg. [$] Afghanistan 25.197 603.320 280 No data 1.184 No data 229 No data Algeria 11.874 11.814.846 27 800.000 1.280 No data 201 No data Bangladesh 2.993.988 21.893.565 1.809 No data 13.261 No data 1.017 No data France 28.864 45.892.700 1.223 36.800 3.571 No data 1.519 26.600 Germany 10.419 61.978.605 1.650 226.300 1.618 No data 88 370.600 Iran 163.347 29.899.696 196 No data 4.682 No data 717 150.000 Japan 239.374 533.908.500 41 160.500 4.150 16.500 5.840 6.000 Nepal 24.871 6.835.155 No data No data 2.159 No data 194 1.260 Pakistan 178.840 29.955.969 861 179.080 6.184 No data 1.058 22.060 Romania 4.939 6.199.920 60 No data 587 No data 95 No data S. Korea 9.111 18.516.257 303 167.300 1.658 38.400 1.662 26.077 Sri Lanka 40.057 4.475.364 25 No data 871 No data 93 No data Turkey 97.086 28.708.670 1.239 No data 3.084 No data 2.587 278.000 Vietnam 27.240 23.399.566 762 2.000 1.354 No data 189 2.500 Yemen 5.623 4.894.400 129 No data 1.647 No data 140 No data Figure 2 represents a combined line chart describing the count of natural hazard accidents and the economic damage (logarithmic scale to facilitate comparison) over time. It is possible to notice the economic losses reported for the clusters being analyzed show an exponential trend over time, differently from deaths losses. Moreover, it is possible to observe how the society and countries has been enhance the safety constrains in the industrial processes affected by the natural hazards accidents to reduce the occupational losses, instead, economic damage related by them has increased over the year. These results could be a signal of the increased quality of reporting over years as well as potential improvements in safety management and risk assessment of natural hazards. More specifically, a larger number of events with smaller losses in EM-DAT may be used for analytics and support strategic decision-making also in relation and comparison with world risk indexes (Aleksandrova et al., 2021) Figure 2. Trend over time of the number of natural hazards accidents (blue line) and economic damage (dark blue line) for the two clusters being analysed (cluster 12, cluster 32). 4. Conclusions The overall goal of the analysis presented in this paper is to demonstrate the possible usage of data about natural disasters and their implications for societal safety and industrial management. The work presents methodological results obtained from clustering algorithms and analytics referred to fatalities and economic 401 damage. The purpose of the clustering was to find a relevant group of countries that could facilitate inter-country learning opportunities and create actionable insights. The defined similarity is based on a set of general exposure and sensitivity features. Furthermore, the purpose of the proposed analytics may be helpful for policymaker to facilitate comparative analysis of fatality patterns and external factors that affect mortality subsequent to natural and technological events. Overall, the extension of these results to the entire EM-DAT database proves how natural hazards generate more impactful consequences than technological disasters. While from a societal point of view, this result shall motivate the need to invest in both protective and preventive mitigating measures, the extent of natural hazards should also push for specific interest on industrial safety, especially to prevent disastrous cascading consequences. These early results require further refinement and improvements to further shape future risk learning processes. In this regard, they also constitute the basis for potential additional analyses, (e.g.) using generative model, anomaly detection, as for promising research in this area (Nakhal A. et al., 2021; Patriarca et al., 2022). The analytics may also be used in larger Business Intelligence (BI) solutions to support a multi-variate dynamic analysis both descriptive and predictive (Nakhal A. et al., 2021) if incorporating other ML solutions. A joint BI- ML development may indeed be a crucial instrument to support decision-makers at having a comprehensive understanding of natural hazards to shape risk prevention and mitigation programs. References Aleksandrova, M., Balasko, S., Kaltenborn, M., Malerba, D., Mucke, P., Neuschafter, O., Radtke, K., Prutz, R., Strupat, C., Weller, D., Wiebe, N., 2021. The World Risk Index 2021, World Risk Report 2021 F. Burkov, A., 2019. Machine Learning Engineering. ISSN 2502-3632 ISSN 2356-0304 J. Online Int. Nas. Vol. 7 No.1, Januari – Juni 2019 Univ. 17 Agustus 1945 Jakarta 53 9 , 1689–1699. Center for research on the Epidemiology of Disasters, C., 2021. The international Disaster Database [WWW Document]. URL https://www.emdat.be/ (accessed 12.2.21). Chen, T.S., Tsai, T.H., Chen, Y.T., Lin, C.C., Chen, R.C., Li, S.Y., Chen, H.Y., 2005. A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray. Proc. 2005 Int. Symp. Intell. Signal Process. Commun. Syst. ISPACS 2005 2005, 405–408. doi:10.1109/ispacs.2005.1595432 Cruz, A.M., Steinberg, L.J., Vetere-Arellano, A.L., 2006. Emerging issues for natech disaster risk management in Europe. J. Risk Res. 9 5 , 483–501. doi:10.1080/13669870600717657 Department of Regional Development and Environment Executive Secretariat for Economic and Social Affairs, O., 1991. Chapter 2 - Natural Hazard Risk Reduction in roject Formaulation and Evaluation [WWW Document]. URL https://www.oas.org/dsd/publications/Unit/oea66e/ch02.htm#chapter 2 natural hazard risk reduction in project formulation and evaluation (accessed 12.13.21). Jacobsson, A., Sales, J., Mushtaq, F., 2009. A sequential method to identify underlying causes from industrial accidents reported to the MARS database. J. Loss Prev. Process Ind. 22 2 , 197–203. doi:10.1016/j.jlp.2008.12.009 Kingrani, S.K., Levene, M., Zhang, D., 2017. Estimating the number of clusters using diversity. Artif. Intell. Res. 7 1 , 15. doi:10.5430/air.v7n1p15 Masika, R., 2013. Gender, Development and Climate Change. Oxfam GB, Oxford. Milligan, G.W., Cooper, M.C., 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50 2 , 159–179. doi:10.1007/BF02294245 Murtagh, F., Contreras, P., 2012. Algorithms for hierarchical clustering: an overview. WIREs Data Min. Knowl. Discov. 2 1 , 86–97. doi:10.1002/widm.53 Nakhal A., A.J., Patriarca, R., Di Gravio, G., Antonioni, G., Paltrinieri, N., 2021. Business intelligence for the analysis of industrial accidents based on MHIDAS database. Chem. Eng. Trans. 86, 229–234. doi:10.3303/CET2186039 Patriarca, R., Di Gravio, G., Cioponea, R., Licu, A., 2022. Democratizing business intelligence and machine learning for air traffic management safety. Saf. Sci. 146 August 2021 , 105530. doi:10.1016/j.ssci.2021.105530 Sarkar, S., Maiti, J., 2020. Machine learning in occupational accident analysis: A review using science mapping approach with citation network analysis. Saf. Sci. 131, 104900. doi:10.1016/j.ssci.2020.104900 Sharda, R., Delen, D., Efraim, T., 2019. Analytics, Data Science, & Artificial Intelligence: Systems for Decision Support, Eleventh e. ed. Pearson, Hoboken, NJ. Suarez-Paba, M.C., Cruz, A.M., 2022. A paradigm shift in Natech risk management: Development of a rating system framework for evaluating the performance of industry. J. Loss Prev. Process Ind. 74. doi:10.1016/J.JLP.2021.104615 Zhang, Y., Mańdziuk, J., Quek, C.H., Goh, B.W., 2017. Curvature-based method for determining the number of clusters. Inf. Sci. (Ny). 415–416, 414–428. doi:10.1016/j.ins.2017.05.024 402 105nakhalakel.pdf A Machine Learning Approach to Analyze Natural Hazards Accidents Scenarios