CHEMICAL ENGINEERING TRANSACTIONS, VOL. 82, 2020
A publication of The Italian Association of Chemical Engineering
Online at www.cetjournal.it
Guest Editors: Bruno Fabiano, Valerio Cozzani, Genserik Reniers
Copyright © 2020, AIDIC Servizi S.r.l.
ISBN 978-88-95608-80-8; ISSN 2283-9216

A Data Driven Model for Ozone Concentration Prediction in a Coastal Urban Area

Tomaso Vairo a,c,*, Andrea Rapuzzi b, Mario Lecca c, Bruno Fabiano a

a DICCA - Civil, Chemical and Environmental Engineering Dept. – Genoa University, via Opera Pia 15 - 16145 Genoa, Italy
b A-SIGN S.r.l - via XXV Aprile 10/3a - 16121 Genoa, Italy
c ARPAL, via Bombrini 8 - 16149 Genoa, Italy
tomaso.vairo@edu.unige.it

Ozone concentration in the coastal area under study is highly relevant in connection with “photochemical smog”, owing to high levels of solar radiation and temperature and to the possible photochemical oxidation of volatile organic compounds (VOCs) in the presence of nitrogen oxides (NOx). In this paper, a framework for predicting ozone concentration in an urban area is presented, relying on the LightGBM algorithm for gradient boosting on decision trees. The system represents a pragmatic and scientifically credible approach to data driven modelling applied to complex and uncertain situations. The study concerns the application of standard data analytics methodologies to air quality analysis, including the pre-treatment of data, the choice of a suitable configuration of the learning algorithm, the identification of the fitting parameters and error minimization. Training and verification data are significant time series covering the past years, validated and supplied by the air quality monitoring network in the urban area of Genoa (Italy).

Keywords: air quality, data driven model, machine learning, ozone, environmental quality.

1.
Introduction

The protection of air quality from pollution and the reduction of greenhouse gas emissions are essential goals gaining increasing attention in international and national strategies and policies. In this context, many attributes (such as safety, environment, reputation, policy, costs, etc.) need to be properly taken into consideration when prioritising safety investments in plants and industry (Abrahamsen et al., 2020). The transition to a low-carbon economy and the ambition to reach net zero emissions offer research challenges in pollution prevention, e.g. by advanced pyrolysis processes recovering mass and energy (Chiarioni et al., 2006). Beyond emission reduction, enhancing climate change resilience requires advanced pollution modelling and forecasting for both emergency situations (Fabiano et al., 2017) and conventional environmental risk assessment (Sikorova et al., 2017).

Two different approaches can be distinguished in air pollution modelling. The former relies on atmospheric dispersion modelling of pollutants by simulating diffusive and transport mechanisms (e.g. Vairo et al., 2014) and, once the source terms and the chemical processes involved are correctly defined, can be properly applied also to non-stationary sources (Vairo et al., 2017). The latter is based on advanced statistical models, such as machine learning methodologies, e.g. relying on the statistical processing of data from air monitoring networks.

Table 1: Ozone (O3) reference values set down by Italian legislation.
Reference                                    Ozone concentration
Information threshold on the hourly average  180 μg/m3
Alarm threshold on the hourly average        240 μg/m3 for 3 consecutive hours
Target value on 8-hour average               120 μg/m3 as daily, not to be exceeded more than 25 times/y
Long-term target value on 8-hour average     20 μg/m3 as daily average

DOI: 10.3303/CET2082064
Paper Received: 5 January 2020; Revised: 27 March 2020; Accepted: 20 July 2020
Please cite this article as: Vairo T., Rapuzzi A., Lecca M., Fabiano B., 2020, A Data Driven Model for Ozone Concentration Prediction in a Coastal Urban Area, Chemical Engineering Transactions, 82, 379-384 DOI:10.3303/CET2082064

As in the risk assessment domain, the main challenges for improvement lie in the application of machine learning techniques and the exploitation of big data (De Rademaeker et al., 2014). In this regard, predictive ability is strictly connected to spatial and time interpolation schemes, such as Multiple Linear Regression (MLR), or Artificial Neural Networks (ANN) for non-linear problems (e.g. Wang & Quian, 2018).

Air pollution increases the risk of respiratory and heart disease and is recognized as a major environmental and health risk. Tropospheric ozone is a secondary pollutant, formed by chemical reactions occurring in the atmosphere starting from its precursors (nitrogen oxides and volatile organic compounds) under high solar radiation and elevated temperature. Ozone pollution is a characteristic phenomenon of the summer period, with the highest concentrations usually recorded in the afternoon, in suburban areas located leeward of the main urban areas. The forecasting ability of a well-developed data driven model can outperform the predictions attained by mechanistic models, owing to the inherent approximations and uncertainties in the estimation of emission sources (Cobourn et al., 2000).
Table 2: Exceedances of the target and long-term target values set out by legislation in the year 2018.

Urban station    Target value exceedances [day]  Long-term target value exceedances [day]
Quarto           69                              6
Corso Firenze    52                              9
Parco Acquasola  108                             89

The reference values for health protection set by Italian legislation, in terms of non-compliance limits, are summarized in Table 1. Table 2 summarizes the number of days in the year 2018 exceeding the target value and the long-term target value, as measured by the monitoring network (Regione Liguria, 2018). The main operational tools for air quality planning are monitoring systems and the regional inventory of emissions, with indications on the regulatory framework. In order to plan useful actions for achieving environmental objectives, it is important to have reliable forecasting tools. The focus of this work is to evaluate the results that advanced data analysis techniques (i.e. proper regularization, data pre-treatments) coupled with a learning algorithm framework can achieve in reliably forecasting ozone concentrations.

Focusing on trend forecasting, the remainder of this paper is organized as follows. Section 2 describes the methodology, including the modelling dataset and the learning model; Section 3 presents the data statistics and the forecasting results, with a description of the contributions made; Section 4 draws the conclusions, with the strengths of the proposed technique and future work.

2. Methodology

2.1 Data collection and preprocessing

In this paper, we consider air quality and meteorological data measured in the urban area of Genoa (Italy) over the time span May 2015 - December 2018. Raw data were obtained for three metropolitan zones of Genoa, i.e. Quarto, Corso Firenze and Parco Acquasola. Upon validation, data have been statistically processed on a daily basis (García et al., 2011). The following input variables have been considered:
1.
Time variables: day of the year (doy), day of the week (dow), month.
2. Meteorological variables (daily aggregate): mean sea level pressure (MSLP), solar radiation (SLHR, SSHR), temperature (TEMP), wind direction and speed (UWIND, VWIND, MOD), humidity (HUM) and rain (RAIN).
3. Pollutants (daily aggregate): ozone (O3) daily mean.
4. Bank holiday information for each day (true or false), to consider the influence of holidays on ozone concentration.

As suggested by Eapi et al. (2013), the time variables under heading 1 were transformed via trigonometric functions, in order to account for the cyclic nature of their impact. The meteorological variables under heading 2 have been summarized at daily frequency using minimum, maximum, average and standard deviation functions. Additionally, in order to improve the prediction accuracy by exploiting correlated information, we have added to each row redundant values referring to previous time spans (e.g. meteorological values from one day before or from one year before).

2.2 Validation strategy

In accordance with best practices for time series validation, we cross-validated the results according to a customized Walk-Forward approach (Cao & Tay, 2003), as detailed in the following. This approach allows a robust estimation of the model performance without leaking information from the training set to the validation set; a general example is shown in Figure 1.

Figure 1: Validation results based on a Walk-Forward approach A. (Folds 1 to 7; legend: training set, validation set, test set, unused time.)

Figure 2: Validation results based on a Walk-Forward approach B. (Folds 1 to 7; legend: training set, validation set, test set, unused time.)
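The preprocessing of Section 2.1 can be illustrated with a minimal Python sketch. It shows a sine/cosine encoding of a cyclic time variable, the four daily summary statistics, and a lagged copy of a daily series. The function names, the 365.25-day period and the sample numbers are assumptions made for illustration only; the paper does not specify the exact transformations used.

```python
import math
import statistics
from typing import Dict, List, Optional, Sequence, Tuple


def cyclic_encode(value: float, period: float) -> Tuple[float, float]:
    """Map a periodic variable onto the unit circle, so that the end of a
    cycle (e.g. 31 December) lies next to its start (1 January)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def daily_summary(samples: Sequence[float]) -> Dict[str, float]:
    """Minimum, maximum, average and standard deviation of one day's samples."""
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.fmean(samples),
        "std": statistics.pstdev(samples),  # population std: an assumption
    }


def add_lag(values: Sequence[float], lag: int) -> List[Optional[float]]:
    """Shift a daily series by `lag` days; rows without a past value get None."""
    return [values[i - lag] if i >= lag else None for i in range(len(values))]


# Hypothetical usage on toy data.
doy_sin, doy_cos = cyclic_encode(200, 365.25)   # day of the year
dow_sin, dow_cos = cyclic_encode(3, 7)          # day of the week
temp_features = daily_summary([18.0, 21.5, 26.0, 23.5])
temp_mean_lag1 = add_lag([20.0, 22.0, 25.0], lag=1)
```

With this encoding, day 365 and day 1 end up close to each other in feature space, which a raw day-of-year integer would not guarantee.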
Each fold consists of the following sub-sets:
• Training: contains data points belonging to the time interval from T0 to Tt (included), where T0 is the oldest available data point and Tt – T0 is a time interval sufficient to train the model on the problem
• Validation: contains data points belonging to Tt+1
• Test: contains data points belonging to Tt+2
• Unused: contains data points more recent than Tt+2

Validation data are used to perform early stopping of the learning process, i.e. to identify when the model starts to overfit. When a single day is used as the validation interval, the variance increases due to, among other factors, a premature stopping of the training caused by an initial (random) good fit of the model to the small validation set. In order to limit this noise in the early stopping process, we tested two strategy implementations, as follows.

A. Validation scheme based on a 7-day interval for the validation set. As evidenced in Fig. 1, since the final model performance is measured on the test set (whose interval is kept one day long), a small data leakage between the training and validation sets can be introduced in order to stabilize the validation score and the early stopping strategy.
B. Validation scheme based on running a small number of training epochs without early stopping before the full training process. In this case, the resulting trend will show a “warm-up” step (see Figure 2).

2.3 Learning algorithm

We selected LightGBM, a learning method based on the Decision Tree algorithm and frequently used in classification tasks (Zhang et al., 2019). LightGBM is implemented as a highly optimized library that performs very well on structured/tabular data problems and is capable of gracefully managing a mix of scalar and categorical variables (Ke et al., 2017).
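The fold layout of Section 2.2, on which the learner described here is trained and evaluated, can be sketched independently of the model. The generator below is an illustrative assumption rather than the authors' code: `walk_forward_folds`, `min_train` and `val_days` are hypothetical names, and days are represented by integer indices. For `val_days=7` it is assumed that the validation window ends at Tt+1 and reaches back into the most recent training days, which is one possible reading of the "small data leakage" mentioned for variation A.

```python
from typing import Iterator, List, NamedTuple


class Fold(NamedTuple):
    train: List[int]       # day indices T0 .. Tt (included)
    validation: List[int]  # day indices used for early stopping
    test: List[int]        # the single day Tt+2


def walk_forward_folds(n_days: int, min_train: int,
                       val_days: int = 1) -> Iterator[Fold]:
    """Yield walk-forward folds over the day indices 0 .. n_days - 1.

    Assumption: with val_days > 1 the validation window still ends at
    Tt+1 and overlaps the last val_days - 1 training days.
    """
    if val_days < 1 or min_train < val_days:
        raise ValueError("need 1 <= val_days <= min_train")
    # t is the index of Tt; the test day Tt+2 must exist in the series.
    for t in range(min_train - 1, n_days - 2):
        train = list(range(0, t + 1))
        validation = list(range(t + 2 - val_days, t + 2))
        test = [t + 2]
        yield Fold(train, validation, test)


# Hypothetical usage: ten days of data, at least five training days.
folds = list(walk_forward_folds(n_days=10, min_train=5))
folds_7day = list(walk_forward_folds(n_days=10, min_train=7, val_days=7))
```

In every fold the test day is strictly later than all training and validation days, so the test score is never affected by the overlap assumed for the 7-day window.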
However, research on LightGBM applications for spatial forecasting in the field of air quality is limited. It is worth noting that the system used for data assimilation, network construction and learning, testing and validation is entirely based on open source statistical processing software. In the next section, the application of the validation schemes is discussed in detail, in order to show how the newly built model achieves a better fit and a better predictive capability.

3. Results and discussion

Several test runs were performed with a wide Walk-Forward window in order to test the model convergence and its dependence on the size of the training data. As depicted in Figure 3 a, higher folds use a progressively smaller training set: the figure provides an example of the performance in terms of Mean Absolute Error on the validation and test sets across 300 folds. Validation and test scores are quite noisy, but their average (over the interval Fold 0 to Fold i) tends to converge. After nearly 130 folds the model performance degrades slowly. The trend is even more evident in the validation and test moving averages (across 20 folds) for the same experiment, depicted in Figure 3 b, which show that the model needs to be trained with nearly 80% of the available data to reach top performance.

Figure 3: (a) Model Performance by Fold Id; (b) Model Performance by Fold Id - Moving Averages

Variation A, based on the 7-day validation strategy described above, stabilized the validation score and increased the test performance, as shown in Figure 4 a. Conversely, variation B, based on the short training pre-run outlined previously, was ineffective in stabilizing the validation score but provided the best overall test performance (see Fig. 4 b). The model is sensitive to the size of the modelling data and its performance degrades when data are too few.
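The quantities plotted against the fold index in this section (the Mean Absolute Error of each fold, its average over the interval Fold 0 to Fold i, and a moving average across a fixed number of folds) can be computed as in the following sketch. The helper names and the toy numbers are hypothetical, and a trailing window is assumed for the moving average, since the text does not specify how the window is aligned.

```python
from typing import List, Sequence


def mean_absolute_error(observed: Sequence[float],
                        predicted: Sequence[float]) -> float:
    """Average absolute difference between observations and predictions."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def cumulative_average(scores: Sequence[float]) -> List[float]:
    """Average of the scores over the interval Fold 0 .. Fold i, for each i."""
    averages, total = [], 0.0
    for count, score in enumerate(scores, start=1):
        total += score
        averages.append(total / count)
    return averages


def moving_average(scores: Sequence[float], window: int = 20) -> List[float]:
    """Trailing moving average; the first folds use a shorter window."""
    averages = []
    for i in range(len(scores)):
        chunk = scores[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Hypothetical usage on toy per-fold scores (not results from the paper).
fold_scores = [8.0, 10.0, 6.0, 16.0]
running = cumulative_average(fold_scores)        # [8.0, 9.0, 8.0, 10.0]
smoothed = moving_average(fold_scores, window=2)  # [8.0, 9.0, 8.0, 11.0]
```

The cumulative average damps the fold-to-fold noise and makes convergence visible, while the moving average responds faster to a late degradation of the scores.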
In order to provide a reference point, a naïve prediction has been performed using the previous-day value as the prediction.

Figure 4: (a) Validation Performance - Variation A; (b) Validation Performance - Variation B.

Table 3: Model scores on validation and testing.

Configuration                      Validation mean score  Test mean score
Naïve prediction                   NA                     14.011
Walk-Forward                       6.509                  9.456
A - Walk-Forward 7-day validation  8.593                  9.066
B - Walk-Forward pre run           8.320                  8.875

Table 3 summarizes the model performance in terms of validation and test scores for the different configurations previously outlined. The comparison between the ozone concentrations [ppm] experimentally observed (ground truth) and the model predictions is depicted in Figure 5.

Figure 5: Predicted Ozone concentration [ppm] vs experimental values [ppm] (Ground Truth).

Figure 6 compares the ozone concentrations [ppm] experimentally observed (ground truth) with the naïve prediction: the naïve predictions are evidently less clustered around the identity line, confirming that the model yields satisfactory predictions. Because of its predictive ability, this method can be used not only to forecast surface ozone concentrations but also, upon proper refinement, to predict other air pollutants.

Figure 6: Predicted Ozone concentration [ppm] vs experimental values [ppm] (Ground Truth) and Naive Prediction.

4. Conclusions

The work presented in this study examines the feasibility of applying a machine learning algorithm based on gradient boosting techniques to predict the concentration of O3 in the metropolitan area of the city of Genoa. The model is based on a relatively novel algorithm used in many different kinds of data mining tasks, such as classification, regression and ranking, while its application in the given urban context is still rather limited. The predictive model was trained with meteorological data, ozone measurements in three urban areas, and time variables, all suitably pretreated as described above.
The best cross-validation strategy was then selected in order to balance bias and variance in the prediction results and thus avoid situations of under-specification and over-specification. The model thus built reached a test mean score of 8.875 in its best configuration, against 14.011 for the naïve prediction (Table 3). This work complements and improves on the previous predictive model developed for PM10 prediction (Vairo et al., 2019), which was based on a Bayesian inference approach. As a further refinement, it would be interesting to extend the framework to nitrogen oxides concentration, in order to develop an overall predictive system for the main pollutants relevant to photochemical pollution and their synergistic environmental impact.

References

Abrahamsen E.B., Milazzo M.F., Selvik J.T., Asche F., Abrahamsen H.B., 2020, Prioritising investments in safety measures in the chemical industry by using the Analytic Hierarchy Process, Reliability Engineering & System Safety, 198, article 106811.
Cao L.J., Tay F., 2003, Support vector machine with adaptive parameters in financial time series forecasting, IEEE Transactions on Neural Networks, 14(6), 1506-1518.
Chiarioni A., Reverberi A.P., Fabiano B., Dovì V.G., 2006, An improved model of an ASR pyrolysis reactor for energy recovery, Energy, 31, 2460-2468.
Cobourn W.G., Dolcine L., French M., Hubbard M.C., 2000, A comparison of nonlinear regression and neural network models for ground-level ozone forecasting, J. Air Waste Manage. Assoc., 50, 1999-2009.
De Rademaeker E., Suter G., Pasman H.J., Fabiano B., 2014, A review of the past, present and future of the European Loss Prevention and Safety Promotion in the Process Industries, Process Safety and Environmental Protection, 92, 280-291.
Eapi G.R., Sattler M., Manry M.T., 2013, Comprehensive ozone forecasting model using neural networks, Conference paper. https://www.researchgate.net/publication/280026476.
Fabiano B., Vianello C., Reverberi A.P., Lunghi E., Maschio G., 2017, A perspective on Seveso accident based on cause-consequences analysis by three different methods, Journal of Loss Prevention in the Process Industries, 49, 18-35.
García I., Rodríguez J.G., Tenorio Y.M., 2011, Artificial Neural Network Models for prediction of ozone concentrations in Guadalajara, Mexico, Air Quality-Models and Applications, InTechOpen 2011.
Ke G., Meng Q., Finley T., Wang T., Chen W., Ma W., Ye Q., Liu T., 2017, LightGBM: A highly efficient gradient boosting decision tree, Advances in Neural Information Processing Systems, 30, 3146-3154.
Regione Liguria, ARPAL, Valutazione annuali di qualità dell’aria, 2018, Regional annual air quality report, http://www.ambienteinliguria.it/eco3/DTS_GENERALE/20191014/ValutazioneAnnuale_2018.pdf
Sikorova K., Bernatik A., Lunghi E., Fabiano B., 2017, Lessons learned from environmental risk assessment within the framework of Seveso Directive in Czech Republic and Italy, Journal of Loss Prevention in the Process Industries, 49, 47-60.
Vairo T., Currò F., Scarselli S., Fabiano B., 2014, Atmospheric emissions from a fossil fuel power station: dispersion modelling and experimental comparison, Chemical Engineering Transactions, 36, 295-300, DOI:10.3303/CET1436050
Vairo T., Del Giudice T., Quagliati M., Barbucci A., Fabiano B., 2017, From land- to water-use-planning: A consequence-based case-study related to cruise ship risk, Safety Science, 97, 120-133.
Vairo T., Lecca M., Trovatore E., Reverberi A., Fabiano B., 2019, A Bayesian Belief Network for local air quality forecasting, Chemical Engineering Transactions, 74, 271-276, DOI:10.3303/CET1974046
Wang B., Quian F., 2018, Three-dimensional gas dispersion modeling using cellular automata and artificial neural network in urban environment, Process Safety and Environmental Protection, 120, 286-30.
Zhang Y., Wang Y., Gao M., Ma Q., Zhao J., Zhang R., Wang Q., Huang L., 2019,
A predictive data feature exploration-based air quality prediction approach, IEEE Access, 7, 30732-30743.