International Journal of Informatics Information System and Computer Engineering 1 (2020) 35-46

Air Quality Prediction in Smart City’s Information System

Ivan Kristianto Singgih
Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology, Daejeon, 305-701, Republic of Korea
Correspondence: E-mail: ivanksinggih@gmail.com

ABSTRACT

New technology and growing computational power enable a city to use more data. Such a city is called a smart city: it records more data related to daily life activities and analyzes them to provide better services. The data acquisition and analysis must be conducted quickly to support real-time information sharing and other decision-making processes. Among such services, an information system is used to predict the air quality to protect the health of the people in the city. The objective of this study is to compare various machine learning techniques (e.g., random forest, decision tree, neural network, naïve Bayes, etc.) for predicting the air quality in a city. For the comparison, we remove records with empty values, divide the data into training and testing datasets, and apply the k-fold cross-validation method. Numerical experiments are performed using a publicly available online dataset. The results show that the three best methods are random forest, gradient boosting, and k-nearest neighbors, with precision, recall, and f1-score values more than 0.63.

ARTICLE INFO

Article History: Received 8 Nov 2020; Revised 20 Nov 2020; Accepted 25 Nov 2020; Available online 26 Dec 2020
1. INTRODUCTION

A smart city integrates the physical world and the virtual world. One concept used to achieve this connectivity is the digital twin, a virtual model that represents the physical world (Marr, 2020). Such a virtual model allows monitoring the physical system, preventing problems before they happen, finding new opportunities, and planning for the future. The interaction is illustrated by Smart City Korea (2020) in Fig. 1. Massive IoT, digital twins, and data hubs are used to generate the required information in the integration process. In the smart city, fixed and mobile sensors are installed within the city to observe real behaviors (e.g., of the people) and to operate the virtual world better.

A smart city has various subareas, including smart mobility, smart buildings, etc. Among them, the smart environment is the one that manages air pollution control to ensure the health of the citizens (Alvear et al., 2018). The continuous improvement of the smart environment system is supported by the wide use of the Internet of Things, which provides better connectivity between multiple sensors located in dispersed areas and eases the air quality monitoring process (Zhang & Woo, 2020).

Air pollutants harm the respiratory system. Such pollutants include nitrogen dioxide (NO2), carbon monoxide (CO), ozone (O3), sulphur dioxide (SO2), and particulate matter (PM). Many cities have built real-time monitoring stations to check the air quality, inform people when it is safe to conduct outside activities, and help them plan better movements (Zhang & Woo, 2020). Systems for collecting and assessing air quality have been installed in several areas, e.g., Peking University (with 100 thousand data points from 30 devices) (Hu et al., 2019), Christchurch, which is a part of IBM’s smart city initiatives (Marek et al., 2017), Los Angeles (Wu et al., 2017), etc.
Various information systems support data collection and the transfer of air quality information to the people. An example of the information system used for air quality management in Los Angeles is presented in Fig. 2 (Wu et al., 2017). In this implementation, data can be collected remotely using a smartphone. The collected image data are analyzed using machine learning to calculate the particle concentration in the air and evaluate the air quality.

Fig. 1. Interaction between physical and virtual world in smart city concept

Fig. 2. Information system for air quality management in Los Angeles

2. METHODOLOGY

An information system in a smart city has a component for data acquisition and a server to store and process the obtained data. Good adoption of technologies for data collection and computation determines the success of smart city developments (Marek et al., 2017). Given that multiple sources of data (e.g., open data, online data sharing) are emerging, improving information system interoperability, including how to utilize existing data, is a great challenge to be solved in smart city projects. Many smart city initiatives have been started. One of them is ERA-PLANET, a wide European network that consists of 118 researchers in 35 institutions and 18 countries (Tsinganos et al., 2017).

The architecture of the information system related to the air quality sensing process performs the following tasks (Alvear et al., 2018):

1) Sampling
The sampling task measures the pollutants in the air and includes the calibration process. Sampling with many mobile sensors mitigates the problem of sampling error, because redundant data and statistical analysis can be considered.
2) Data filtering
Through the filtering process, redundant data and wrong measurements are removed.
3) Data transfer
The collected data are uploaded from the sensors to the cloud (servers). The upload process is managed based on IoT protocols.
4) Data processing
The observed data are processed to reach a conclusion on the air quality. Through this process, a pollution distribution map is generated.
5) Presentation of the analysis result
The results can be presented as a graphical map.

The architecture itself can be divided into three layers (Schürholz et al., 2020):

1) Data layer
The data layer contains a database of historical data and prediction data.
2) Logic layer
The logic layer converts the input data before they are used in the analysis and performs the prediction process.
3) Visualization layer
The visualization layer passes the information to be visualized to the end-user devices.

The introduction of inexpensive small sensors allows retrieving a huge amount of data in real time (Hu et al., 2019). Machine learning techniques are implemented in this study to perform such a real-time air quality assessment. The techniques used in this study are listed in Table 1. The selected methods have performed well for air quality prediction in earlier studies. Studies that used each method (or its variants) are presented as well.

3. RESULTS AND DISCUSSION

We use the Python sklearn library (Pedregosa et al., 2011) to implement the machine learning techniques. The code is written using the Visual Studio 2019 Community platform. A partial view of the code is presented in Fig. 3. We use the air quality dataset provided by Bhat (2020). From the 26,6219 records in the dataset, we remove records with any empty values and obtain 4,646 records to be used in our study. We exclude the location and time stamp fields from the observed data.
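The record filtering described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in the study: the column names (`City`, `Date`, `AQI_Bucket`), the CSV input, and the sample rows are hypothetical and were chosen only to make the example self-contained.

```python
import csv
import io

# Hypothetical column names; the headers in the actual dataset may differ.
EXCLUDED_FIELDS = ("City", "Date")  # location and time stamp fields
TARGET_FIELD = "AQI_Bucket"         # air quality class (dependent variable)


def is_empty(value):
    """Treat missing or whitespace-only cells as empty."""
    return value is None or str(value).strip() == ""


def load_complete_records(csv_text):
    """Keep only rows without empty values and drop the excluded fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        if any(is_empty(value) for value in row.values()):
            continue  # remove records with any empty value
        records.append(
            {key: value for key, value in row.items() if key not in EXCLUDED_FIELDS}
        )
    return records


def split_features_and_target(records):
    """Separate the numeric pollutant readings from the air quality label."""
    features = [
        [float(value) for key, value in record.items() if key != TARGET_FIELD]
        for record in records
    ]
    targets = [record[TARGET_FIELD] for record in records]
    return features, targets


# Made-up sample rows for illustration only (not taken from the dataset).
sample = """City,Date,PM2.5,PM10,NO2,AQI_Bucket
CityA,2020-01-01,180.5,250.1,60.2,Very Poor
CityA,2020-01-02,,240.0,58.1,Poor
CityB,2020-01-01,40.3,80.7,20.5,Satisfactory
"""

records = load_complete_records(sample)
X, y = split_features_and_target(records)
print(len(records), "complete records kept")
```

In this sketch the second sample row is discarded because its PM2.5 value is empty, which mirrors the way the number of usable records shrinks after filtering.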
The preprocessed data are stored in an Excel input file and imported into Python. Libraries for performing the calculations and generating a graphical representation of the results are used. The dependent variable is the air quality with the following values: Severe, Very Poor, Poor, Moderate, Satisfactory, Good. The independent variables are:

1) PM2.5
PM is the abbreviation of particulate matter, which includes potentially harmful compounds that can reach the human respiratory system (Chaparro et al., 2020). PM2.5 refers to particles with a diameter of 2.5 µm or less.
2) PM10
3) NO
NO refers to nitric oxide.
4) NO2
NO2 refers to nitrogen dioxide.
5) NOx
NOx is the total amount of NO and NO2.
6) NH3
NH3 refers to ammonia.
7) CO
CO refers to carbon monoxide.
8) SO2
SO2 refers to sulfur dioxide.
9) O3
O3 refers to ozone.
10) Benzene
11) Toluene
12) Xylene

Analysis steps performed in this study are:

1) Dividing the dataset into training and testing data
In our implementation, the percentage of testing data is set to 20%. The data are shuffled before the division.
2) Testing the accuracy of each technique using k-fold cross-validation
The number of splits is 10. The training data are shuffled as well before performing the testing.
3) Fitting the testing data

Table 1. Machine learning techniques used in this study

Machine Learning Technique                                           Reference
adaptive boosting (AB)                                               Liu and Chen (2020)
linear classifiers with stochastic gradient descent training (SGD)  Ganesh et al. (2017)
neural network (multi-layer perceptron [a]) (NNMLP)                  Ganesh et al. (2017), Gu et al. (2020), Sun et al. (2020), Wang et al. (2017), Zhao et al. (2020)
gradient boosting (GB)                                               Zhang et al. (2019a), Zhang et al. (2019b), Liu et al. (2019), Yu et al. (2016), Feng et al. (2018)
random forest (RF)                                                   Liu et al. (2019), Yu et al. (2016), Feng et al. (2018)
k-nearest neighbors (KNN)                                            Zhao et al. (2020)
decision tree (CART)                                                 Zhang et al. (2019b)
Naive Bayes (Gaussian [a]) (NB)                                      Feng et al. (2018), Melgarejo et al. (2015)
support vector machine (C-Support Vector [a]) (SVM)                  Ganesh et al. (2017), Gu et al. (2020), Liu et al. (2019), Melgarejo et al. (2015), Dun et al. (2020)

[a] Specific variant considered in this study.

Fig. 3. A partial view of the code

Fig. 4. True positive, false positive, false negative, and true negative cases

Precision, recall, f1-score, and support metrics are measured for each technique. True positive (TP), false positive (FP), false negative (FN), and true negative (TN) cases are counted to calculate these values. The cases are defined in Fig. 4 based on the comparison between the results concluded by the test and the real data (Parikh et al., 2008). The definition and formula of precision, recall, f1-score, and support are presented in Table 2. Machine learning methods with higher scores predict the air quality better.

The result of accuracy testing using k-fold cross-validation is presented using boxplots in Fig. 5. The three techniques with the best accuracy are RF, GB, and KNN. Compared with the others, these three methods have a good average accuracy and a smaller deviation in accuracy across different training and testing datasets. The worst accuracies are obtained by the SGD and SVM methods.

Table 2. Definition and formula of precision, recall, f1-score, and support

Metric: Precision
Definition: Total number of retrieved data that are relevant / total number of retrieved data (Ting, 2011)
Formula: TP / (TP + FP) (Jiang et al., 2017)

Metric: Recall
Definition: Total number of retrieved data that are relevant / total number of relevant data in the database (Ting, 2011)
Formula: TP / (TP + FN) (Jiang et al., 2017)

Metric: F1-score
Definition: Harmonic mean of the precision and recall values, with 1 as its best value and 0 as its worst
Formula: 2 * precision * recall / (precision + recall) (Yuan et al., 2020)

Metric: Support
Definition: Number of occurrences of each class in y_true
Formula: -

Fig. 5. Accuracy comparison of the machine learning techniques

Table 3. Average value of each metric using the testing data

Machine Learning Technique  Precision  Recall  F1-score
AB                          0.36       0.52    0.41
SGD                         0.38       0.41    0.39
NNMLP                       0.73       0.64    0.66
GB                          0.82       0.77    0.79
RF                          0.81       0.76    0.78
KNN                         0.81       0.77    0.79
CART                        0.75       0.71    0.73
NB                          0.65       0.70    0.67
SVM                         0.22       0.19    0.14

Table 4. Detailed metric values of the RF method

Class         Precision  Recall  F1-score  Support
Good          0.84       0.64    0.73      67
Satisfactory  0.78       0.84    0.81      269
Moderate      0.83       0.86    0.84      383
Poor          0.75       0.69    0.72      110
Very Poor     0.81       0.81    0.81      77
Severe        0.85       0.71    0.77      24

Table 5. Detailed metric values of the GB method

Class         Precision  Recall  F1-score  Support
Good          0.85       0.66    0.74      67
Satisfactory  0.80       0.84    0.82      269
Moderate      0.84       0.86    0.85      383
Poor          0.72       0.68    0.70      110
Very Poor     0.81       0.79    0.80      77
Severe        0.90       0.79    0.84      24

Table 6. Detailed metric values of the KNN method

Class         Precision  Recall  F1-score  Support
Good          0.70       0.63    0.66      67
Satisfactory  0.76       0.81    0.78      269
Moderate      0.84       0.84    0.84      383
Poor          0.77       0.73    0.75      110
Very Poor     0.86       0.81    0.83      77
Severe        0.90       0.79    0.84      24

The fitting results of the testing data are presented in Tables 3-6.
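The metric definitions in Table 2 can be illustrated with a short, library-free sketch that counts the TP, FP, and FN cases of each class in a one-vs-rest manner. The function names (`f1_from`, `classification_metrics`, `macro_average`) and the label lists are hypothetical and made up for illustration; this is not the code used in the study, which relies on sklearn.

```python
from collections import Counter


def f1_from(precision, recall):
    """F1-score as defined in Table 2; returns 0.0 when both inputs are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(y_true, y_pred):
    """Per-class precision, recall, F1-score, and support (one-vs-rest)."""
    support = Counter(y_true)
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1_from(precision, recall),
            "support": support[label],
        }
    return report


def macro_average(report, metric):
    """Unweighted mean of one metric over all classes."""
    return sum(row[metric] for row in report.values()) / len(report)


# Made-up labels for illustration only (not results from the study).
y_true = ["Good", "Good", "Poor", "Poor", "Poor", "Severe"]
y_pred = ["Good", "Poor", "Poor", "Poor", "Severe", "Severe"]

report = classification_metrics(y_true, y_pred)
for label, row in report.items():
    print(label, row)
print("macro precision:", round(macro_average(report, "precision"), 2))
```

As a check against the reported values, `f1_from(0.84, 0.64)` rounds to 0.73, which matches the F1-score of the Good class for the RF method. The RF averages in Table 3 are also consistent with an unweighted mean of the six per-class values: for precision, (0.84 + 0.78 + 0.83 + 0.75 + 0.81 + 0.85) / 6 = 0.81.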
In Table 3, the average value of each metric calculated over all classification classes is presented. The detailed metric values for the three best techniques are presented in Tables 4-6. In these tables, evaluation is performed for each class (Severe, Very Poor, Poor, Moderate, Satisfactory, Good). It can be seen that, for a given method, the value of each metric is similar across the classes.

4. CONCLUSION

In this study, we implement several machine learning techniques to predict air quality as part of the smart city's information system. Based on the numerical experiments, random forest, gradient boosting, and k-nearest neighbors have the best accuracies. Future studies should assess whether it is necessary to include all input values in the models and consider how to deal with incomplete records.

REFERENCES

Alvear, O., Calafate, C. T., Cano, J.-C., & Manzoni, P. (2018). Crowdsensing in smart cities: Overview, platforms, and environment sensing issues. Sensors, 18(2), 460.

Bhat, N. (2020, October 1). Air quality level of different cities in India (2015-2020). Kaggle Dataset. https://www.kaggle.com/nareshbhat/air-quality-pre-and-post-covid19-pandemic

Chaparro, M. A. E., Chaparro, M. A. E., Castañeda-Miranda, A. G., Marié, D. C., Gargiulo, J. D., Lavornia, J. M., Natal, M., & Böhnel, H. N. (2020). Fine air pollution particles trapped by street tree barks: In situ magnetic biomonitoring. Environmental Pollution, 266(1), Article 115229.

Dun, M., Xu, Z., Chen, Y., & Wu, L. (2020). Short-term air quality prediction based on fractional grey linear regression and support vector machine. Mathematical Problems in Engineering, 2020(1), Article 8914501.

Feng, C., Tian, Y., Gong, X., Que, X., & Wang, W. (2018). MCS-RF: mobile crowdsensing–based air quality estimation with random forest. International Journal of Distributed Sensor Networks, 14(10), 1–15.

Ganesh, S. S., Arulmozhivarman, P., & Tatavarti, R. (2017). Forecasting air quality index using an ensemble of artificial neural networks and regression models. Journal of Intelligent Systems, 28(5), 893–903.

Gu, K., Zhou, Y., Sun, H., Zhao, L., & Liu, S. (2020). Prediction of air quality in Shenzhen based on neural network algorithm. Neural Computing and Applications, 32(7), 1879–1892.

Hu, Z., Bai, Z., Bian, K., Wang, T., & Song, L. (2019). Real-time fine-grained air quality sensing networks in smart city: Design, implementation, and optimization. IEEE Internet of Things Journal, 6(5), 7526–7542.

Jiang, C., Liu, Y., Ding, Y., Liang, K., & Duan, R. (2017). Capturing helpful reviews from social media for product quality improvement: A multi-class classification approach. International Journal of Production Research, 55(12), 3528–3541.

Liu, H., & Chen, C. (2020). Spatial air quality index prediction model based on decomposition, adaptive boosting, and three-stage feature selection: A case study in China. Journal of Cleaner Production, 265(1), Article 121777.

Liu, H., Li, Q., Yu, D., & Gu, Y. (2019). Air quality index and air pollutant concentration prediction based on machine learning algorithms. Applied Sciences, 9(19), 4069.

Marek, L., Campbell, M., & Bui, L. (2017). Shaking for innovation: The (re)building of a (smart) city in a post disaster environment. Cities, 63(1), 41–50.

Marr, B. (2020, October 3). What is digital twin technology - And why is it so important? Forbes. https://www.forbes.com/sites/bernardmarr/2017/03/06/what-is-digital-twin-technology-and-why-is-it-so-important/#3388188e2e2a

Melgarejo, M., Parra, C., & Obregón, N. (2015). Applying computational intelligence to the classification of pollution events. IEEE Latin America Transactions, 13(7), 2071–2077.

Parikh, R., Mathai, A., Parikh, S., Sekhar, G. C., & Thomas, R. (2008). Understanding and using sensitivity, specificity and predictive values. Indian Journal of Ophthalmology, 56(1), 45–50.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(1), 2825–2830.

Schürholz, D., Kubler, S., & Zaslavsky, A. (2020). Artificial intelligence-enabled context-aware air quality prediction for smart cities. Journal of Cleaner Production, 271(1), 121941.

Smart City Korea. (2020, October 3). Introducing Core 1, 2, and 3 smart city innovation growth engines. Ministry of Land, Infrastructure and Transport. https://smartcity.go.kr/en/

Sun, X., Xu, W., Jiang, H., & Wang, Q. (2020). A deep multitask learning approach for air quality prediction. Annals of Operations Research.

Ting, K. M. (2011). Precision and recall. In Encyclopedia of machine learning, 781-781. Springer, Boston, MA.

Tsinganos, K., Gerasopoulos, E., Keramitsoglou, I., Pirrone, N., & The ERA-PLANET Team. (2017). ERA-PLANET, a European network for observing our changing planet. Sustainability, 9(6), 1040.

Wang, J., Zhang, X., Guo, Z., & Lu, H. (2017). Developing an early-warning system for air quality prediction and assessment of cities in China. Expert Systems with Applications, 84(1), 102–116.

Wu, Y.-C., Shiledar, A., Li, Y.-C., Wong, J., Feng, S., Chen, X., Chen, C., Jin, K., Janamian, S., Yang, Z., Ballard, Z. C., Göröcs, Z., Feizi, A., & Ozcan, A. (2017). Air quality monitoring using mobile microscopy and machine learning. Light: Science & Applications, 6(1), 17046.

Yu, R., Yang, Y., Yang, L., Han, G., & Move, O. A. (2016). RAQ–A random forest approach for predicting air quality in urban sensing systems. Sensors, 16(1), Article 86. https://doi.org/10.3390/s16010086

Yuan, J., Zhang, L., Guo, S., Xiao, Y., & Li, Z. (2020). Image captioning with a joint attention mechanism by visual concept samples. ACM Transactions on Multimedia Computing, Communications, and Applications, 16(3), Article 83. https://doi.org/10.1145/3394955

Zhang, D., & Woo, S. S. (2020). Real time localized air quality monitoring and prediction through mobile and fixed IoT sensing network. IEEE Access, 8(1), 89584–89594. https://doi.org/10.1109/ACCESS.2020.2993547

Zhang, Y., Wang, Y., Gao, M., Ma, Q., Zhao, J., Zhang, R., Wang, Q., & Huang, L. (2019a). A predictive data feature exploration-based air quality prediction approach. IEEE Access, 7(1), 30732–30743. https://doi.org/10.1109/ACCESS.2019.2897754

Zhang, Y., Zhang, R., Ma, Q., Wang, Y., Wang, Q., Huang, Z., & Huang, L. (2019b). A feature selection and multi-model fusion-based approach of predicting air quality. ISA Transactions, 100(1), 210–220. https://doi.org/10.1016/j.isatra.2019.11.023

Zhao, Z., Qin, J., He, Z., Li, H., Yang, Y., & Zhang, R. (2020). Combining forward with recurrent neural networks for hourly air quality prediction in Northwest of China. Environmental Science and Pollution Research, 27(23), 28931–28948. https://doi.org/10.1007/s11356-020-08948-1

Zhao, X., Song, M., Liu, A., Wang, Y., Wang, T., & Cao, J. (2020). Data-driven temporal-spatial model for the prediction of AQI in Nanjing. Journal of Artificial Intelligence and Soft Computing Research, 10(4), 255–270. https://doi.org/10.2478/jaiscr-2020-0017