Knowledge Engineering and Data Science (KEDS) pISSN 2597-4602, eISSN 2597-4637
Vol 5, No 1, December 2022, pp. 53–66
https://doi.org/10.17977/um018v5i12022p53-66
©2022 Knowledge Engineering and Data Science | W: http://journal2.um.ac.id/index.php/keds | E: keds.journal@um.ac.id
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Optimized Three Deep Learning Models Based-PSO Hyperparameters for Beijing PM2.5 Prediction

Andri Pranolo a,b,1,*, Yingchi Mao a,2, Aji Prasetya Wibawa c,3, Agung Bella Putra Utama c,4, Felix Andika Dwiyanto c,5
a Department of Computer and Technology, College of Computer and Information, Hohai University, 1 Xikang Road, Nanjing, Jiangsu 211100, China
b Department of Informatics, Faculty of Industrial Technology, Universitas Ahmad Dahlan, Jl. Prof. Dr. Soepomo, S.H., Janturan, Warungboto, Umbulharjo, Yogyakarta 55164, Indonesia
c Department of Electrical Engineering, Faculty of Engineering, Universitas Negeri Malang, Jl. Semarang 5, Malang, East Java 65145, Indonesia
1 andri.pranolo@tif.uad.ac.id *; 2 maoyingchi@gmail.com; 3 aji.prasetya.ft@um.ac.id; 4 agungbpu02@gmail.com; 5 felix@ascee.org
* corresponding author

ARTICLE INFO
Article history: Received 4 August 2022; Revised 15 August 2022; Accepted 17 August 2022; Published online 7 November 2022
Keywords: Air pollution; Beijing PM2.5; Deep learning; Forecasting; Hyperparameter tuning

ABSTRACT
Deep learning is a machine learning approach that produces excellent performance in various applications, including natural language processing, image identification, and forecasting. Deep learning network performance depends on the hyperparameter settings. This research attempts to optimize the deep learning architectures of Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and Multilayer Perceptron (MLP) for forecasting tasks using Particle Swarm Optimization (PSO), a swarm-intelligence-based metaheuristic optimization methodology, yielding the proposed M-1 (PSO-LSTM), M-2 (PSO-CNN), and M-3 (PSO-MLP).
The Beijing PM2.5 dataset was analyzed to measure the performance of the proposed models. PM2.5, as the target variable, is affected by dew point, pressure, temperature, cumulated wind speed, hours of snow, and hours of rain. The deep learning network inputs consist of three different scenarios: daily, weekly, and monthly. The results show that the proposed M-1 with three hidden layers produces the best RMSE and MAPE results compared to the proposed M-2, M-3, and all the baselines. A recommendation for air pollution management could be generated by using these optimized models.

I. Introduction

In air quality monitoring systems, PM2.5 concentration is a crucial measure. As public awareness rises, analyzing and anticipating pollution levels is vital. Monitoring stations can only play a small role in PM2.5 pollution control due to the nonlinear character of PM2.5 concentrations in both time and space. As a result, improving the accuracy of PM2.5 concentration prediction is crucial for preventing and controlling air pollution.

Several studies have applied machine learning techniques, such as neural networks, to environmental science problems. Deep learning, a branch of neural networks, has recently gained attention in the machine learning field because it achieves high performance in applications such as natural language processing, visual recognition, and forecasting. In practice, machine learning models are characterized by large hyperparameter spaces and lengthy training times. These properties, combined with the growth of parallel computing and the increasing demand for production machine learning workloads, make it vital to develop mature hyperparameter optimization functionality for distributed computing environments. In most cases, automated machine learning provides more sensible advice than humans can, while the design and training of neural networks, sometimes called alchemy, remain tricky and unpredictable [1]. Therefore, hyperparameter tuning has been extensively studied to lower entry barriers for non-technical users.

A hyperparameter is a parameter that cannot be changed during machine learning training. It can be involved in the model structure, such as the hidden layers and the activation function. Two recent developments in deep learning models have made hyperparameter tuning an increasingly important technique. The first is the scaling up of neural networks to achieve greater accuracy [2], and the second is the development of intricate lightweight models to achieve greater accuracy with fewer data and parameters [3][4]. Hyperparameter tuning plays an essential role in both cases. In practice, there are more hyperparameters to tune in a model with a complex structure than in a model with a well-defined structure. Several hyperparameters of an LSTM model need to be set to improve performance, such as the number of hidden layers and neurons, the dense layer, and the weight initialization.

The first consideration is the number of nodes and hidden layers. Hidden layers are the layers between the input and output layers. There is no fixed number of hidden layers that should be used; it depends on each problem, and a trial-and-error tuning approach is common. One hidden layer will suffice for most simple problems, and two layers are recommended for more complex ones. Many nodes within a layer can improve accuracy, whereas too few nodes may result in underfitting [5].

The next consideration is the number of units in a dense layer. The dense layer is the most commonly used layer: essentially a layer in which every neuron takes all neurons of the prior densely connected layer as input. Adding units can increase accuracy, and 5–10 units or nodes per layer is an ideal starting point for dense layers. As a result, the shape of the final dense layer is determined by the number of neurons/units specified [6].

Then, a dropout layer should be present between LSTM layers; it reduces the network's sensitivity to the specific weights of individual neurons. A dropout layer can be used after the input layer but not after the output layer, since that would corrupt the model's outputs and calculations. Dropout can alleviate the risk of overfitting when complexity is added by increasing the number of nodes in dense layers or adding more dense layers, which could otherwise result in poor validation accuracy [7].
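As a rough illustration of how the choices above fit together, the sketch below stacks LSTM hidden layers with dropout between them and a small dense head in Keras. It is a minimal sketch, not the architecture used in this paper: the layer counts, unit sizes, and dense-head width are placeholder values.

```python
# Minimal sketch: stacked LSTM hidden layers with dropout and a small dense head.
# Layer counts and sizes are illustrative placeholders, not the paper's settings.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

def build_stacked_lstm(n_steps, n_features, hidden_layers=2, units=32, dropout=0.2):
    model = Sequential()
    # First LSTM layer defines the input shape; return_sequences is needed
    # whenever another recurrent layer follows.
    model.add(LSTM(units, return_sequences=(hidden_layers > 1),
                   input_shape=(n_steps, n_features)))
    model.add(Dropout(dropout))
    for i in range(1, hidden_layers):
        model.add(LSTM(units, return_sequences=(i < hidden_layers - 1)))
        model.add(Dropout(dropout))            # dropout between recurrent layers
    model.add(Dense(8, activation="relu"))     # 5-10 dense units is a common start
    model.add(Dense(1))                        # one-step forecast output
    return model
```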
In other cases, weight initialization is a hyperparameter that should be considered. Ideally, the weight initialization scheme should differ depending on the activation function; in practice, weight values are often drawn from a uniform distribution. Setting all initial weights to 0.0 is not possible, because the optimization algorithm exploits asymmetry in the error gradient. Different initial weights lead to different starting points for the optimization process, and therefore to different final weight sets with different performance characteristics [8]. Stochastic optimization assumes that weights are randomly assigned small values at the start of the search. In addition, weight decay can be included in the weight update rule: the weights are multiplied by a factor slightly less than one to limit weight growth, and an initial value of 0.97 should be sufficient as a reference.

Moreover, the output of a node is defined by its activation function, which effectively switches the node ON or OFF. Using these functions, deep learning models can learn nonlinear prediction boundaries. Although it is technically possible to include activation functions inside the dense layers, it is preferable to separate them into their own layers so that the dense layer output can still be accessed before activation. The choice of activation layer depends on the application, but the most popular activation function is the rectifier [8].

The next hyperparameter is the learning rate, which controls how quickly the network updates its parameters. Increasing the learning rate speeds up learning but may cause the model to diverge or even fail to converge; with a smaller learning rate, learning takes longer, but the model converges smoothly [9]. This hyperparameter is used in the training phase, typically with values between 0.0 and 0.1. The number of epochs (an integer) must also be specified; beyond some point the validation accuracy decreases even though the training accuracy keeps increasing, which risks overfitting. An ideal approach is to use early stopping instead of fixing the number of epochs, stopping training when performance on a held-out dataset stops improving beyond a pre-set threshold.

The last consideration in hyperparameter tuning is the batch size. This hyperparameter specifies the number of samples processed before the internal model parameters are updated. A larger batch produces more substantial gradient steps than a smaller one. A common initial batch size is 32; it can then be adjusted to multiples of 32, such as 64, 128, and 256, to determine which works better [8].
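The fragment below is a hedged sketch of how these training-side hyperparameters (learning rate, epochs with early stopping, batch size) are typically wired together in Keras. It assumes a model like the one sketched earlier and already-prepared training and validation arrays (x_train, y_train, x_val, y_val); the learning rate, patience, and batch size are example values only.

```python
# Example training configuration; all numeric values are illustrative.
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

model.compile(optimizer=Adam(learning_rate=0.001),  # learning rate in (0.0, 0.1]
              loss="mse")

early_stop = EarlyStopping(monitor="val_loss",      # watch held-out performance
                           patience=10,              # epochs to wait before stopping
                           restore_best_weights=True)

history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=100,      # upper bound; early stopping usually ends sooner
                    batch_size=32,   # try multiples of 32: 64, 128, 256
                    callbacks=[early_stop])
```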
The research reveals that the PSO-optimized deep learning models (LSTM, CNN, and MLP) for Beijing PM2.5 multivariate time series prediction achieve minimum error and improved accuracy. The seven tuned hyperparameters are the optimizer, the type of activation function, the loss function, the batch size, the number of hidden layers, the number of neurons, and the number of epochs. The contributions of the research are: 1) To improve the accuracy of the multivariate time-series forecasting analysis applied to the Beijing PM2.5 dataset using the proposed models M-1 (PSO-LSTM), M-2 (PSO-CNN), and M-3 (PSO-MLP). 2) To generate a computer-based forecasting model that could serve as a recommendation for governmental regulations such as pollution prevention, a Clean Air Technology Center, and transportation-emissions reduction. This research may also present an alternative use of PSO, as a hyperparameter tuner for deep learning rather than as a feature selector. The automatic tuning process may reduce the computational time otherwise spent on random parameter selection. Finally, this paper determines the best optimized deep learning approaches to predict Beijing PM2.5 concentrations.

II. Method

The proposed hyperparameter tuning of deep learning for forecasting is shown in Figure 1. As shown, the selected dataset is first preprocessed using normalization. Hyperparameter selection is then carried out using PSO, and the best-selected hyperparameter values are used in forecasting. The forecasting itself is performed by a deep learning method, namely LSTM, CNN, or MLP. In the end, the performance of the proposed models and the baselines is tested using MAPE and RMSE.

Fig. 1. The proposed hyperparameter tuning of deep learning for forecasting

A. Dataset

In this study, an evaluation of the PSO-based hyperparameter settings of the deep learning methods was carried out using the Beijing PM2.5 dataset obtained from the UCI machine learning repository [10]. This dataset contains the weather conditions and pollution levels reported hourly by the US Embassy in Beijing, China, from 2010 to 2014; it comprises 43,825 instances, and 2,068 rows with missing values were removed during data preprocessing. Preprocessing is the initial treatment of the dataset to improve data quality and selection in order to obtain high-performance results. The preprocessing steps used are feature selection and data normalization. The feature selection process selects the attributes to be used, following a similar study conducted by Zhang [11], using seven attribute features: PM2.5 concentration (pm2.5), dew point (DEWP), temperature (TEMP), pressure (PRES), accumulated wind speed (Iws), hourly snow accumulation (Is), and hourly rain accumulation (Ir), as shown in Figure 2.

Fig. 2. Visualization of the Beijing PM2.5 dataset

Normalization is a technique for reducing errors by converting real numbers into the value range 0 to 1. The min-max scaling approach is used for normalization [12], as presented in (1):

x' = (x - x_min) / (x_max - x_min)   (1)

where x' is the normalization result, x represents the data to be normalized, and x_min and x_max are the minimum and maximum values of the entire data. In this study, three scenarios of the dataset were used as testing data: monthly, weekly, and daily.
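A minimal sketch of the min-max normalization in (1) is shown below, assuming the data has been loaded into a pandas DataFrame. The file name "beijing_pm25.csv" is a hypothetical placeholder, not the actual UCI file name; the column names follow the UCI Beijing PM2.5 attribute codes.

```python
# Min-max normalization as in Equation (1); file name is a placeholder.
import pandas as pd

def min_max_scale(series: pd.Series) -> pd.Series:
    """Map a column into the [0, 1] range: (x - x_min) / (x_max - x_min)."""
    return (series - series.min()) / (series.max() - series.min())

df = pd.read_csv("beijing_pm25.csv").dropna(subset=["pm2.5"])  # drop rows missing the target
features = ["pm2.5", "DEWP", "TEMP", "PRES", "Iws", "Is", "Ir"]
df[features] = df[features].apply(min_max_scale)               # scale each selected feature
```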
B. Hyperparameter Optimization using PSO

Developing an efficient machine learning model is a complex process that requires selecting a suitable algorithm and modifying the model's hyperparameters [13]. The primary goal of hyperparameter optimization is to simplify the selection of parameters so that the process yields optimal results and users can implement efficient machine learning models to solve practical issues [14]. The hyperparameter optimization process predicts the best machine learning (ML) architecture [15]. It decreases the amount of human work necessary, enhances the performance of machine learning models, and increases their reproducibility.

Particle swarm optimization (PSO) is a swarm optimization model that can be used to select hyperparameters; in this research it is integrated with the baseline deep learning models. PSO belongs to the family of evolutionary algorithms frequently used to solve optimization problems and has been effectively applied as a parameter optimization technique [16]. PSO takes its inspiration from biological populations that exhibit individual and social behavior. It works by allowing a swarm of particles to navigate a semi-random search space; through integrated information sharing between individual particles in a group, PSO determines the optimal solution.

In PSO, a swarm S consists of a group of particles [17], as in (2), and each particle p_i is represented by a vector, as in (3):

S = {p_1, p_2, ..., p_n}   (2)

p_i = <x_i, v_i, pb_i>   (3)

where x_i denotes the current position, v_i denotes the current velocity, and pb_i denotes the particle's best-known position. After each particle's position and velocity are initialized, its current position and record are evaluated with a performance score. Each following iteration modifies the velocity v_i of each particle according to the current global optimal position gb and the prior best position pb_i, as in (4):

v_i := v_i + U(0, c_1)(pb_i - x_i) + U(0, c_2)(gb - x_i)   (4)

where U(0, c_1) and U(0, c_2) denote continuous uniform distributions based on the acceleration constants c_1 and c_2. Equation (5) states that the particles then move following their new velocity vectors:

x_i := x_i + v_i   (5)

The technique outlined above is repeated until convergence or termination constraints are met. The computational complexity of the PSO algorithm is analyzed in [18]. Additionally, this approach can be parallelized to increase model efficiency, because PSO particles act independently and share information only after each iteration. PSO's primary restriction is that it requires adequate population initialization; otherwise it may reach a local rather than a global optimum for discrete hyperparameters [19]. Appropriate population initialization can rely on dedicated initialization techniques or on the developer's experience. Numerous population initialization strategies, such as the opposition-based optimization algorithm [20] and the space transformation search approach [21], have been developed to increase the performance of evolutionary algorithms. Thus, execution time and resource usage can be improved by applying an extra population initialization strategy.

Through hyperparameter selection, PSO can find good values for deep learning (DL) models. DL is based on artificial neural network (ANN) theory. Multilayer perceptrons (MLP), convolutional neural networks (CNN), recurrent neural networks (RNN), deep neural networks (DNN), and long short-term memory (LSTM) are modifications of the standard ANN for deep learning designs [22]. The DL hyperparameters that PSO can optimize include the optimizer, activation function, loss function, batch size, number of neurons, and number of epochs. Hyperparameter tuning with PSO can be carried out by applying the 'particle swarm' search to the objective function that builds and evaluates the model in the TensorFlow Keras package. The PSO parameters used consist of 10 particles in the swarm, 5 generations (iterations), a minimum velocity of 0, a maximum velocity of 1, acceleration constants of 1.5 and 2.0, and 10 permitted function evaluations. The hyperparameter space searched by PSO and retested using the deep learning methods is shown in Table 1; a dropout value of 0.2 was applied throughout. The tuned parameters are those shared by all the deep learning methods in general.

Table 1. Deep learning method hyperparameter space
No.  Hyperparameter        Search space             Type
1.   Hidden layers (HL)    [2, 10]                  Continuous
2.   Neurons               [1, 100]                 Continuous
3.   Activation function   Linear, Sigmoid, ReLU    Discrete with step = 1
4.   Loss function         MSE, MAE                 Discrete with step = 1
5.   Optimizer             Adam, RMSprop            Discrete with step = 1
6.   Batch size            [32, 64, 128]            Discrete with step = 1
7.   Epoch                 [5, 100]                 Continuous
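The toy NumPy sketch below implements the velocity and position updates of (4) and (5) for a generic objective. It is a sketch under stated assumptions, not the paper's implementation: the fitness function, bounds, and the mapping from continuous particle positions to the discrete entries of Table 1 (e.g., rounding to the nearest option) are left to the caller.

```python
# Toy PSO minimizer following Eq. (4) and (5); fitness() and bounds are placeholders.
import numpy as np

def pso_minimize(fitness, dim, n_particles=10, iters=5, c1=1.5, c2=2.0,
                 v_min=-1.0, v_max=1.0):
    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, (n_particles, dim))          # particle positions
    v = rng.uniform(v_min, v_max, (n_particles, dim))      # particle velocities
    pb = x.copy()                                           # personal best positions
    pb_val = np.array([fitness(p) for p in x])
    gb = pb[pb_val.argmin()].copy()                         # global best position
    for _ in range(iters):
        r1 = rng.uniform(0.0, c1, (n_particles, dim))       # U(0, c1)
        r2 = rng.uniform(0.0, c2, (n_particles, dim))       # U(0, c2)
        v = np.clip(v + r1 * (pb - x) + r2 * (gb - x), v_min, v_max)  # Eq. (4)
        x = x + v                                                       # Eq. (5)
        val = np.array([fitness(p) for p in x])
        improved = val < pb_val
        pb[improved], pb_val[improved] = x[improved], val[improved]
        gb = pb[pb_val.argmin()].copy()
    return gb, pb_val.min()
```

In a hyperparameter-tuning setting, fitness() would decode a particle position into one configuration from Table 1, train the corresponding model, and return its validation error.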
C. Multilayer Perceptron (MLP)

MLP is a forecasting method often used in research [23] and belongs to the feedforward network family. The characteristics of MLP are that it determines weight values better than other methods, can be used without prior knowledge, can be implemented quickly, and can solve both linear and nonlinear problems [24]; these characteristics make the forecasting values better. MLP has been used in forecasting for time series [25] and stock prices [26][27]. As illustrated in Figure 3, the MLP model architecture consists of three layers of nodes: an input layer, a hidden layer, and an output layer. Each layer is connected within the network architecture: the nodes in the input layer are connected to the nodes in the hidden layer, and the hidden layer's nodes are directly connected to the output layer's nodes. The elements of a multilayer perceptron consist of the network architecture, the learning algorithm, and the activation functions [28].

Fig. 3. MLP architecture

The activation of a hidden neuron h_j can be defined as in (6):

h_j = f( Σ_i w_ij x_i )   (6)

where h_j is the j-th hidden neuron, f denotes a link function that adds non-linearity to the relationship between the input and hidden layers, w_ij denotes the weight of input i in the weight matrix, and x_i represents an input value. The output value y is computed as in (7):

y = g( Σ_j w_j h_j )   (7)
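A minimal NumPy illustration of (6) and (7) follows: one hidden layer with a sigmoid link function and a linear output. The shapes and the random weights are arbitrary examples, not the trained parameters of any model in this paper.

```python
# Forward pass of a one-hidden-layer MLP, matching Eq. (6) and (7).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W_hidden, W_out):
    h = sigmoid(x @ W_hidden)   # Eq. (6): h_j = f(sum_i w_ij * x_i)
    y = h @ W_out               # Eq. (7): output from weighted hidden activations
    return y

x = np.array([0.2, 0.5, 0.1])        # 3 input features
W_hidden = np.random.rand(3, 4)      # 3 inputs -> 4 hidden neurons
W_out = np.random.rand(4, 1)         # 4 hidden neurons -> 1 output
print(mlp_forward(x, W_hidden, W_out))
```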
D. Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) was developed from the recurrent neural network (RNN) and can be applied to solve accuracy problems in time-series data prediction. LSTM can handle long-term dependencies in its inputs [29] and creates RNN architectures capable of resolving learning challenges associated with information linkage. In an RNN, the old memory becomes increasingly ineffective as new memory overwrites it [30]. RNNs also suffer from vanishing and exploding gradients, which occur when the range of values changes across layers in the architecture. The LSTM was designed to address the RNN's vanishing and exploding gradient problem [31]. LSTM has been used for time-series predictions [32], both short-term loads [33] and long-term forecasts [34], weather predictions [35], and price movements [36][37][38][39].

The LSTM uses memory cells and gate units to manage memory at each input, with an architecture similar to the RNN. In LSTM, the hidden layer comprises memory cells with three gates: input, forget, and output, as illustrated in Figure 4. The input gate specifies the amount of data stored in the cell state and keeps the cell from holding extraneous data. The forget gate limits how long a value remains in a memory cell. The output gate determines how much of the value stored in a memory cell is used to calculate the output. In the LSTM, a gate is a special network structure with an input vector and an output in the interval [0, 1]: no information is permitted to flow when the output is 0, and all information is permitted to pass when it is 1 [40].

Fig. 4. LSTM memory cell

If the input vector x_t and the previous output vector h_(t-1) are defined, a gate can be formulated as in (8):

g_t = σ(W [h_(t-1), x_t] + b)   (8)

where σ is the sigmoid function, W denotes the weights, and b denotes the bias vector. The cell state, which represents the current condition of the cell, is determined as in (9):

C_t = f_t ⊙ C_(t-1) + i_t ⊙ tanh(W_c [h_(t-1), x_t] + b_c)   (9)

where W_c denotes the weight matrix of the cell state, b_c denotes the bias vector of the cell state, i_t is the input gate, and f_t is the forget gate used to help the network forget input information and reset memory cells. The input and forget gates can be computed using formulas (10) and (11):

i_t = σ(W_i [h_(t-1), x_t] + b_i)   (10)

f_t = σ(W_f [h_(t-1), x_t] + b_f)   (11)

where W_i and W_f denote the weights of the input and forget gates, respectively, while b_i denotes the bias vector of the input gate and b_f denotes the bias vector of the forget gate. The output gate of the LSTM regulates the amount of information from the latest cell state that is passed to the output. The output gate can be estimated using the formula in (12):

o_t = σ(W_o [h_(t-1), x_t] + b_o)   (12)

where W_o denotes the output gate weight matrix and b_o is the output gate bias vector. The ultimate output of the LSTM process is computed as in (13):

h_t = o_t ⊙ tanh(C_t)   (13)

The output h_t is then used to forecast the following chosen time step.

E. Convolutional Neural Network (CNN)

CNN is part of the DL approach, a sub-field of ML, and applies the basic concepts of the ANN algorithm with more layers [41]. CNN is a feedforward network because information flows in one direction only, from the inputs to the outputs. CNN became extremely popular in image classification research; it can also be implemented for one-dimensional (1D) problems, such as forecasting the next values in a time-series dataset [42]. The model used here is a 1D CNN with the architecture shown in Figure 5. Many types of CNN models can be used for each problem in predicting time-series data: univariate, multivariate, multi-step, and multivariate multi-step [43]. CNN-based time-series forecasting is often used to estimate stock prices [44][45], gold prices [46][47][48], health outcomes [49][50][51], general time series [52][53][54], and solar cells and weather forecasts [55].

Fig. 5. 1D CNN architecture

F. Evaluation

The mean absolute percentage error (MAPE) and the root mean square error (RMSE) [56] were used as error evaluation metrics to evaluate and compare the performance of the implemented methods. MAPE expresses errors in a way that represents accuracy, while RMSE detects irregularities or outliers in the designed projection system. The formulas are given in (14) and (15); the model with the smallest MAPE and RMSE values produces the best forecasting results, and therefore the better method [57].

MAPE = (100% / n) Σ_(t=1..n) |(A_t - F_t) / A_t|   (14)

RMSE = sqrt( (1 / n) Σ_(t=1..n) (A_t - F_t)^2 )   (15)

where A_t is the actual value, F_t is the forecast value, and n is the number of observations.
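Direct implementations of (14) and (15) are sketched below. They assume the actual values are non-zero (true here after min-max scaling of strictly positive concentrations) so the MAPE division is safe.

```python
# MAPE (Eq. 14) and RMSE (Eq. 15) between actual and forecast series.
import numpy as np

def mape(actual, forecast):
    actual, forecast = np.asarray(actual, dtype=float), np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

def rmse(actual, forecast):
    actual, forecast = np.asarray(actual, dtype=float), np.asarray(forecast, dtype=float)
    return np.sqrt(np.mean((actual - forecast) ** 2))
```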
III. Results and Discussion

The original deep learning (LSTM, CNN, and MLP) architectures consist of 7 input nodes, 2 to 10 hidden layers (HL), and 1 output layer with the same parameter settings. The parameters are 32 neurons, a dropout of 0.2, MSE as the loss function, the Adam optimizer, 100 epochs, and a batch size of 72. Unlike LSTM and MLP, CNN uses these parameters in its fully connected layer. The specific CNN architecture uses a 1D convolution layer with a kernel size of 2, 64 filters, ReLU as the activation function, a MaxPooling1D pooling layer of size 1, and a dropout of 0.2, followed by one flatten layer and a fully connected layer. The PSO tuning results were then retested using each deep learning method with the settings shown in Table 2. PSO hyperparameter tuning was integrated with the deep learning models (LSTM, CNN, and MLP) to produce the new models Proposed M-1 (PSO-LSTM), M-2 (PSO-CNN), and M-3 (PSO-MLP). MAPE and RMSE measured the performance of the proposed models and their comparison with the baselines, as shown in Table 3 and Table 4, respectively.

Table 2. PSO hyperparameter search results for each deep learning method
No.  Hyperparameter        Proposed M-1   Proposed M-2   Proposed M-3
1.   Hidden layers (HL)    3              4              3
2.   Neurons               24             41             61
3.   Activation function   Sigmoid        ReLU           Linear
4.   Loss function         MSE            MAE            MSE
5.   Optimizer             Adam           RMSprop        RMSprop
6.   Batch size            32             32             64
7.   Epoch                 46             60             68

Table 3. MAPE forecasting results
Monthly
Model            HL-2     HL-3     HL-4     HL-5     HL-6     HL-7      HL-8      HL-9      HL-10
LSTM             9.1216   8.8909   9.1385   9.1935   9.2448   9.2544    9.2612    9.2711    9.3865
CNN              8.6255   8.6195   8.5849   8.9762   9.1778   10.3037   10.7662   10.8264   11.1073
MLP              9.3308   9.2286   9.5347   9.6395   10.6010  10.6280   10.6702   10.6008   10.6035
Proposed M-1*    -        8.4576   -        -        -        -         -         -         -
Proposed M-2**   -        -        8.5281   -        -        -         -         -         -
Proposed M-3*    -        9.0930   -        -        -        -         -         -         -
Weekly
Model            HL-2     HL-3     HL-4     HL-5     HL-6     HL-7      HL-8      HL-9      HL-10
LSTM             8.9777   8.8327   10.1041  10.2722  10.3538  11.5553   11.5812   11.5852   11.5940
CNN              9.8238   8.9021   8.8092   8.9096   9.1951   10.2191   11.7623   12.7759   13.3261
MLP              9.9057   9.7078   10.0382  11.6556  11.6180  11.6290   11.6234   11.6228   11.6118
Proposed M-1*    -        8.6379   -        -        -        -         -         -         -
Proposed M-2**   -        -        8.6987   -        -        -         -         -         -
Proposed M-3*    -        9.2903   -        -        -        -         -         -         -
Daily
Model            HL-2     HL-3     HL-4     HL-5     HL-6     HL-7      HL-8      HL-9      HL-10
LSTM             5.5329   5.5306   5.5343   5.5351   7.7324   8.7756    9.0327    10.1076   10.4688
CNN              6.7490   6.9270   6.4845   6.8979   6.9833   6.8275    6.8986    6.8067    8.9088
MLP              6.4448   6.2857   7.4463   8.5707   8.5792   8.5765    8.5684    8.5739    8.5703
Proposed M-1*    -        5.4676   -        -        -        -         -         -         -
Proposed M-2**   -        -        6.3742   -        -        -         -         -         -
Proposed M-3*    -        6.0990   -        -        -        -         -         -         -
* the best selected parameter was hidden layer 3 (HL-3)
** the best selected parameter was hidden layer 4 (HL-4)

Table 4. RMSE forecasting results
Monthly
Model            HL-2     HL-3     HL-4     HL-5     HL-6     HL-7      HL-8      HL-9      HL-10
LSTM             0.0260   0.0257   0.0263   0.0265   0.0270   0.0952    0.0952    0.0952    0.0953
CNN              0.0362   0.0357   0.0351   0.0369   0.0429   0.0636    0.0668    0.0764    0.0773
MLP              0.0263   0.0262   0.0265   0.0266   0.0945   0.0944    0.0944    0.0945    0.0945
Proposed M-1*    -        0.0250   -        -        -        -         -         -         -
Proposed M-2**   -        -        0.0346   -        -        -         -         -         -
Proposed M-3*    -        0.0259   -        -        -        -         -         -         -
Weekly
Model            HL-2     HL-3     HL-4     HL-5     HL-6     HL-7      HL-8      HL-9      HL-10
LSTM             0.0299   0.0297   0.0302   0.0303   0.0311   0.1182    0.1183    0.1183    0.1183
CNN              0.0523   0.0437   0.0412   0.0475   0.0497   0.0556    0.0927    0.1019    0.1092
MLP              0.0304   0.0302   0.0310   0.1185   0.1184   0.1184    0.1184    0.1184    0.1184
Proposed M-1*    -        0.0232   -        -        -        -         -         -         -
Proposed M-2**   -        -        0.0362   -        -        -         -         -         -
Proposed M-3*    -        0.0301   -        -        -        -         -         -         -
Daily
Model            HL-2     HL-3     HL-4     HL-5     HL-6     HL-7      HL-8      HL-9      HL-10
LSTM             0.0041   0.0039   0.0043   0.0049   0.0091   0.0844    0.0845    0.0848    0.0852
CNN              0.0192   0.0172   0.0101   0.0188   0.0157   0.0178    0.0168    0.0181    0.0241
MLP              0.0056   0.0049   0.0109   0.0785   0.0772   0.0776    0.0788    0.0780    0.0786
Proposed M-1*    -        0.0023   -        -        -        -         -         -         -
Proposed M-2**   -        -        0.0031   -        -        -         -         -         -
Proposed M-3*    -        0.0031   -        -        -        -         -         -         -
* the best selected parameter was hidden layer 3 (HL-3)
** the best selected parameter was hidden layer 4 (HL-4)

In general, all proposed models have better accuracy for the monthly, weekly, and daily scenarios, as indicated by the minimum MAPE (Table 3) and RMSE (Table 4) values obtained by the three proposed models compared to the other models. More specifically, in the monthly scenario, Proposed M-1 has the best performance of the three proposed models, followed by M-2 and M-3, with MAPE values of 8.4576, 8.5281, and 9.0930, respectively. In addition, the RMSE values for the three proposed models are 0.0250, 0.0346, and 0.0259, respectively. The same pattern occurred in the weekly and daily scenarios. When sorted by scenario, the three proposed models performed best in the daily scenario, followed by the weekly and monthly scenarios. The larger amount of data and the tighter spread of values (fewer extreme outliers) in the dataset contributed to the proposed models' performance.

Proposed M-1 (PSO-LSTM) also reduces the resulting RMSE and MAPE values below those of the baseline LSTM. The tuning results for M-2 (PSO-CNN) give better RMSE and MAPE values than CNN with four hidden layers (HL-4). As for proposed M-3 (PSO-MLP), using HL-3 gives a better evaluation value than MLP.

From the overall results in Table 3 and Table 4, the best results are visualized in Figure 6 and Figure 7. Figure 6 demonstrates that, compared to all other models, the proposed models have the best MAPE value in every scenario. In the monthly scenario, proposed M-1 outperforms the plain LSTM, CNN, and MLP with a MAPE of 8.4576. In the weekly scenario, proposed M-1 again has a superior MAPE, with a score of 8.6379. The MAPE of proposed M-1 in the daily scenario was 5.4676, likewise better and more effective than the other techniques.

Fig. 6. Comparison of MAPE in all scenarios

Figure 7 shows that every proposed model has the best RMSE in every scenario. Compared to the other models, the monthly scenario's RMSE of 0.0250, which belongs to proposed M-1, is the best value. The best RMSE of proposed M-1 in the weekly scenario is 0.0232, which is lower than the RMSE of the other models. For the daily data, the RMSE ranges from 0.0023 (proposed M-1) to 0.0039 (LSTM), 0.0101 (CNN), and 0.0049 (MLP).

Fig. 7. Comparison of RMSE in all scenarios
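To make the tuned configuration concrete, the hedged sketch below shows one way the Table 2 values for Proposed M-1 (PSO-LSTM) could map onto a Keras model: 3 hidden layers, 24 neurons, sigmoid activation, MSE loss, the Adam optimizer, a batch size of 32, and 46 epochs. The input shaping, data handling, and the exact placement of the activation are assumptions, not details reported in the paper.

```python
# Hedged sketch of Proposed M-1 using the Table 2 hyperparameters; data handling assumed.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

def build_proposed_m1(n_steps, n_features):
    model = Sequential()
    model.add(LSTM(24, activation="sigmoid", return_sequences=True,
                   input_shape=(n_steps, n_features)))      # hidden layer 1
    model.add(Dropout(0.2))
    model.add(LSTM(24, activation="sigmoid", return_sequences=True))  # hidden layer 2
    model.add(Dropout(0.2))
    model.add(LSTM(24, activation="sigmoid"))                # hidden layer 3
    model.add(Dropout(0.2))
    model.add(Dense(1))                                      # PM2.5 forecast
    model.compile(optimizer="adam", loss="mse")              # Adam optimizer, MSE loss
    return model

# model.fit(x_train, y_train, epochs=46, batch_size=32)      # epoch 46, batch size 32
```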
Overall, it can be seen that the PSO hyperparameter tuning in this case study can improve the baseline models' performance. The RMSE and MAPE evaluation values of M-1 are the best in all scenarios (monthly, weekly, and daily) compared to the other proposed models and the baselines.

The government may use these research findings as a reference for its regulations. The first regulation covers pollution-prevention approaches aiming to minimize, remove, and avoid pollution: the government promotes the use of less hazardous raw resources or fuels, less toxic industrial operations, and increased process efficiency. The second policy is to establish the Clean Air Technology Center, which would provide information on technologies for preventing and controlling air pollution, including mechanical collectors, fabric filtration, combustion systems, wet scrubbers, and biological degradation, along with their use, cost, and effectiveness. The third regulation reduces transportation-related emissions by requiring car emission controls and cleaner fuels. Finally, economic incentives for air pollution control agencies, such as emissions banking and trading, can be created.

IV. Conclusion

This paper proposed improved deep learning approaches based on PSO hyperparameter tuning to select the best parameters. The experiments show that all proposed models outperformed the baseline models. Proposed M-1 (PSO-LSTM) performed best, outperforming the other proposed models, M-2 (PSO-CNN) and M-3 (PSO-MLP), and the baseline models LSTM, CNN, and MLP. Governmental regulations such as pollution prevention, a Clean Air Technology Center, and transportation-emissions reduction could be informed by this promising finding. The proposed models in this study show good performance that, so far, applies only to the dataset used.
Therefore, future research will use various datasets to produce a model that is generally applicable to all time-series datasets.

Acknowledgment
The authors are grateful for the support provided by the Chinese Government Scholarship (CGS), which contributed funding to conduct this research. The authors also thank Hohai University, Universitas Ahmad Dahlan, and Universitas Negeri Malang, which contributed laboratory facilities.

Declarations
Author contribution. All authors contributed equally as the main contributors of this paper. All authors read and approved the final paper.
Funding statement. This work is supported by the Chinese Government Scholarship (CGS) received by the corresponding author with CSC Number 2018GBJ006341 and by Universitas Ahmad Dahlan under grant number PD-226/SP3/LPPM-UAD/VII/2022.
Conflict of interest. The authors declare no known conflicts of financial interest or personal relationships that could have appeared to influence the work reported in this paper.
Additional information. Reprints and permission information are available at http://journal2.um.ac.id/index.php/keds. Publisher's Note: Department of Electrical Engineering - Universitas Negeri Malang remains neutral with regard to jurisdictional claims and institutional affiliations.

References
[1] T. Yu and H. Zhu, "Hyper-Parameter Optimization: A Review of Algorithms and Applications," arXiv preprint arXiv:2003.05689, Mar. 2020.
[2] M. Tan and Q. V. Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," arXiv preprint arXiv:1905.11946, May 2019.
[3] N. Ma, X. Zhang, H. T. Zheng, and J. Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 116–131.
[4] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[5] X. Zhang, X. Chen, L. Yao, C. Ge, and M. Dong, "Deep Neural Network Hyperparameter Optimization with Orthogonal Array Tuning," in Neural Information Processing, T. Gedeon, K. Wong, and M. Lee, Eds. Springer, 2019, pp. 287–295.
[6] N. Gorgolis, I. Hatzilygeroudis, Z. Istenes, and L. G. Gyenne, "Hyperparameter Optimization of LSTM Network Models through Genetic Algorithm," in 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA), Jul. 2019, pp. 1–4.
[7] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, Jul. 2012.
[8] A. Farzad, H. Mashayekhi, and H. Hassanpour, "A comparative performance analysis of different activation functions in LSTM networks for classification," Neural Comput. Appl., vol. 31, no. 7, pp. 2507–2521, Jul. 2019.
[9] M. D. Zeiler, "ADADELTA: An Adaptive Learning Rate Method," arXiv preprint arXiv:1212.5701, Dec. 2012.
[10] X. Liang et al., "Assessing Beijing's PM2.5 pollution: Severity, weather impact, APEC and winter heating," Proc. R. Soc. A Math. Phys. Eng. Sci., vol. 471, no. 2182, 2015.
[11] M. Zhang, D. Wu, and R. Xue, "Hourly prediction of PM2.5 concentration in Beijing based on Bi-LSTM neural network," Multimed. Tools Appl., vol. 80, no. 16, pp. 24455–24468, 2021.
[12] S. E. Buttrey, "Data Mining Algorithms Explained Using R," J. Stat. Softw., vol. 66, Book Review 2, 2015.
[13] R. Elshawi, M. Maher, and S. Sakr, "Automated Machine Learning: State-of-The-Art and Open Challenges," arXiv preprint arXiv:1906.02287, Jun. 2019.
[14] L. Yang and A. Shami, "On hyperparameter optimization of machine learning algorithms: Theory and practice," Neurocomputing, vol. 415, pp. 295–316, Nov. 2020.
[15] F. Hutter, L. Kotthoff, and J. Vanschoren, Automated Machine Learning. Cham: Springer International Publishing, 2019.
[16] N. Xue, I. Triguero, G. P. Figueredo, and D. Landa-Silva, "Evolving Deep CNN-LSTMs for Inventory Time Series Prediction," in 2019 IEEE Congress on Evolutionary Computation (CEC), 2019, pp. 1517–1524.
[17] M.-A. Zöller and M. F. Huber, "Benchmark and Survey of Automated Machine Learning Frameworks," arXiv preprint arXiv:1904.12054, Apr. 2019.
[18] X.-H. Yan, F.-Z. He, and Y.-L. Chen, "A Novel Hardware/Software Partitioning Method Based on Position Disturbed Particle Swarm Optimization with Invasive Weed Optimization," J. Comput. Sci. Technol., vol. 32, no. 2, pp. 340–355, Mar. 2017.
[19] M.-Y. Cheng, K.-Y. Huang, and M. Hutomo, "Multiobjective Dynamic-Guiding PSO for Optimizing Work Shift Schedules," J. Constr. Eng. Manag., vol. 144, no. 9, p. 04018089, Sep. 2018.
[20] S. Rahnamayan, H. R. Tizhoosh, and M. M. A. Salama, "A novel population initialization method for accelerating evolutionary algorithms," Comput. Math. with Appl., vol. 53, no. 10, pp. 1605–1614, May 2007.
[21] H. Wang, Z. Wu, J. Wang, X. Dong, S. Yu, and C. Chen, "A New Population Initialization Method Based on Space Transformation Search," in 2009 Fifth International Conference on Natural Computation, 2009, pp. 332–336.
[22] M. Hiransha, E. A. Gopalakrishnan, V. K. Menon, and K. P. Soman, "NSE Stock Market Prediction Using Deep-Learning Models," Procedia Computer Science, vol. 132, pp. 1351–1362, 2018.
[23] Y. S. Park and S. Lek, Artificial Neural Networks: Multilayer Perceptron for Ecological Modeling, vol. 28. Elsevier, 2016.
[24] T. Marwala, "Multi-layer Perceptron," in Handbook of Machine Learning, 2018, pp. 23–42.
[25] J. Gamboa, "Deep Learning for Time-Series Analysis," arXiv preprint arXiv:1701.01887, 2017.
[26] P. Gao, R. Zhang, and X. Yang, "The application of stock index price prediction with neural network," Math. Comput. Appl., vol. 25, no. 3, 2020.
[27] W. Lu, J. Li, Y. Li, A. Sun, and J. Wang, "A CNN-LSTM-based model to forecast stock prices," Complexity, vol. 2020, 2020.
[28] J. M. Nazzal, I. M. El-Emary, and S. A. Najim, "Multilayer Perceptron Neural Network (MLPs) For Analyzing the Properties of Jordan Oil Shale," World Appl. Sci. J., vol. 5, no. 5, pp. 546–552, 2008.
[29] G. Van Houdt, C. Mosquera, and G. Nápoles, "A review on the long short-term memory model," Artif. Intell. Rev., vol. 53, no. 8, pp. 5929–5955, Dec. 2020.
[30] Ferdiansyah, S. H. Othman, R. Zahilah Raja Md Radzi, D. Stiawan, Y. Sazaki, and U. Ependi, "A LSTM-Method for Bitcoin Price Prediction: A Case Study Yahoo Finance Stock Market," in 2019 3rd International Conference on Electrical Engineering and Computer Science (ICECOS), 2019, pp. 206–210.
[31] M. Lechner and R. Hasani, "Learning Long-Term Dependencies in Irregularly-Sampled Time Series," arXiv preprint arXiv:2006.04418, 2020.
[32] H. Wang, Z. Yang, Q. Yu, T. Hong, and X. Lin, "Online reliability time series prediction via convolutional neural network and long short term memory for service-oriented systems," Knowledge-Based Syst., vol. 159, pp. 132–147, 2018.
[33] J. Lu, Q. Zhang, Z. Yang, and M. Tu, "A hybrid model based on convolutional neural network and long short-term memory for short-term load forecasting," in 2019 IEEE Power & Energy Society General Meeting, 2019.
[34] A. K. Jain, C. Grumber, P. Gelhausen, I. Häring, and A. Stolz, "A Toy Model Study for Long-Term Terror Event Time Series Prediction with CNN," Eur. J. Secur. Res., vol. 5, no. 2, pp. 289–309, 2020.
[35] S. S. Baek, J. Pyo, and J. A. Chun, "Prediction of water level and water quality using a CNN-LSTM combined deep learning approach," Water, vol. 12, no. 12, 2020.
[36] S. Selvin, R. Vinayakumar, E. A. Gopalakrishnan, V. K. Menon, and K. P. Soman, "Stock price prediction using LSTM, RNN and CNN-sliding window model," in 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2017, pp. 1643–1647.
[37] C. Yang, J. Zhai, G. Tao, and P. Haajek, "Deep Learning for Price Movement Prediction Using Convolutional Neural Network and Long Short-Term Memory," Math. Probl. Eng., vol. 2020, 2020.
[38] S. Mehtab and J. Sen, "Stock Price Prediction Using CNN and LSTM-Based Deep Learning Models," in 2020 International Conference on Decision Aid Sciences and Application (DASA), 2020, pp. 447–453.
[39] J. M. T. Wu, Z. Li, N. Herencsar, B. Vo, and J. C. W. Lin, "A graph-based CNN-LSTM stock price prediction algorithm with leading indicators," Multimed. Syst., 2021.
[40] A. J. Dautel, W. K. Härdle, S. Lessmann, and H.-V. Seow, "Forex exchange rate forecasting using deep recurrent neural networks," Digit. Financ., vol. 2, no. 1, pp. 69–96, 2020.
[41] A. S. Lundervold and A. Lundervold, "An overview of deep learning in medical imaging focusing on MRI," Z. Med. Phys., vol. 29, no. 2, pp. 102–127, May 2019.
[42] E. Lewinson, Python for Finance Cookbook: Over 50 recipes for applying modern Python libraries to financial data analysis, 1st ed. Packt Publishing, 2020.
[43] K. Wang, K. Li, L. Zhou, Y. Hu, and Z. Cheng, "Multiple convolutional neural networks for multivariate time series prediction," Neurocomputing, vol. 360, pp. 107–119, 2019.
[44] E. Hoseinzade and S. Haratizadeh, "CNNpred: CNN-based stock market prediction using a diverse set of variables," Expert Syst. Appl., vol. 129, pp. 273–285, 2019.
[45] L. Ni, Y. Li, X. Wang, J. Zhang, J. Yu, and C. Qi, "Forecasting of Forex Time Series Data Based on Deep Learning," Procedia Comput. Sci., vol. 147, pp. 647–652, 2019.
[46] I. Halimi, G. I. Marthasari, and Y. Azhar, "Prediksi Harga Emas Menggunakan Univariate Convolutional Neural Network [Gold Price Prediction Using a Univariate Convolutional Neural Network]," J. Repositor, vol. 1, no. 2, p. 105, 2019.
[47] A. Vidal and W. Kristjanpoller, "Gold volatility prediction using a CNN-LSTM approach," Expert Syst. Appl., vol. 157, 2020.
[48] I. E. Livieris, E. Pintelas, and P. Pintelas, "A CNN-LSTM model for gold price time-series forecasting," Neural Comput. Appl., vol. 32, no. 23, pp. 17351–17360, 2020.
[49] R. Yamashita, M. Nishio, R. K. G. Do, and K. Togashi, "Convolutional neural networks: an overview and application in radiology," Insights Imaging, vol. 9, no. 4, pp. 611–629, Aug. 2018.
[50] S. Singhal, H. Kumar, and V. Passricha, "Prediction of Heart disease using DNN," Am. Int. J. Res. Sci. Technol. Eng. Math., pp. 257–261, Nov. 2018.
[51] G. T. Taye, H. J. Hwang, and K. M. Lim, "Application of a convolutional neural network for predicting the occurrence of ventricular tachyarrhythmia using heart rate variability features," Sci. Rep., vol. 10, no. 1, pp. 1–7, 2020.
[52] M. Afrasiabi, H. Khotanlou, and M. Mansoorizadeh, "DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features," Vis. Comput., vol. 36, no. 6, pp. 1127–1139, 2020.
[53] P. Liu, J. Liu, and K. Wu, "CNN-FCM: System modeling promotes stability of deep learning in time series prediction," Knowledge-Based Syst., vol. 203, p. 106081, 2020.
[54] Z. Zhang, Y. Dong, and Y. Yuan, "Temperature Forecasting via Convolutional Recurrent Neural Networks Based on Time-Series Data," Complexity, vol. 2020, 2020.
[55] A. G. Salman, B. Kanigoro, and Y. Heryadi, "Weather Forecasting using Deep Learning Techniques," in 2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2015, pp. 281–285.
[56] T. T. Kieu Tran, T. Lee, J. Y. Shin, J. S. Kim, and M. Kamruzzaman, "Deep learning-based maximum temperature forecasting assisted with meta-learning for hyperparameter optimization," Atmosphere, vol. 11, no. 5, pp. 1–21, 2020.
[57] Z. Alameer, M. A. Elaziz, A. A. Ewees, H. Ye, and Z. Jianhua, "Forecasting gold price fluctuations using improved multilayer perceptron neural network and whale optimization algorithm," Resour. Policy, vol. 61, pp. 250–260, 2019.