Decision Making: Applications in Management and Engineering
Vol. 4, Issue 2, 2021, pp. 225-240
ISSN: 2560-6018, eISSN: 2620-0104
DOI: https://doi.org/10.31181/dmame210402215s

A NOVEL PREDICTION ALGORITHM FOR MULTIVARIATE DATA SETS

Pinki Sagar 1, Prinima Gupta 1 and Rohit Tanwar 2*

1 Computer Science and Technology, Manav Rachna University, Haryana, India
2 School of Computer Science, University of Petroleum & Energy Studies, Dehradun, Uttarakhand, India

E-mail addresses: pinki.fet@mriu.edu.in (P. Sagar), prinima@mru.edu.in (P. Gupta), rohit.tanwar.cse@gmail.com (R. Tanwar)

Received: 26 April 2021; Accepted: 14 July 2021; Available online: 15 July 2021.

Original scientific paper

Abstract: Regression analysis is a statistical technique most commonly used for forecasting. Because of continuous transactions in today's high-paced world, data sets are becoming very large and are therefore difficult to manage and interpret. Not all independent variables can be considered for prediction, because maintaining them in the data set is costly. This paper implements a novel prediction algorithm whose emphasis is on extracting efficient independent variables from the many variables of a data set. Variables are selected on the basis of the Mean Square Error (MSE) and the coefficient of determination (r2p); the final prediction equation of the algorithm is then framed from the deviations from the actual means. This statistics-based prediction algorithm is evaluated on four parameters: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and residuals. The algorithm has been implemented on a multivariate data set, with low maintenance and preprocessing costs and low RMSE and residuals. It can also be used for one-dimensional, two-dimensional, frequent stream, time-series, and continuous data. Its impact is to enhance the accuracy of forecasting while minimizing the average error rate.

Keywords: Coefficient of determination, Mean square error, Actual means, Multiple Linear Regression (MLR), Root Mean Square Error (RMSE), Mean Square Error (MSE).

1. Introduction

Regression techniques belong to the category of supervised learning methods, in which existing training data sets guide and supervise the complete learning and prediction process. The results of supervised learning approaches depend on the algorithms used and on their complexity. In regression techniques, new values are predicted for future analysis and are calculated from historical data sets. Linear regression fits a straight line; it has two components, an intercept b0 and a coefficient b1, and one predictor, termed the independent variable. In today's scenarios, data sets are maintained with multiple attributes, which requires considerable processing time and cost during prediction. The cost of preprocessing and maintenance depends on the type of data set, but at analysis time it is not necessary to consider all attributes. In this paper a prediction algorithm based on actual means is introduced, which improves the prediction rate and reduces the cost of maintaining the data.
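To make the notation concrete, the following is a minimal R sketch of such a straight-line fit; the data here are synthetic and purely illustrative, not the data set used later in the paper:

```r
# A minimal straight-line fit y = b0 + b1*x on synthetic data.
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.3)

fit <- lm(y ~ x)        # least-squares straight-line fit
b0  <- coef(fit)[1]     # intercept b0
b1  <- coef(fit)[2]     # regression coefficient (slope) b1
y_hat <- b0 + b1 * 25   # prediction for a new value of the predictor
```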
The regression line has the following properties. The line minimizes the sum of squared differences between the observed values (the dependent variable y) and the predicted values (the ŷ values computed from the regression line). The regression line passes through the mean of x and the mean of y. In linear regression, b0 is the intercept of the regression equation and b1 is the slope of the regression line. The regression coefficient b1 is the average change in the dependent variable y per unit change in the independent variable x.

1.1. Regression Coefficients

Regression coefficients are estimates of the unknown population parameters and describe the relationship between a dependent and an independent variable. They are identified using methods such as least squares and the matrix form. The least-squares method is a simple linear forecasting approach involving a single dependent variable and a single independent variable. The prediction equation for regression is:

ŷ = b0 + b1 x                                                  (1)

The least-squares regression coefficients (Daniya et al., 2020) are:

b1 = Σ (xᵢ − x̄)(yᵢ − ȳ) / Σ (xᵢ − x̄)²                          (2)

b0 = ȳ − b1 x̄                                                  (3)

In equation (1), ŷ is the projected value of the dependent variable; linear regression uses the two coefficients b0 and b1. Here xᵢ (i = 1, ..., n) is the value of the independent variable for an observation or for a new prediction, yᵢ (i = 1, ..., n) denotes the observed values of the dependent variable y, x̄ is the mean x score, and ȳ is the mean y score. Equations (2) and (3) (Daniya et al., 2020) give the calculation of the coefficients. With multiple linear regression, matters become progressively more involved: it uses n independent variables, n + 1 regression coefficients, and the corresponding normal equations, so finding the least-squares solution requires solving n + 1 equations in n + 1 unknowns. Equation (1) is valid only for prediction on a one-dimensional data set. When prediction is done on a multivariate data set there are many independent variables, but not all of them are required. The proposed algorithm therefore selects the attributes that carry the highest weight and are most suitable for prediction; once the variables are selected, the prediction equation is formed on the basis of the actual means.

1.2. Research Contribution

This paper explains an algorithm (MIPA) applicable to multivariate data sets. In the preprocessing part, irrelevant variables are removed. The actual means of the selected variables are then calculated to identify the coefficients. The selection of variables is based on the coefficient of determination (r2p) and the mean square error (MSE). The algorithm can be applied to various types of data sets and reduces errors such as RMSE, MAE, and MAPE, so that the accuracy of prediction improves.
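Before turning to related work, the normal-equations remark in section 1.1 can be made concrete. The following R sketch solves the n + 1 normal equations directly on synthetic data (illustrative only; in practice lm() performs an equivalent computation internally):

```r
# Solving (X'X) b = X'y for n = 3 predictors, i.e. n + 1 = 4 coefficients.
# Synthetic data; the variable names are illustrative.
set.seed(2)
m  <- 50
X1 <- rnorm(m); X2 <- rnorm(m); X3 <- rnorm(m)
y  <- 1 + 2 * X1 - 0.5 * X2 + 0.1 * X3 + rnorm(m, sd = 0.2)

X <- cbind(1, X1, X2, X3)           # design matrix with intercept column
b <- solve(t(X) %*% X, t(X) %*% y)  # the n + 1 least-squares coefficients
# b agrees with coef(lm(y ~ X1 + X2 + X3)) up to numerical precision.
```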
2. Related Work

Regression techniques are important tools for prediction and analysis. Regression indicates the significant associations between the dependent and independent variables, and the strength of the impact of multiple independent variables on a dependent variable. Chai et al. (2007) introduced two prediction algorithms applicable to one-dimensional and two-dimensional stream data: the Frequent Item Prediction Method (FIPM) and Frequent Temporal Pattern Data Stream (FTPDS). Stream data are converted into discrete data to obtain dependent and independent variables so that regression models can be applied. These algorithms had some limitations, which were addressed by the later Sequence Forecast Algorithm Plane Regression (SFAPR); this plane regression algorithm is based on linear regression for two-dimensional data sets and reduces the error rate.

Kavitha et al. (2016) discussed how, with the advancement of big-data technologies, data analytics has developed remarkably in today's environment. Statistical strategies are used for the assessment of predictive models, and the choice of an accurate technique depends on the requirements of the data. Prediction and forecasting are mostly done with time-series data sets; the major applications of prediction, such as weather forecasting and finance and stock-market analysis, combine historical data with current streaming data for better accuracy. The authors partitioned the time-series data using a regression model: linear and multiple linear regression models were fitted on the training data set so that the right model could be selected for improvement.

Ostertagova et al. (2016) presented the application of a linear regression algorithm for processing stress-state data collected through drilling; a Harmonic Star Method (HSM) was used for the collection of the final data. Non-commercial software based on the harmonic star method makes it possible to automate the measurement process for the direct collection of experimental data. Such software enabled the authors to measure stresses at a specific point of the analyzed surface and, simultaneously, to separate those stresses; for example, a camera was used to transfer the image of the chromatic fringes directly to a computer.

Mustapha & Fadzil (2015) presented a regression approach for vendors to forecast their yearly profit based on their historical data; using a forecasting approach, a vendor can prepare for its evaluation exercise. The authors used various regression techniques to analyze vendor performance. The performance report demonstrates the capability of data-mining tasks in helping the Entrepreneur Development Unit (EDU) to predict vendors' performance and to identify groups of performing and under-performing vendors. The Entrepreneur Development Unit was responsible for managing a large group of vendors holding contracts with the company.

Khan et al. (2016) discussed non-linear regression under the assumption that the data lie on a variety of manifolds. They divided the data space into multiple regions to construct a piecewise linear regression as an approximation of the non-linearity between the observed and the expected data. Instead of fixing the range and limits of each region in advance, the algorithm adapts immediately to differences in the data, and it proved very successful for high-dimensional as well as small data sets.

Saptawati et al. (2015) stated that drilling is the most significant and costly activity in the mineral industry. When identifying targets for drilling, geologists relied on qualitative study, which resulted in many drilling failures.
The authors analyzed data-mining methods used to retrieve facts from the outcomes, classify the recorded data, and mine common item sets that can support the forecasting of drilling targets. The objective of the work was to reduce the threat of drilling failure and to support the industry's decisions when choosing a new drilling target.

Ilayaraja & Meyyappan (2015) discussed data-mining techniques applied in many areas of medicine for different objectives. They designed a process to estimate the risk factors of patients showing symptoms of heart disease from collected frequent data sets. Data sets for heart patients were collected from medical institutes and hospitals. Frequent data item sets are produced depending on the selected symptoms and a minimum support value. The extracted frequent item sets could help doctors make diagnostic decisions and predict the level of risk at an early stage, so that immediate treatment could be provided. The projected approach could be applied to medical data sets to help predict the risk factors, the level of risk, and the patients concerned, based on the selected item sets.

Yang et al. (2019) used both a linear model and a nonlinear model to predict future cash flow. A hybrid model integrating the two was constructed to enhance the prediction effect, calculate the fund reserve ratio, and improve the accuracy of prediction.

Summing up the literature: the prediction algorithm discussed by Zhao & Li (2005) is used for two-dimensional stream data and was based on plane regression. Chai et al. (2007) discussed prediction algorithms for one-dimensional and two-dimensional stream data. Later, a non-linear regression algorithm was proposed for one- and two-dimensional stream data. Algorithms for multi-dimensional data sets were discussed after that, but they incur considerable cost and time for the maintenance and preprocessing of data. In this paper, the proposed method minimizes the cost of storing and preprocessing data sets, increases the accuracy of prediction, and decreases the error rate.

Antoniadis et al. (2021) reviewed the field of sensitivity analysis and targeted the link between random forests and global sensitivity analysis (GSA). The concept is to use the random forest technique as an effective non-parametric method for building a meta-model that permits efficient sensitivity analysis. In addition to their straightforward applicability to regression problems, random forest methods also have the flexibility to implicitly handle correlated and high-dimensional data. The authors used a rank-based random forest (RF) variable index to define sensitivity indices, reviewed a suitable tool set for quantifying the importance of variables, and used these tools to reduce the dimensionality of the model, thereby conducting sensitivity analysis studies that could not otherwise be performed.

Yıldırım et al. (2021) used a popular deep learning tool known as long short-term memory (LSTM), which has been shown to be very effective in many time-series forecasting problems. They proposed a hybrid model, built on two data sets, that combines two separate LSTMs to improve the prediction, and the model was found to give good results on real data.
Mukherjee et al. (2019) proposed a model to predict images, cast as a spatio-temporal sequence forecasting problem. They trained a Convolutional Long Short-Term Memory (Conv-LSTM) network to learn the temporal relationships while preserving the spatial information present in the latent space. In this method, the encoder and decoder networks are first trained to learn the spatial features of the data; the Conv-LSTM is then inserted between the encoder and the decoder, the encoder-decoder weights are frozen, and the Conv-LSTM is trained. In the experiments, the loss function is used to predict the next set of frames of a video from a given set of frames.

Gauba et al. (2017) proposed a novel approach to predict the rating of video advertisements based on a multimodal framework that combines physiological analysis of the user with the global sentiment rating available on the internet. In this framework, EEG signals are recorded while the users watch the video advertisement. To predict the rating of an advertisement from the EEG data, they used a regression technique based on random forests; the EEG-based rating is then combined with an NLP-based sentiment score to improve the overall prediction.

3. Proposed Work

The literature survey identified various algorithms introduced for prediction using regression. Some of the existing algorithms and methods discussed in section 2 are used in the prediction process for forecasting. The proposed algorithm is based on the selection of efficient variables and the prediction of new dependent values with low residuals, Root Mean Square Error, Mean Absolute Error, and Mean Absolute Percentage Error.

3.1. Problem Formulation

Forecasting is the estimation of a dependent variable y based on independent variables x. Some algorithms have been implemented for one-dimensional and two-dimensional data sets; these data sets can contain stream, continuous, or discrete data. Most data sets have multiple independent variables, and if the prediction equation uses all of them, the consequences are extra data-set maintenance cost, longer execution time, and a low accuracy rate of prediction. The proposed algorithm focuses on selecting the relevant independent variables from multivariate data sets and improving accuracy, because a multivariate data set consists of many independent variables, not all of which are required for the prediction model, and it is very difficult to maintain a huge data set with multiple variables owing to the high cost of maintenance and preprocessing. The literature survey also showed that existing algorithms restrict the number of independent variables to one- and two-dimensional stream data. The objective of the proposed algorithm is to reduce the errors and achieve more accurate predictions. The algorithm is intended for data sets with numerous independent variables: it reduces the maintenance cost by selecting the appropriate variables for prediction, and the prediction equation is then framed on the basis of the actual means. The selection of variables is based on the coefficient of determination, and the prediction equation is based on the actual means.
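The two selection quantities just named are available directly from a fitted model in R; the following is a minimal sketch, where the data frame df and its column names are hypothetical stand-ins rather than the actual data set:

```r
# Selection quantities for one candidate model; `df`, `y`, and X1..X3
# are hypothetical names.
fit <- lm(y ~ X1 + X2 + X3, data = df)
r2p <- summary(fit)$r.squared    # coefficient of determination
mse <- mean(residuals(fit)^2)    # mean square error of the fit
# (An ANOVA-style residual mean square would divide by the residual
# degrees of freedom instead of by the sample size.)
```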
This algorithm is applied to the "Energy Data Prediction" data set from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction). In the proposed prediction algorithm there is one dependent variable and n independent variables. The dependent variable is humidity (averaged over all areas of the building), and the independent variables are average temperature (averaged over all areas of the building), pressure, RH_out, wind speed, visibility, etc.

3.2. Multivariate Item Prediction Algorithm (MIPA)

In the algorithm, coefficients are calculated for the independent variables using the actual means of the dependent and independent variables; the formula for the coefficient calculation is explained in Step 4 of the detailed discussion below.

Algorithm Name: Multivariate Item Prediction Algorithm (MIPA)
Input: Multivariate data set
Output: Low residuals, RMSE, MAE, and MAPE during prediction

Step 1: Take the 2^n regression equations, where n is the number of independent variables.
Step 2: Categorize all the equations into models: Model 1 = no independent variable, Model 2 = 1 independent variable, Model 3 = 2 independent variables, and so on.
Step 3: Using ANOVA, compare r2p and MSE for each regression equation within each model. For i = 0 to 2^n, select the regression equation satisfying the condition (r2p ↑, MSE ↓), i.e., the highest value of r2p together with the lowest value of MSE among the 2^n equations.
Step 4: From each model, select the regression equation with the highest r2p and the lowest MSE.
Step 5: Compare the equations selected from each model and keep the one with the lowest MSE and the highest r2p (this completes the selection of the independent variables).
Step 6: Find the deviation from the actual mean for each selected independent variable:

dxᵢ = xᵢ − x̄,   dyᵢ = yᵢ − ȳ   (i = 1, ..., n)

b_yxᵢ = [n Σ(dxᵢ · dy) − Σdxᵢ · Σdy] / [n Σdxᵢ² − (Σdxᵢ)²]

Step 7: Put the values of b_yxᵢ (i = 1 to n) into the prediction model:

Ŷ = b_yx1 (x1 − x̄1) + b_yx2 (x2 − x̄2) + b_yx3 (x3 − x̄3) + ... + b_yxn (xn − x̄n) + ȳ

Step 8: Analyse the prediction (RMSE and residuals):

RMSE = √[ Σ (Y − Ŷ)² / N ]
Residuals = actual values − predicted values (y − ŷ)

The MIPA algorithm is discussed in detail as follows.

Step 1: Find the 2^n regression models, where n is the number of independent variables; e.g., with 4 independent variables there are 16 possible equations in total. In Table 1, Model 1 considers no independent variable, Model 2 one independent variable, Model 3 two independent variables, and so on. Table 1 shows all possible regression models.

Table 1. The 2^n possible regression equation models.

Model 1: y = b0 + e
Model 2: y = b0 + b1X1;  y = b0 + b2X2;  y = b0 + b3X3;  y = b0 + b4X4
Model 3: y = b0 + b1X1 + b2X2;  y = b0 + b1X1 + b3X3;  y = b0 + b1X1 + b4X4;  y = b0 + b2X2 + b3X3;  y = b0 + b2X2 + b4X4;  y = b0 + b3X3 + b4X4
Model 4: y = b0 + b1X1 + b2X2 + b3X3;  y = b0 + b1X1 + b2X2 + b4X4;  y = b0 + b1X1 + b3X3 + b4X4;  y = b0 + b2X2 + b3X3 + b4X4
Model 5: y = b0 + b1X1 + b2X2 + b3X3 + b4X4

Step 2: Find the analysis of variance (ANOVA) table for each of the 2^n regression models and select, from each model, the regression equation with the highest r2p (coefficient of determination) and the lowest MSE. A sketch of this enumeration in R appears below, and Table 2 then lists the resulting values of r2p and MSE for each model of Table 1.
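In the sketch, the data frame df and the predictor names X1..X4 are hypothetical stand-ins for the energy data set; this is an illustrative reading of Steps 1-3, not the authors' released code:

```r
# Steps 1-3: fit all 2^n predictor subsets and record r2p and MSE.
preds <- c("X1", "X2", "X3", "X4")
results <- NULL
for (k in 0:length(preds)) {
  subsets <- if (k == 0) list(character(0)) else combn(preds, k, simplify = FALSE)
  for (s in subsets) {
    rhs <- if (length(s) == 0) "1" else paste(s, collapse = " + ")
    fit <- lm(as.formula(paste("y ~", rhs)), data = df)
    results <- rbind(results,
                     data.frame(model = rhs,
                                r2p   = summary(fit)$r.squared,
                                mse   = mean(residuals(fit)^2)))
  }
}
# Order by lowest MSE, then highest r2p -- mirroring the choice of the
# 95.87 / 0.07 entry over 95.91 / 0.08 in the discussion of Table 2.
best <- results[order(results$mse, -results$r2p), ][1, ]
```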
Table 2. MSE and r2p (coefficient of determination) for the models of Table 1.

           Model 2            Model 3            Model 4            Model 5
      r2p (%)   MSE      r2p (%)   MSE      r2p (%)   MSE      r2p (%)   MSE
       85.35    0.24      95.74    0.07      95.87    0.07      95.91    0.08
       12.08    1.47      90.91    0.16      95.83    0.07
       88.11    0.19      93.96    0.11      93.96    0.11
        3.70    1.61      88.14    0.21      91.33    0.16
                          76.92    0.40
                          88.35    0.20

Step 3: Model 2 consists of 1 independent variable, Model 3 of 2 independent variables, Model 4 of 3 independent variables, and so on. The condition r2p ↑, MSE ↓ means that as the value of r2p increases, the value of MSE should decrease. Accordingly, the entry in Table 2 with the values 95.87 and 0.07 is selected; it belongs to the first regression equation of Model 4 in Table 1 and includes the independent variables X1, X2, and X3, meaning that these three variables are the most relevant for the prediction algorithm.

Figure 1(a). Highest coefficient of determination r2p; Figure 1(b). Lowest Mean Square Error (MSE)

Figure 1(a) plots the coefficients of determination selected as the highest values from each model, and Figure 1(b) plots the corresponding lowest MSE values from Table 2. In Figure 1(a) the value 95.91 is the highest, but its corresponding MSE is higher, so 95.87 is selected, corresponding to the lowest MSE, i.e., 0.07.

Step 4: The proposed regression algorithm is derived from the deviations from the actual means as follows. Find dx and dy on the basis of the actual means:

dxᵢ = xᵢ − x̄   (i = 1, ..., n)                                 (4)

dyᵢ = yᵢ − ȳ   (i = 1, ..., n)                                 (5)

b_yxᵢ = [n Σ(dxᵢ · dy) − Σdxᵢ · Σdy] / [n Σdxᵢ² − (Σdxᵢ)²]     (6)

Ŷ = b_yx1 (x1 − x̄1) + b_yx2 (x2 − x̄2) + b_yx3 (x3 − x̄3) + ... + b_yxn (xn − x̄n) + ȳ   (7)

Step 5: Find the RMSE using equation (8). A low RMSE during prediction means that the accuracy of the prediction is high.

RMSE = √[ Σ (Y − Ŷ)² / N ]                                     (8)

Step 6: Analyse the residual error (actual y − ŷ).

In the proposed algorithm, Steps 1, 2, and 3 are preprocessing steps used to select the efficient variables required for the prediction equation (7). In Step 4, equation (6) gives the coefficient calculations, and the prediction equation (7) is drafted for forecasting new values. As mentioned, the algorithm is also valid for other types of data sets, so the approach explained above, applied to a multivariate data set, was extended further. It was also applied to a one-dimensional data set from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Parking+Birmingham); the data set contains four attributes: parking id (System Code Number), parking capacity (Capacity), parking rates, and update details. In this data set, parking occupancy is the independent variable and the parking rate is the dependent variable. The coefficients of the independent variables are calculated using equations (4), (5), and (6), and the values of byx for each independent variable are then placed in equation (7). The proposed prediction equation is also applicable to stream data; for prediction on stream data, the stream must first be converted into discrete data sets.
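To make Steps 4-6 concrete, the following is a minimal R sketch of the deviation-based coefficients of equation (6), the prediction of equation (7), and the RMSE of equation (8). It is an illustrative reading of the formulas, not the authors' released code; the data frame and column names are hypothetical:

```r
# MIPA coefficient, prediction, and RMSE computations, equations (4)-(8).
mipa_fit <- function(df, y_col, x_cols) {
  y  <- df[[y_col]]
  dy <- y - mean(y)                      # equation (5)
  byx <- sapply(x_cols, function(v) {
    dx <- df[[v]] - mean(df[[v]])        # equation (4)
    n  <- length(dx)
    # equation (6); since dx and dy are deviations from the actual means,
    # sum(dx) and sum(dy) are zero and this reduces to sum(dx*dy)/sum(dx^2)
    (n * sum(dx * dy) - sum(dx) * sum(dy)) / (n * sum(dx^2) - sum(dx)^2)
  })
  list(byx = byx, x_means = sapply(df[x_cols], mean), y_mean = mean(y))
}

mipa_predict <- function(model, newdata) {       # equation (7)
  dev <- sweep(as.matrix(newdata[names(model$byx)]), 2, model$x_means)
  as.vector(dev %*% model$byx) + model$y_mean
}

rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))  # equation (8)
```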
4. Implementation and Result

The algorithm is implemented in R, version 3.3.2. The one-dimensional and multivariate data sets were collected from an online repository.

4.1. Implementation

The multivariate data collection includes one dependent variable (Humidity) and 4 independent variables (wind speed, average temperature, T_out, Press_mm_hg). The one-dimensional data set has the attributes parking id (System Code Number), parking capacity (Capacity), and parking rates; the capacity of parking is the independent variable and the parking rate is the dependent variable.

For multivariate data sets, all possible equations over the independent variables are considered in this process, and the appropriate independent variables for the prediction system are identified by preprocessing: the 2^n equations, where n is the number of independent variables, are considered, as shown in Table 1. In a multivariate data set it is not necessary to consider all variables in the prediction algorithm; the coefficient of determination and the MSE are used to find the relevant variables. The independent variables do not all have equal significance or priority, so a few variables can be eliminated. For the one-dimensional data set there is no need to preprocess the data; the regression equation is applied directly to the attributes x and y.

4.2. Results

In this paper the MIPA algorithm is compared with MLR because both deal with multiple independent variables. The RMSE values of the MLR and MIPA algorithms are plotted in Figure 4; compared with MLR, the RMSEs of the MIPA algorithm are low. Table 3 compares the residuals generated by the MLR and MIPA algorithms; the values correspond to humidity. From Table 3 it is easy to observe that the improved algorithm gives low error rates when predicting humidity from temperature, wind speed, and the other independent variables.

Table 3. Residuals using MLR and MIPA.

Humidity  Average(temp)=X1  Press_mm_hg=X2  RH_out=X3   Windspeed=X4  Residuals MIPA  Residuals MLR
50.91     17.1674074        733.5           92          7             0.42107573      0.99597124
50.83     17.1496296        733.6           92          6.66666666    0.00879531      1.17010151
50.63     17.1037037        733.7           92          6.33333333    0.55354625      1.48437316
50.57     17.0670370        733.8           92          6             0.95598559      1.64734098
50.73     17.0707407        733.9           92          5.66666666    1.10794818      1.56379616
50.79     17.0485185        734             92          5.33333333    1.37981300      1.59155517
50.79     17.0407407        734.1           92          5             1.70193451      1.6768479
50.8      17.0185185        734.166666      91.8333333  5.16666666    1.63454412      1.67028356
50.9      17.0185185        734.233333      91.6666666  5.33333333    1.46779281      1.56250345
51.05     17.0396296        734.3           91.5        5.5           1.21976676      1.38058893
51.23     17.0667592        734.366666      91.3333333  5.66666666    0.93158178      1.15983686
51.47     17.1103703        734.433333      91.1666666  5.83333333    0.58419480      0.87670657
51.85     17.1851851        734.5           91          6             0.04521525      0.41176678
52.68     17.2149074        734.616666      90.5        6             0.65000026      0.44031096
53.52     17.2522222        734.733333      90          6             1.37553362      1.32006378
53.5      17.2866666        734.85          89.5        6             1.24106697      1.33981661
53.38     17.3107407        734.966666      89          6             0.98628249      1.24189434
53.38     17.3133333        735.083333      88.5        6             0.83118018      1.24629699
52.97     17.3196296        735.2           88          6             0.27623678      0.84953717
54.37     17.3748148        735.233333      87.8333333  6             1.67429658      2.2952218
55.07     17.465            735.266666      87.6666666  6             2.42625301      3.0850061
54.9      17.4588888        735.3           87.5        6             2.19335931      2.90766546
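The residuals in Table 3, and the RMSE, MAE, and MAPE analysed in section 5, can be computed in a few lines of R; a minimal sketch, where actual and predicted are hypothetical numeric vectors of equal length:

```r
# Evaluation parameters used in sections 4.2 and 5.
res  <- actual - predicted       # residuals (y - y_hat)
rmse <- sqrt(mean(res^2))        # Root Mean Square Error
mae  <- mean(abs(res))           # Mean Absolute Error
mape <- mean(abs(res / actual)) * 100   # Mean Absolute Percentage Error, in %
```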
4.3. Analysis with Existing Algorithms

The MIPA algorithm can be used for various types of data sets, such as stream data sets (one-dimensional and two-dimensional), time-series data sets, and multivariate data sets. It identifies the relevant variables in the data set that are best suited for the prediction. In FIPM and FTPDS, preprocessing of the data is done with a sliding-window protocol, and these methods are applicable only to one-dimensional and two-dimensional stream data sets. MLR is used for data sets where multiple independent variables are used for prediction, but it consumes a great deal of time and cost. In Table 4, MIPA is compared with MLR, FIPM, and FTPDS on the residuals parameter. Residuals are the errors: the differences between the observed values of the dependent variable and the predicted values. The independent variables are listed in Table 3; humidity is predicted on the basis of the selected independent variables.

Table 4. Analysis of residuals of the existing algorithms and the proposed algorithm.

Humidity   MIPA          MLR         FIPM        FTPDS
50.91      0.421075736   0.9959712   1.3323720   0.58600286
50.83      0.008795312   1.1701015   1.4105585   0.786391724
50.63      0.553546256   1.4843731   1.6086900   1.114464984
50.57      0.955985598   1.6473409   1.6668764   1.297415314
50.73      1.107948187   1.5637961   1.5050629   1.252681248
50.79      1.379813003   1.5915551   1.4431944   1.313070113
50.79      1.70193451    1.6768479   1.4413808   1.430897512
50.8       1.634544126   1.6702835   1.4323151   1.420897512

Figure 2. Plotting of residuals during prediction

In Figure 2 the residuals of the existing algorithms and of MIPA are plotted; MIPA has low residuals in comparison with MLR, FIPM, and FTPDS.

Figure 3. Analysis of average of residuals

In Figure 3 the average residuals of the algorithms are compared: MIPA has the lowest average residual value, 0.71525, while the average residual values of MLR, FIPM, and FTPDS are 0.7664, 0.9615, and 0.8799 respectively.

5. Analysis of Results

The analysis of MIPA is done on the basis of parameters such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and residuals.

5.1. Analysis of RMSE

RMSE is the standard deviation of the residuals, i.e., of the prediction errors. It measures how spread out the residuals are; residuals measure how far the data points lie from the regression line. The analysis of MLR and MIPA shows that MIPA has lower RMSE values than MLR. In Figure 4, MIPA's RMSE values of 0.647776757, 0.582948804, 0.516691943, and 0.496299287 are plotted; they are low in comparison with the RMSE values calculated for MLR.

Figure 4. Analysis of Root Mean Square Errors (RMSE)

5.2. Analysis of Residuals

In Figure 5, the red line plots the residuals 0.99597124, 1.17010151, 1.48437316, etc., generated by the existing MLR algorithm, and the blue line plots the residuals 0.421075736, 0.008795312, 0.553546256, etc., generated by MIPA. These values correspond to humidity. From Figure 5 it is easy to observe that the improved algorithm gives low error rates when predicting humidity from temperature, wind speed, and the other independent variables.

Figure 5. Analysis of residuals during prediction

5.3. Analysis of Mean Absolute Error (MAE)

MAE takes the absolute difference between the actual and predicted values and averages it. Taking the absolute value is crucial because it prevents error values from cancelling each other out.
For example, averaging 1 and −1 gives 0, because the two values cancel each other out. In Figure 6 the MAE of the MLR and MIPA algorithms is compared; MIPA gives the better result. The MIPA prediction algorithm gives a better selection of the important predictors (independent variables) out of the many independent variables in a data set, which reduces the cost of maintaining and collecting the data sets.

Figure 6. Analysis of Mean Absolute Error (MAE)

5.4. Analysis of Mean Absolute Percentage Error (MAPE)

MAPE, often referred to as the mean absolute percentage deviation (MAPD), is a measure of the prediction accuracy of a statistical forecasting system. In Figure 7 the MAPE of the MLR and MIPA algorithms is compared; MIPA gives the better result.

Figure 7. Analysis of Mean Absolute Percentage Error (MAPE)

5.5. Low cost in execution and maintenance of data sets

In the MIPA algorithm, only relevant variables are considered in the data analysis and prediction process. The irrelevant variables are eliminated during the variable-selection (preprocessing) stage; by eliminating them, the maintenance cost of the data is reduced and execution takes less time. Initially the data set uses four independent variables, X1, X2, X3, and X4; after preprocessing, X1, X2, and X3 are selected as the relevant independent variables and the irrelevant variable X4 is eliminated. For the further prediction process X4 no longer has to be maintained as an independent variable, so the prediction process takes less time.

6. Conclusion and Future Scope

In this paper, the MIPA algorithm, based on actual mean values, has been presented. The algorithm is analysed on parameters such as RMSE, residuals, MAE, and MAPE; the accuracy of the prediction algorithm is reflected in its low RMSE and residuals. In this algorithm, the deviation from the actual mean is estimated for each relevant independent variable, and the prediction equation is framed using these estimates. The prediction equation for n independent variables predicts "Humidity" from the remaining independent variables. From Figure 3 it can be seen that the average residuals of MIPA are 5.11% lower than those of MLR, 24.62% lower than those of FIPM, and 16.46% lower than those of FTPDS. Section 5 showed that MIPA performs better than MLR. MIPA is a regression-based algorithm that can also be used in medical areas for the prediction of diseases based on patients' symptoms. The analysis shows that the error rates of the implemented regression algorithm are lower than those of MLR, and that it reduces the cost of data maintenance and the execution time. It can help in forecasting diseases, company revenues, production, weather, and in other areas.

Author Contributions: Each author participated and contributed adequately, taking accountability for appropriate portions of the content.
Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflicts of interest.

References

Antoniadis, A., Lambert-Lacroix, S., & Poggi, J.-M. (2021). Random forests for global sensitivity analysis: A selective review. Reliability Engineering & System Safety, 28, 193-222.

Chai, D. J., Kim, E. H., Jin, L., Hwang, B., & Ryu, K. H. (2007). Prediction of frequent items to one dimensional stream data. International Conference on Computational Science and its Applications (ICCSA), 353-360. IEEE.

Daniya, T., Geetha, M., & Cristin, B. (2020). Least square estimation of parameters for linear regression. International Journal of Control and Automation, 13, 447-452.

Gauba, H., Kumar, P., Roy, P. P., Singh, P., Dogra, D. P., & Raman, B. (2017). Prediction of advertisement preference by fusing EEG response and sentiment analysis. Neural Networks, 92, 77-88.

Ilayaraja, M., & Meyyappan, T. (2015). Efficient data mining method to predict the risk of heart diseases through frequent itemsets. Procedia Computer Science, 70, 586-592.

Kavitha, S., Varuna, S., & Ramya, R. (2016). A comparative analysis on linear regression and support vector regression. International Conference on Green Engineering and Technologies (IC-GET), 1-5. IEEE.

Khan, F., Kari, D., Karatepe, I. A., & Kozat, S. S. (2016). Universal nonlinear regression on high dimensional data using adaptive hierarchical trees. IEEE Transactions on Big Data, 2(2), 175-188.

Mukherjee, S., Ghosh, S., Ghosh, S., Kumar, P., & Roy, P. P. (2019). Predicting video-frames using encoder-convlstm combination. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2027-2031. IEEE.

Mustapha, A., & Fadzil, F. (2015). A regression approach for forecasting vendor revenue in telecommunication industries. International Journal of Engineering and Technology, 6, 2604-2608.

Ostertagova, E., Frankovsky, P., & Ostertag, O. (2016). Application of polynomial regression models for prediction of stress state in structural elements. Global Journal of Pure and Applied Mathematics, 12, 3187-3199.

Saptawati, G. A. P., & Nata, G. N. M. (2015). Knowledge discovery on drilling data to predict potential gold deposit. International Conference on Data and Software Engineering (ICoDSE), 143-147. IEEE.

Yang, X., Mao, S., Gao, H., Duan, Y., & Zou, Q. (2019). Novel financial capital flow forecast framework using time series theory and deep learning: A case study analysis of Yu'e Bao transaction data. IEEE Access, 7, 70662-70672.

Yıldırım, D. C., Toroslu, I. H., & Fiore, U. (2021). Forecasting directional movement of Forex data using LSTM with technical and macroeconomic indicators. Financial Innovation, 7, 1-36.

Zhao, F., & Li, Q. (2005). A plane regression-based sequence forecast algorithm for stream data. International Conference on Machine Learning and Cybernetics (ICMLC), 1559-1562. IEEE.

© 2021 by the authors. Submitted for possible open access publication under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).