Kurdistan Journal of Applied Research (KJAR)
Print-ISSN: 2411-7684 | Electronic-ISSN: 2411-7706
Website: Kjar.spu.edu.iq | Email: kjar@spu.edu.iq

A Novel Approach for Stock Price Prediction Using Gradient Boosting Machine with Feature Engineering (GBM-wFE)

Rebwar M. Nabi, Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani, Iraq, Rebwar.nabi@spu.edu.ia
Soran Ab. M. Saeed, VP for Scientific Affairs, Sulaimani Polytechnic University, Sulaimani, Iraq, Soran.saeed@spu.edu.iq
Habibollah Harron, University of Technology Malaysia, Johor, Malaysia, habib@utm.my

Article Info
Volume 5 – Issue 1 – June 2020
DOI: 10.24017/science.2020.1.3
Article history: Received: 27 January 2020; Accepted: 10 March 2020

ABSTRACT
The prediction of stock prices has become an exciting area for researchers and academics due to its economic impact and potential business profits. This study proposes a novel multiclass classification ensemble learning approach for predicting stock prices from historical data using feature engineering. The proposed approach comprises four main steps: pre-processing, feature selection, feature engineering, and ensemble methods. We use 11 datasets from Nasdaq and S&P 500 to evaluate the accuracy of the proposed approach. Furthermore, eight feature selection algorithms are studied and implemented. More importantly, feature engineering is applied to construct two new features, which appear very promising for improving classification accuracy; this is considered the first study to use feature engineering for multiclass classification with ensemble methods. Finally, seven ensemble machine learning (ML) algorithms are used and compared to identify the best-performing prediction model. In addition, the best-performing feature selection algorithm is identified.
This study proposes a novel multiclass classification approach called Gradient Boosting Machine with Feature Engineering (GBM-wFE), with Principal Component Analysis (PCA) as the feature selection method. We find that GBM-wFE outperforms previous studies and that the overall prediction results are promising: a MAPE of 0.0406% is achieved, which is the best result among the available studies in the literature.

Keywords: Stock Market Prediction, Feature Engineering, Feature Selection, Machine Learning, Predictive Analysis, Predictable Movement, Multiclass Classification

Copyright © 2020 Kurdistan Journal of Applied Research. All rights reserved.

1. INTRODUCTION
1.1. Background
Stock market prediction has attracted considerable attention from scholars as well as from industry. However, the question remains whether the historical price of a stock can be used to predict its future price [1]. The Efficient Market Hypothesis (EMH) and the random walk theory (RWT) are among the oldest works on stock market prediction [1], [2]. Both EMH and RWT state that stock prices are difficult to predict because they are affected mostly by news rather than by historical data. Consequently, the classification accuracy reached only 50% [3]. In contrast, numerous studies [4–14] dispute the claims of EMH and RWT and propose that stock prices can be forecast. Stock price prediction (SPP) is crucial in the financial world [7], [8], [12], as a reasonably accurate forecast can yield substantial business returns and hedge against market risks. Nevertheless, it remains hard to predict stock prices, since the financial market is a complex, evolutionary, and nonlinear dynamic system that interacts with political events, economic conditions, and traders’ expectations [12].
Nevertheless, accurate prediction of stock prices in the short term (one day or five days ahead), the intermediate term (ten or 15 days ahead; 20 or 30 days ahead), and the long term (quarterly) is regarded as one of the most attractive and meaningful research topics in investment and its applications. The payoffs at stake in forecasting have encouraged researchers to develop novel and advanced tools and approaches. On the whole, there are two common approaches to SPP: Fundamental Analysis (FA) and Technical Analysis (TA). FA uses economic factors to estimate the intrinsic value of securities, while TA is based on the historical prices of the stock. So far, several studies have addressed SPP. Regarding the techniques used to analyze stocks, several are based on statistical approaches, although the majority are built on artificial intelligence (AI) and machine learning (ML) algorithms [8]. Financial data is frequently chaotic, noisy, and nonlinear, and rarely follows a fixed pattern. Consequently, statistical methods such as “moving average”, “weighted moving average filtering potential smoothing”, “regression analysis”, “autoregressive moving average”, and “autoregressive integrated moving average” do not achieve acceptable classification accuracy (CA) [5]. On the other hand, AI algorithms are able to handle the random, chaotic, and nonlinear data of the stock market and have been applied extensively [5]. The Artificial Neural Network (ANN), Bayesian Analysis, K-Nearest Neighbor, and Decision Tree are a few examples of AI algorithms [15]. Therefore, this study aims to propose a novel multiclass classification approach to SPP using feature engineering.
To the best of our knowledge, this study is the first to investigate and implement feature engineering for stock prediction with ensemble methods. To support this claim, we searched international databases such as Science Direct (https://www.sciencedirect.com), Elsevier (https://www.elsevier.com/en-xm), Scopus (https://www.scopus.com/home.uri), IEEE Digital Library (https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=2188), Springer (https://www.springer.com/gp), and ACM (https://www.acm.org/). Several other platforms and databases were also searched, for example, Google Scholar, EBSCO Information Services, and DOAJ. We study and compare current feature selection (FS) algorithms to identify the top-performing one. Finally, we compare the available ensemble methods and other ML algorithms to find the best classifier.

The rest of the study is structured as follows. Section two outlines the background and related work. Section three explains the methodology in detail. Section four presents the results and discussion. Section five concludes the paper.

2. RELATED WORK
Numerous studies have been published in the literature. Jie Sun et al. [16] proposed the AdaBoost support vector machine combined with concept drift on weighting time (ADASVM-TW) to predict financial distress. The results were promising: the proposed ADASVM-TW outperformed the single SVM algorithm. The authors in [17] proposed using Random Forest as an ensemble method to predict the stock return value.
The algorithm was used to minimize the prediction error by treating the problem as a classification task. Technical indicators such as the Relative Strength Index (RSI) and the Stochastic Oscillator (SO) were used as inputs to train multiple decision trees. Dong [18] introduced dynABE, a dynamic Advisor-Based Ensemble for stock price prediction that identifies the relevant sectors for the companies of interest and partitions the feature set among several advisors, so that each advisor handles a different area and follows the proposed ensemble procedure. Dong’s approach achieved a misclassification error of 31.12%. EALasso was proposed in [19] as a feature selection approach for multiclass learning problems that retains the oracle properties of identifying the correct subset model and having the optimal estimation accuracy. Researchers in [20] recommended a hybrid expert system consisting of a knowledge base (KB) and AI. The KB was used to collect historical prices, several well-known technical indicators, counts and sentiment scores of published news about the stock, Google trends for the given stock ticker, and the number of unique visitors to Wikipedia pages. In the AI component, several ensemble algorithms were implemented, such as NN regression, SV regression, boosted RT, and RF regression. A MAPE ≤ 1.50% was achieved.

Feature engineering is a broad subject, and numerous approaches have been proposed, predominantly in the area of automatic feature learning. Stock market data is commonly understood to cover daily prices, earnings statements from individual companies, and opinion articles from experts. When constructing new features [21], it is advantageous for the result to be interpretable. Explainable features and approaches are more accessible, which yields better forecasting results.
In addition, it is a good idea to add complexity to improve the CA. The main aim of feature engineering is to arrive at the optimal features for the task. Stock market data lends itself to analysis with mathematical theories and applications. A mathematical model that describes the relations used to forecast stock prices could be a function that maps a company’s earnings history, historical prices, and trading activity to the forecast stock price. Researchers in [22] implemented a NN method for stock forecasting, and the CA improved markedly. Furthermore, several studies have implemented feature engineering; however, most are unrelated to stock prediction. Researchers in [23] used feature engineering for fault diagnosis of induction motors. A semantic feature model in concurrent engineering was presented in [24]. Another study used feature engineering for energy theft detection with gradient boosting and found useful combinations of the original features [25]. The authors of [26] suggested a feature engineering approach for short-term earthquake forecasting using the AETA dataset. Feature engineering for search advertising recognition was investigated in [27]. Researchers in [28] proposed feature engineering for stock prediction but considered only binary classification; they found a significant improvement in prediction performance.

However, the majority of data related to the stock market and financial stress is imbalanced and multiclass [9], [15]. An imbalanced dataset is one in which one or a few classes have considerably more samples than the others. Typical algorithms such as “logistic regression”, “SVM”, and DT are suitable for balanced datasets; when facing imbalanced data, however, they frequently deliver suboptimal CA. The imbalance problem and multiclass classification have attracted many researchers.
However, the approaches published to date do not provide good prediction accuracy. Ensemble learning techniques have been studied only recently [29], [30]. This can therefore be considered a significant gap in the area, as ensemble methods have been shown to outperform other algorithms [8], [15], [29], [31]. However, the main issue, providing a good voting algorithm that fuses the weights of different classifiers into a correct aggregated decision, has not been solved optimally: it faces the local optima problem, which is tackled with heuristic techniques, making the approach very limited and not applicable to the general class of problems. Therefore, based on the above literature, there is significant room for improvement in accuracy.

3. EXPERIMENT METHODOLOGY
3.1. Research Framework
The research framework of this study comprises several phases: literature review and problem definition, dataset collection, data pre-processing, feature engineering, feature selection, finding the best ensemble classifier, proposing the novel multiclass classification approach GBM-wFE for stock prediction, and evaluation analysis. Figure 1 illustrates the overall research framework.

Figure 1: A research framework.

As Figure 1 shows, in the first phase a thorough study of stock prediction is conducted to survey recent work, identify open problems, and formulate a potential solution. In phase two, the datasets are downloaded from international sources, namely Nasdaq and the S&P 500 index, to evaluate our models and approaches. In phase three, the downloaded datasets are pre-processed to assign a class to every month for multiclass classification. Next, feature engineering takes place in phase four, in which two new features are engineered and added to the original dataset.
Furthermore, feature selection is phase five, in which the different feature selection algorithms available in WEKA are applied. In phase six, the best ensemble method is identified and GBM-wFE is proposed. The evaluation is conducted in phase seven, which compares our contributions with existing models. Finally, the contribution of this study is presented in phase eight, the last stage of the research methodology.

3.2. Prediction System
For this project, the complete prediction system was developed in Java using the Java library of the Waikato Environment for Knowledge Analysis (WEKA) for ML. WEKA provides a collection of ML algorithms for pre-processing, feature selection, and classification [32], [33]. We wrote the system from scratch to run the experiments and evaluate the proposed approach, including feature engineering. The overall look of the prediction system is shown in Figure 2.

Figure 2: Overall look of the prediction system

NetBeans was used as the integrated development environment. An external library was used because WEKA did not support a few of the algorithms.

3.3. Dataset Collection
The datasets were obtained from NASDAQ and the S&P 500 index. In general, 25 years of historical data were downloaded for CMCSA, CSCO, AAPL, SBUX, LRCX, MCHP, MSFT, NTAP, QCOM, and SWKS; from the S&P 500, GSPC was chosen, which is the historical data of the top 500 companies for the last 25 years. The data covers January 1995 to January 2020. In total, we extracted 3270 monthly records for monthly prediction. Of the dataset, 65% is used for training and 35% for testing. The original downloaded data is daily, and each dataset has around 6294 records of historical data.
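As a rough illustration of the 65%/35% division described above, the following Python sketch splits one dataset's monthly records into training and testing parts. It is not the authors' Java/WEKA code: the function name `split_records`, the choice of a chronological (unshuffled) split, and the figure of 300 monthly records (25 years of 12 months) are assumptions made only for this example.

```python
def split_records(records, train_fraction=0.65):
    """Split records into (train, test) without shuffling.

    Assumption: the records are already in chronological order and the
    earliest part is used for training; the paper does not state this.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = round(len(records) * train_fraction)
    return records[:cut], records[cut:]


# Hypothetical example: 300 monthly records (25 years x 12 months).
monthly_records = list(range(300))
train, test = split_records(monthly_records)
print(len(train), len(test))
```

Keeping the split unshuffled is one common choice for time-ordered data, because it prevents later months from leaking into the training part; a random split would be a different design decision.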
Stock market data can be organized by duration (daily, weekly, monthly, quarterly, yearly):
• Daily data covers a single day.
• Weekly data covers 5 days per week, i.e. every day except Saturdays and Sundays.
• Monthly data with index, i.e. every month, with an index in December.
• Quarterly data (combined months), i.e. 4 issues per annum.
• Yearly data covers one year, e.g. 2020.

In total, each dataset has six attributes:
1. Date: the date of the stock movement.
2. Close price: the closing price of the stock.
3. Volume: the number of shares exchanged in a day.
4. Open price: the opening price of the stock.
5. High price: the highest price during a given day.
6. Low price: the lowest price during a given day.

A sample of the downloaded data is shown in Table 1.

Table 1: Sample of downloaded data
date        close   volume       open    high    low
2/25/2019   50.73   19,857,750   51.09   51.27   50.62
2/22/2019   51.17   28,022,740   50.73   51.17   50.64
2/21/2019   50.72   35,613,260   50.63   50.86   50.34
2/27/2009   4.91    27772220     4.88    5.08    4.88
2/26/2009   5.10    25730550     5.57    5.57    5.03
2/25/2009   5.35    20824750     5.38    5.51    5.14

3.4. Pre-Processing
Pre-processing is widely regarded as a vital step in ML and data mining. In this study, we therefore suggest a new method for pre-processing the data collected from Nasdaq. Each month is assigned a class according to its stock movement, so that predicted and actual monthly changes can be compared. The monthly stock movement is the difference between the month's closing and opening prices:

Difference = close price (last date of the month) - open price (first date of the month)

The stock price movement as a percentage (%) is calculated as follows:

Percentage_Difference = (Difference / open price (first date of the month)) × 100

For assigning the class in the multiclass case:
If Percentage_Difference > 1, the class is positive;
If Percentage_Difference < -1, the class is negative;
Otherwise, the class is neutral.

The attributes of the output dataset are described below:
1. Month (based on the date)
2. Close (close price on the month-end date)
3. Volume (sum of the daily volumes over the whole month)
4. Open (open price on the first date of the month)
5. High (the highest price of the month)
6. Low (the lowest price of the month)
7. Generated classes (multiclass).

A monthly multiclass dataset is generated for each of the ten companies, giving 20 output datasets in total. Table 2 shows the multiclass dataset of MSFT, “MSFT3.csv”.

Table 2: Generating the multiclass classification for the MSFT dataset.
M.   Close    V          Open     High      Low       Result
12   101.57   9.38E+08   113      113.42    93.96     negative
11   110.89   7.17E+08   107.05   112.24    99.3528   positive
10   106.81   9.2E+08    114.75   116.18    100.11    negative
9    114.37   4.7E+08    110.85   115.29    107.23    positive
8    112.33   4.54E+08   106.03   112.777   104.84    positive
7    106.08   5.6E+08    98.1     111.15    98        positive
6    98.61    5.96E+08   99.28    102.69    97.26     neutral
5    98.84    5.06E+08   93.21    99.99     92.45     positive
4    93.52    6.64E+08   90.47    97.9      87.51     positive
3    91.27    7.45E+08   93.99    97.24     87.08     negative
2    93.77    7.21E+08   94.79    96.07     83.83     negative
1    95.01    5.68E+08   86.125   95.45     85.5      positive

To sum up, the data after pre-processing totals 3270 monthly rows from 25 years, of which 65% are used for training and the other 35% for testing.

3.5. Feature Engineering Implementation
As mentioned earlier, the study aims to investigate and add new features to improve the classification accuracy. Two new features were added to the dataset to study their impact on accuracy. The first new feature is High_Low_Difference (HL_Diff), defined as the difference between the month’s high and low prices. The second is the mean of the daily differences between the close and open prices.
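The monthly pre-processing of Section 3.4 and the two engineered features just described can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' Java/WEKA implementation: the function names, the dictionary keys, and the three daily rows are hypothetical, and the percentage change is assumed to be scaled by 100 so that the ±1 thresholds are in percent, which is consistent with the MSFT rows in Table 2.

```python
from statistics import mean


def label_month(open_first, close_last, threshold_pct=1.0):
    """Assign the multiclass label from the monthly percentage change."""
    pct_change = (close_last - open_first) / open_first * 100.0
    if pct_change > threshold_pct:
        return "positive"
    if pct_change < -threshold_pct:
        return "negative"
    return "neutral"


def aggregate_month(daily_rows):
    """Collapse one month of daily rows (oldest first) into a monthly record.

    Each row is a dict with the keys open, high, low, close, volume.
    """
    first, last = daily_rows[0], daily_rows[-1]
    high_max = max(row["high"] for row in daily_rows)
    low_min = min(row["low"] for row in daily_rows)
    return {
        "open": first["open"],
        "close": last["close"],
        "volume": sum(row["volume"] for row in daily_rows),
        "high": high_max,
        "low": low_min,
        # HL_Diff: highest high minus lowest low of the month.
        "hl_diff": high_max - low_min,
        # Mean: average of the daily (close - open) differences.
        "mean": mean(row["close"] - row["open"] for row in daily_rows),
        "label": label_month(first["open"], last["close"]),
    }


# Hypothetical three-day "month", used only to exercise the functions.
days = [
    {"open": 100.0, "high": 103.0, "low": 99.0, "close": 102.0, "volume": 1000},
    {"open": 102.0, "high": 105.0, "low": 101.0, "close": 104.0, "volume": 1500},
    {"open": 104.0, "high": 104.5, "low": 100.0, "close": 101.5, "volume": 1200},
]
month = aggregate_month(days)
print(month["label"], month["hl_diff"], month["mean"])
```

Applied to the open and close values in Table 2, `label_month` gives the same classes as the table, for example negative for month 12 (open 113, close 101.57) and neutral for month 6 (open 99.28, close 98.61).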
The following equations were used to produce the new features:

$$\mathrm{HL\_Diff} = \mathrm{High}_{\max} - \mathrm{Low}_{\min} \qquad (1)$$

where $\mathrm{High}_{\max}$ is the maximum high price of the month and $\mathrm{Low}_{\min}$ is the minimum low price of the month.

$$\mathrm{Mean} = \frac{\sum (\mathrm{close} - \mathrm{open})}{\mathrm{total\ days}} \qquad (2)$$

where the sum runs over the daily differences between the close and open prices, and total days is the number of days in the duration.

As equation (1) shows, HL_Diff is the difference between the highest high and the lowest low of the month. It captures the maximum price movement over the entire month. The mean is the average of all daily differences between close and open prices and shows the average movement in price. The system automatically generates a new CSV file named “MSFT _3F.csv”. We programmed the system to create files both with and without feature engineering so that the results can be compared later. Table 3 shows the newly constructed features after feature engineering.

Table 3: Features added dataset of MSFT for multiclass
Month   Close    Open     High      Low       HL_Diff   Mean       Res
12      101.57   113      113.42    93.96     19.46     -0.58211   negative
11      110.89   107.05   112.24    99.3528   12.8872   0.098095   positive
10      106.81   114.75   116.18    100.11    16.07     -0.62174   negative
9       114.37   110.85   115.29    107.23    8.06      0.115263   positive
8       112.33   106.03   112.777   104.84    7.937     0.219783   positive
7       106.08   98.1     111.15    98        13.15     -0.03738   positive
6       98.61    99.28    102.69    97.26     5.43      -0.19271   neutral
5       98.84    93.21    99.99     92.45     7.54      0.2677     positive
4       93.52    90.47    97.9      87.51     10.39     -0.22314   positive
3       91.27    93.99    97.24     87.08     10.16     -0.2769    negative
2       93.77    94.79    96.07     83.83     12.24     0.003684   negative
1       95.01    86.125   95.45     85.5      9.95      0.117619   positive

3.6. Implementation of Feature Selection Algorithm
To achieve one of the aims of this study, multiple feature selection algorithms were applied to find the best one for a multiclass classification approach.
It is worth mentioning that WEKA’s default configuration was used for all algorithms; no parameter tuning was performed, since it was outside the scope of this study. The following algorithms were considered:
1. “Sequential Feature Selection (Best First) Search and CFS Subset Evaluation” (SEQ)
2. “Genetic Search and CFS Subset” (GEN)
3. “Ranker Search and Chi-Squared” (CHI)
4. “Ranker Search and Recursive Feature Elimination” (REF)
5. “Ranker Search and Correlation Coefficient” (CC)
6. “Ranker Search and Info Gain Evaluation” (IG)
7. “Ranker Search and ReliefF and its Variant Evaluation” (RV)
8. “Ranker Search and Principal Components Analysis Evaluation” (PCA)

To identify the best feature selection algorithm, we propose the approach whose flowchart is shown in Figure 3. The developed prediction system runs intensive experiments for all feature selection algorithms and reports the best-performing one, which is then used in the overall approach proposed in this study.

3.7. Implementation of Ensemble Classifier Techniques
As classifiers, we implemented seven ensemble learning algorithms, all used with their default configuration in the WEKA application programming interface library; that is, we did not tune the parameters, base learners, or other settings. The following algorithms were chosen for this study:

1. Bagging Classifier (BAG)
Bagging bags a classifier to reduce variance. Predictions are produced by averaging probability estimates, not by voting. One parameter is the size of the bags as a percentage of the training dataset. Another parameter is whether to compute the out-of-bag error, which gives the average error of the ensemble members [33], [34].

2. Stacking Classifier (SC)
In SC, classifiers are combined using stacking for classification and regression problems.
The base classifiers, the meta-learner, and the number of cross-validation folds are specified.

3. Voting Ensemble Classifier (VE)
VE provides a baseline approach for combining classifiers. The default scheme averages their probability estimates for classification or their numeric predictions for regression. Other combination schemes are also available, for example, majority voting for classification.

4. AdaBoost Classifier (ADA)
ADA applies classic boosting. It can be sped up by specifying a threshold for weight pruning. ADA resamples if the base classifier cannot handle weighted instances (resampling can also be forced) [33].

Figure 3: Finding the best feature selection algorithm flowchart

5. Gradient Boosting Machine (GBM)
Gradient Boosting Machine [35] (also referred to as gradient boosted models) sequentially fits new models to provide a more accurate estimate of a response variable in supervised learning tasks such as regression and classification. GBM is an ensemble of either regression or classification tree models. Both are forward-learning ensemble methods that obtain predictive results through gradually improved estimations. LogitBoost in WEKA provides similar results to GBM.

6. Multi-Boosting Classifier (MB)
Multi-Boosting is an ensemble strategy for improving the results of single classifiers and is an extension of AdaBoost [36]. The input vectors are weighted, and some of them have a higher chance of contributing to the new sets. Two types of weights are defined in the boosting strategy: the first adjusts the contribution of the data points (Bi), and the second governs the integration of the single classifiers.

7. Random Forest (Random Subspace) (RF)
RF is a homogeneous ensemble prediction approach that combines multiple decision trees as base learners.
It is a bagging-based ensemble that builds multiple decision trees and passes a subset of the data to each of these base learners for training. The combiner provides the final result. Random Forest adds stability to the model and thus yields a robust classifier. It tends to mitigate the over-fitting issue of decision trees [37].

8. Finding the Best Classifier (finding the best of the seven above)
Another goal of the study was to identify the top-performing ensemble. The developed system runs an exhaustive comparison among the implemented algorithms to select the best-performing one. The flowchart in Figure 4 shows the proposed approach to this aim.

Figure 4: Find the ensemble classifier flowchart

3.8. Predicted Class (CSV File)
The developed system creates a file called “ClassPredict.csv”, which contains the actual and predicted classes, so the class in the input dataset can easily be compared with the class in the output dataset. The testing dataset has 29 monthly records, so the file contains 29 actual and predicted records. As an example, the predicted classes for the 29 monthly records of the CMCSA testing dataset are shown in Table 4.

Table 4: Input file dataset and predicted class file
Original Data     CMCSA             Prediction Data
Open    High      Open    High      Open    High
36.71   38.73     36.71   38.73     36.71   38.73
33.49   37.42     33.49   37.42     33.49   37.42
39.09   39.29     39.09   39.29     39.09   39.29

3.9. Evaluation Method
For a comparative study of supervised learning algorithms for stock market prediction, we used the default evaluation methods of the WEKA library [33]. To evaluate our work and benchmark the proposed approach, we used several evaluation metrics: CA, Precision (PR), Recall (RE), F-Score (F1S), Mean Absolute Percentage Error (MAPE), Kappa Statistics (KAPPA), and Root Mean Squared Error (RMSE), along with other methods available in WEKA.
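Several of the metrics listed above can be derived from a confusion matrix. The Python sketch below shows the standard definitions of CA, per-class precision, recall and F-score, and Cohen's kappa. It is only an illustration and not WEKA's implementation (WEKA, for example, also reports class-weighted averages); the function names and the six example labels are hypothetical.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def evaluate(matrix):
    """Return (accuracy, kappa, per_class) for a square confusion matrix.

    per_class is a list of (precision, recall, f1) tuples, one per class.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    row_sums = [sum(matrix[i]) for i in range(n)]
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]

    accuracy = correct / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(row_sums[i] * col_sums[i] for i in range(n)) / (total * total)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 1.0

    per_class = []
    for i in range(n):
        precision = matrix[i][i] / col_sums[i] if col_sums[i] else 0.0
        recall = matrix[i][i] / row_sums[i] if row_sums[i] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1))
    return accuracy, kappa, per_class


# Hypothetical actual and predicted classes for six test months.
labels = ["negative", "neutral", "positive"]
actual = ["positive", "positive", "negative", "neutral", "negative", "positive"]
predicted = ["positive", "negative", "negative", "neutral", "negative", "positive"]

cm = confusion_matrix(actual, predicted, labels)
accuracy, kappa, per_class = evaluate(cm)
print(cm, accuracy, kappa)
```

Kappa is included because it corrects the raw accuracy for agreement expected by chance, which matters when the three classes are imbalanced.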
For reliable testing and results, we divided our data into training and testing sets: approximately 65% is used for training and 35% for testing. WEKA provides various measures for evaluating classifiers, such as Training Time (TT), CA, Correctly Classified Instances (CCI), Incorrectly Classified Instances (ICI), Kappa, MAE, RMSE, Relative Absolute Error (RAE), Root Relative Squared Error (RRSE), PR, RE, and F1S, and it also provides the Confusion Matrix (CM). Since we use the default WEKA evaluation, we do not reproduce the equations and mathematical details here, as that would be repetitive. Details of all the evaluation methods can be found in [33], [38].

4. RESULTS AND DISCUSSION
4.1. Results
We conducted various experiments on the datasets of all 11 companies. All classification methods and FS algorithms were run on each dataset, and a separate graph was generated for each experiment. To evaluate the effect of feature engineering, that is, of adding new features to the dataset, each graph displays results for two datasets: the dataset with added features, drawn as a solid line, and the dataset without added features, drawn as a dotted line. An example of the generated graph is shown in Figure 5.

Figure 5: Performance comparison graph

The example performance comparison graph in Figure 5 contains the following:
• The X-axis shows the algorithm abbreviation for all seven classifiers.
• The Y-axis shows the prediction accuracy in percent.
• There are multiple lines: solid lines are with FE (9F) and dotted lines are without FE (7F).
• Nine line colors are used, with a different color for each FS method.
• PCA (6F) means that six features were chosen, with F standing for “with feature”.

Figure 6 illustrates the overall classification prediction on the LRCX company dataset.
As can be seen, in the majority of cases PCA outperforms all the other feature selection algorithms, and with a few classifiers an accuracy of 96.55% is achieved. Furthermore, some classifiers, for example ADA and RF, work better than others when PCA is used. Conversely, classifiers such as VOT (voting) and STA (stacking) perform poorly, and in some cases an accuracy below 65% is obtained. Comparing results with and without feature engineering shows that the accuracy improves in several experiments, while it sometimes decreases. For instance, with GEN feature selection, feature engineering improves the classification accuracy: the BAG algorithm reaches approximately 50% accuracy, whereas less than 50% is achieved without feature engineering.

Figure 6: Classification result for LRCX company dataset

Furthermore, the overall CA result for the SWKS company dataset is shown in Figure 7. As with the LRCX company, PCA with the majority of classifiers outperforms the others: for classifiers such as BAG, ADA, GBM, and RF, an accuracy of 100% is achieved. Conversely, a few algorithms, such as STA and VOT, have low accuracies. Moreover, feature engineering contributes significantly to improving the classification accuracy. For instance, RF with GEN achieves approximately 67% accuracy with feature engineering, whereas around 50% accuracy is achieved without it. To sum up, on the SWKS company dataset PCA is again the best feature selection algorithm, the Genetic Algorithm comes second, and feature engineering improves the overall classification accuracy.
Figure 7: Classification result for SWKS dataset

The overall prediction results for the MCHP, MSFT, and NTAP datasets can be seen in Figures 8, 9, and 10, respectively.

Figure 8: Classification result for MCHP dataset
Figure 9: Classification result for MSFT dataset
Figure 10: Classification result for NTAP dataset

4.1.1. GBM-wFE Prediction Results
Table 5 reports the overall results of the proposed approach (GBM-wFE) on various evaluation metrics available in WEKA. The performance of the proposed approach is outstanding: the average CA is 99.28% over all datasets. More precisely, our approach achieved 100% CA on AAPL, SBUX, MCHP, LRCX, MSFT, NTAP, QCOM, and GSPC. On CMCSA and SWKS a CA of 99.03% is achieved, and on CSCO a CA of 94.05%. The lowest CA is thus observed on the CSCO dataset, which could be due to its smaller number of records compared with the other datasets. We also calculated the F-Measure, precision, MAPE, RMSE, KAPPA statistic, and recall so that they can be used for benchmarking and comparison with previous studies. Our approach achieved an average MAPE of 0.19% and an F-Measure of 0.99 out of one, which indicates that the model compares favorably with existing studies. In the next section, we benchmark GBM-wFE against the literature.

Table 5: GBM-wFE prediction result.
Algorithm  Index   Stock    CA%     MAPE  RMSE  Kappa, F-measure, precision, recall (column order not given)
GBM-wFE    Nasdaq  CMCSA     99.03  0.53  0.06  0.98  0.98  0.98  0.99
                   AAPL     100.00  0.00  0.00  1.00  1.00  1.00  1.00
                   CSCO      94.05  0.48  0.20  0.89  0.90  0.94  0.94
                   SBUX     100.00  0.00  0.00  1.00  1.00  1.00  1.00
                   LRCX     100.00  0.07  0.00  1.00  1.00  1.00  1.00
                   MCHP     100.00  0.34  0.04  1.00  1.00  1.00  1.00
                   MSFT     100.00  0.01  0.00  1.00  1.00  1.00  1.00
                   NTAP     100.00  0.07  0.00  1.00  1.00  1.00  1.00
                   QCOM     100.00  0.00  0.00  1.00  1.00  1.00  1.00
                   SWKS      99.03  0.56  0.07  0.98  0.99  0.99  0.98
           S&P     GSPC     100.00  0.00  0.00  1.00  1.00  1.00  1.00
                   Average   99.28  0.19  0.03  0.99  0.99  0.99  0.99

4.2. Discussion

4.2.1. Best Feature Selection

To find the best feature selection algorithm, we ran intensive experiments on the CMCSA data using GBM with all eight feature selection algorithms considered in this study. Table 6 shows the performance of each feature selection algorithm using the CA evaluation metric.

Table 6: Finding the best feature selection algorithm using GBM.

N  Feature Selection Alg.  CA%
1  SEQ                     70.19
2  GEN                     70.19
3  CHI                     31.73
4  REF                     31.73
5  CC                      54.80
6  IG                      31.73
7  RV                      54.80
8  PCA                     99.03

As can be observed in table six, PCA outperforms all the other feature selection algorithms by achieving a CA of 99.03%, whereas the remaining algorithms reach CAs between 31.73% and 70.19%, well below the PCA accuracy. Based on these results, PCA is used to find the best classifier and in the remaining experiments.

4.2.2. The Best Ensemble Classifier

As mentioned earlier, another aim of this research was to discover the top-performing classifier. To achieve this goal, we conducted intensive experiments on the 11 datasets with FE, using PCA as the best FS algorithm together with the seven ensemble classifiers considered in this study. Table seven shows the multiclass classification results with feature engineering.
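The CA and Kappa values reported in these tables are both functions of the actual and predicted class labels. As a minimal sketch, written in Python rather than with the WEKA tooling used in this study, the two metrics could be computed as follows. The sketch assumes that CA is the percentage of correctly classified instances and that the Kappa statistic is Cohen's kappa; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def classification_accuracy(actual, predicted):
    """Percentage of instances whose predicted class equals the actual class."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)


def cohen_kappa(actual, predicted):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement: product of the marginal class frequencies, summed.
    expected = sum(
        actual_counts[c] * predicted_counts[c] for c in actual_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # Only one class present in both lists.
    return (observed - expected) / (1.0 - expected)


# Toy three-class example (positive / negative / neutral), illustration only.
actual = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
predicted = ["positive", "negative", "neutral", "positive", "neutral", "neutral"]

ca = classification_accuracy(actual, predicted)
kappa = cohen_kappa(actual, predicted)
print(f"CA = {ca:.2f}%, Kappa = {kappa:.2f}")
```

On the toy labels, five of six predictions are correct, giving a CA of about 83.33% and a Kappa of 0.75. Kappa is lower than the raw accuracy because it discounts the agreement that would be expected by chance.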
Table 7: Multiclass classification with feature engineering using PCA

Alg\DS  NASDAQ (ten datasets)                                                 S&P 500  Average
BAG     99.03  92.05  100    100    100    99.03  100    97.02  100    100    100      98.83
STA     49.03  54.45  45.19  47.11  52.88  52.88  58.65  60.39  55.76  49.03  52.88    52.57
VOT     49.03  54.45  45.19  47.11  52.88  52.88  58.65  60.39  55.76  49.03  52.88    52.57
ADA     99.03  93.06  100    96.15  100    99.03  100    97.02  100    99.03  100      98.48
GBM     99.03  94.05  100    100    100    100    100    100    100    99.03  100      99.28
MB      90.38  93.06  95.19  98.07  100    90.38  98.07  100    100    99.03  100      96.74
RF      100    93.06  100    100    100    99.03  100    98.01  100    100    100      99.10

As can be seen in table seven, the overall prediction approach was tested on all datasets using the ensemble classifiers considered in this study. GBM was found to outperform all other ensemble classifiers, achieving an average CA of 99.28% across all datasets. RF, BAG, and ADA were also very effective, with average accuracies of 99.10%, 98.83%, and 98.48%, respectively. Conversely, STA and VOT were the worst-performing ensembles, reaching an average CA of only 52.57%. Last but not least, MB ranked in the middle, with an average CA of 96.74%. Therefore, based on these results, GBM is chosen to develop the stock prediction approach, along with feature engineering and PCA as the feature selection method.

4.2.3. Feature Engineering Benchmarking with WEKA

To explore the contribution and significance of the proposed feature engineering approach, we conducted intensive experiments on several datasets. The CA on all datasets with all feature selection algorithms improved markedly. With feature engineering, the average CA for SEQ and GEN rises to 69.23%, whereas it is only 44.55% without feature engineering. Correspondingly, the average CA increases by 25.64% when CHI, RFE, CC, IG, and RV are considered.
Besides, with PCA the average CA is also enhanced by feature engineering, by 1.93%. Last but not least, on average, the CA without feature engineering is 55.12%, while it increases to 74.67% with feature engineering.

4.2.4. GBM-wFE Approach Benchmarking

This section compares the prediction performance of the GBM-wFE approach with benchmarks from the literature. Tables eight, nine, ten, and 11 show the comparison results using the MAPE, RMSE, CA, and Kappa statistic evaluation criteria. The proposed GBM-wFE was found to outperform the available benchmarks for stock prediction using ensemble methods. The researchers in [40] proposed AdaBoost-LSTM (Long Short-Term Memory) and AdaBoost combined with a few other algorithms, such as MLP, support vector regression (SVR), and ELM, for financial time series forecasting using the stock index. In table eight, we compare our proposed model with their results.

Table 8: Benchmark comparison using MAPE with [40]

Approach or Model             MAPE
Their Model   AdaBoost-MLP    1.023
              AdaBoost-SVR    0.841
              AdaBoost-ELM    0.782
              AdaBoost-LSTM   0.413
Our Model     GBM-wFE         0.19

As can be seen in table eight, our proposed ensemble method GBM-wFE outperforms all the other ensemble methods. GBM-wFE achieved a MAPE of 0.19%, while their best model, AdaBoost-LSTM, achieved a MAPE of 0.413%.

In another benchmarking comparison, table nine compares the proposed GBM-wFE model with the results of [41], whose authors used an ensemble of Recurrent Neural Network (RNN) with LSTM on historical data.

Table 9: Benchmark comparison using RMSE with [41]

Benchmark     Ensemble Model  Dataset    RMSE%
Their Model   RNN-LSTM        POWERGRID  0.410
                              SUBEX      0.413
                              INDBANK    0.201
                              GREENPLY   0.011
Our Model     GBM-wFE         SBUX       0.0
                              CMCSA      0.06
                              CSCO       0.04
                              MSFT       0

As can be observed from the above table, our proposed GBM-wFE ensemble outperforms the RNN-LSTM ensemble method.
The GBM-wFE ensemble achieved an RMSE of 0.0% on the SBUX dataset, whereas their best result was an RMSE of 0.011%. The remaining GBM-wFE results are also low, with an RMSE of at most 0.06%.

Furthermore, another study [42] proposed an ensemble approach for stock price prediction using historical data, based on the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) with a crow search based weighted voting classifier ensemble. The benchmarking and result comparison are shown in table ten using the S&P and NIFTY datasets. It is worth mentioning that the researchers tested various classifiers as the base learner for their ensemble; we choose their best accuracy results for comparison with our model.

Table 10: Result comparison with TOPSIS ensemble model

Benchmark     Ensemble Model  Stock    CA%
Their Models  TOPSIS-MV       S&P      82
              TOPSIS-WV                82.5
              TOPSIS-PSO-WV            84.5
              TOPSIS-DE-WV    NIFTY    81
              TOPSIS-CS-WV             84
Our Model     GBM-wFE         S&P 500  100
                              NASDAQ   99.21
                              Average  99.28

Our proposed ensemble is superior to the TOPSIS ensembles, whichever ensemble classifier they use. Table ten shows that the TOPSIS variants achieve CAs between 81% and 84.5% on the S&P and NIFTY stocks, whereas GBM-wFE achieved 100% on the S&P 500 and 99.21% on NASDAQ. GBM-wFE achieved an average CA of 99.28%; their average CA is not reported, but it can be calculated from table ten as around 82.8%. So, we can conclude that our approach exceeds the TOPSIS approach by 16.48 percentage points on average.

Finally, in table 11, we also cite a few other studies to benchmark against the proposed GBM-wFE, indicating each research paper and its results. Table 11 demonstrates the superiority of our proposed GBM-wFE over the approaches used in studies available in the literature.
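The average TOPSIS accuracy and the gap quoted in the TOPSIS comparison above can be re-derived from the five TOPSIS rows of Table 10. The short Python check below is only a worked calculation; the variable names are illustrative.

```python
from statistics import mean

# CA% values of the five TOPSIS variants as listed in Table 10.
topsis_ca = [82.0, 82.5, 84.5, 81.0, 84.0]
# Average CA% of GBM-wFE as reported in Table 10.
gbm_wfe_average_ca = 99.28

topsis_average_ca = mean(topsis_ca)
gap = gbm_wfe_average_ca - topsis_average_ca

print(f"TOPSIS average CA: {topsis_average_ca:.1f}%")
print(f"Gap to GBM-wFE: {gap:.2f} percentage points")
```

The calculation gives an average of 82.8% and a gap of 16.48 percentage points, matching the figures quoted in the text.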
Table 11: Result comparison using ensemble models in the literature

Benchmark         Ensemble Model  Stock    RMSE    Kappa
Literature Model  AdaBoost [43]   NSE      -       0.4263
                  Stacking [43]            -       0.5516
                  FFNN [44]       YF       0.0201  Not provided
Our Model         GBM-wFE         S&P 500  0.00    1
                                  NASDAQ   0.04    0.99

Table 11 illustrates the contribution and novelty of the proposed GBM-wFE ensemble model compared with previous studies. The researchers in [43] proposed AdaBoost and a stacking ensemble method to predict stock price movement and achieved a Kappa of 0.5516 as their best result, whereas our model achieved a Kappa of 0.99 on NASDAQ and 1 on the S&P 500. Furthermore, our proposed model achieved an RMSE of 0.00 and 0.04 for the S&P 500 and NASDAQ, respectively, whereas the ensemble model proposed in [44] had an RMSE of 0.02 in the best case.

4.2.5. Summary

Based on the benchmarking and comparison results in tables seven to 11, it can be concluded that this study has contributed significantly to stock market prediction by proposing a novel multiclass classification approach, GBM-wFE. The study showed that feature engineering can significantly improve the accuracy of the ensemble models considered and of the overall prediction model. It is worth mentioning that our study can be viewed as the first to consider feature engineering for multiclass classification of stock market data. Moreover, the proposed GBM-wFE outperformed the ensemble methods used in the literature on the MAPE, RMSE, CA, and Kappa statistics.

5. CONCLUSION

The study aimed to propose a novel feature engineering approach for multiclass classification in stock prediction. It explored the best feature selection algorithms currently available in the WEKA library. It also aimed to find the best ensemble learning algorithm.
Finally, it aimed to find the best combination of feature engineering, feature selection, and ensemble classifiers.

This study collected data on Nasdaq and S&P 500 index listed stocks for the last 25 years as the dataset. The dataset included various companies, such as CMCSA, CSCO, AAPL, SBUX, LRCX, MCHP, MSFT, NTAP, QCOM, and SWKS. The stock movement is predicted for each month. We implemented feature engineering to add two features to the dataset: (1) the mean value of the Open and Close price difference and (2) the High-Low difference. These engineered features improved performance, as shown by the increased accuracy in the results. The implementation uses Java with the WEKA interface to evaluate different feature selection algorithms and classifiers on the given datasets. For feature selection, various algorithms were applied: CFS Subset, Chi-Squared, Recursive Feature Elimination, Correlation Coefficient, Info Gain, ReliefF and its variant, PCA, Sequential Feature Selection (Best First), Genetic Search, and Ranker Search. For the ML techniques, different classifiers were applied to the datasets, such as Stacking, AdaBoost, GBM, Multi-Boosting, and Random Forest. We tested all the techniques using multiclass classification of stock market movement as positive, negative, or neutral. New features were added to the dataset, and the accuracy of the prediction was found to improve. The proposed GBM-wFE improved the results on average across all datasets and outperformed the available studies in the literature as well. Furthermore, we recommend PCA as the best feature selection technique and GBM as the best ensemble classifier.

Numerous directions for future work can be suggested. First, feature engineering can be extended by considering external factors such as gross domestic product (GDP) and by calculating and engineering more features to be added to the feature set.
Second, daily price movements can be used instead of monthly movements. Third, the proposed approach can be tested on a larger number of datasets, which could cover 50 years instead of 25 years of data. Finally, algorithms for feature engineering can be designed and proposed for addition to the WEKA library.

Acknowledgments

I would like to express my great appreciation to my supervisor, Prof. Dr. Soran, for his continuous support. Furthermore, I would like to thank my second supervisor, Prof. Hamido Fujita; without his support and contributions, I would not have reached this level. Moreover, appreciation must also go to Prof. Habib bin Harron for his friendly and fantastic support. I would also like to thank Sulaimani Polytechnic University for giving me this fabulous opportunity to study for a Ph.D. Lastly, I should not forget my lovely wife for her patience and support over the past three years.

REFERENCES

[1] E. F. Fama, “The Behavior of Stock-Market Prices,” J. Bus., vol. 38, no. 1, pp. 34–105, 1965.
[2] E. F. Fama, L. Fisher, M. C. Jensen, and R. Roll, “The Adjustment of Stock Prices to New Information,” Int. Econ. Rev. (Philadelphia)., vol. 10, no. 1, pp. 1–21, 1969.
[3] J. Bollen, H. Mao, and X.-J. Zeng, “Twitter mood predicts the stock market.”
[4] M. Ballings, D. Van den Poel, N. Hespeels, and R. Gryp, “Evaluating multiple classifiers for stock price direction prediction,” Expert Syst. Appl., vol. 42, no. 20, pp. 7046–7056, 2015.
[5] Y. Chen and Y. Hao, “A feature weighted support vector machine and K-nearest neighbor algorithm for stock market indices prediction,” Expert Syst. Appl., vol. 80, pp. 340–355, Sep. 2017.
[6] T. A., “Improvement on Classification Models of Multiple Classes through Effectual Processes,” Int. J. Adv. Comput. Sci. Appl., vol. 6, no. 7, 2015.
[7] E. Chong, C. Han, and F. C. Park, “Deep learning networks for stock market analysis and prediction: Methodology, data representations, and case studies,” Expert Syst. Appl., vol. 83, pp. 187–205, Oct. 2017.
[8] R. T. Farias Nazário, J. L. e Silva, V. A. Sobreiro, and H. Kimura, “A literature review of technical analysis on stock markets,” Q. Rev. Econ. Financ., vol. 66, pp. 115–126, 2017.
[9] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, “Learning from class-imbalanced data: Review of methods and applications,” Expert Syst. Appl., vol. 73, pp. 220–239, May 2017.
[10] L. Wang, Z. Wang, S. Zhao, and S. Tan, “Stock market trend prediction using dynamical Bayesian factor graph,” Expert Syst. Appl., vol. 42, no. 15, pp. 6267–6275, 2015.
[11] A. H. Moghaddam, M. H. Moghaddam, and M. Esfandyari, “Stock market index prediction using artificial neural network,” J. Econ. Financ. Adm. Sci., vol. 21, no. 41, pp. 89–93, 2016.
[12] A. Nayak, M. M. M. Pai, and R. M. Pai, “Prediction Models for Indian Stock Market,” Procedia Comput. Sci., vol. 89, pp. 441–449, 2016.
[13] B. Weng, M. A. Ahmed, and F. M. Megahed, “Stock market one-day ahead movement prediction using disparate data sources,” Expert Syst. Appl., vol. 79, pp. 153–163, Aug. 2017.
[14] Y. Zhao, J. Li, and L. Yu, “A deep learning ensemble approach for crude oil price forecasting,” Energy Econ., vol. 66, pp. 9–16, 2017.
[15] L. Zhou, Y. W. Si, and H. Fujita, “Predicting the listing statuses of Chinese-listed companies using decision trees combined with an improved filter feature selection method,” Knowledge-Based Syst., vol. 128, pp. 93–101, 2017.
[16] J. Sun, H. Fujita, P. Chen, and H. Li, “Dynamic financial distress prediction with concept drift based on time weighting combined with Adaboost support vector machine ensemble,” Knowledge-Based Syst., vol. 120, pp. 4–14, 2017.
[17] L. Khaidem, S. Saha, and S. R. Dey, “Predicting the direction of stock market prices using random forest,” vol. 00, no. 00, pp. 1–20, 2016.
[18] Z. Dong, “Dynamic Advisor-Based Ensemble (dynABE): Case study in stock trend prediction of critical metal companies,” 2019.
[19] S.-B. Chen, Y.-M. Zhang, C. H. Q. Ding, J. Zhang, and B. Luo, “Extended adaptive Lasso for multi-class and multi-label feature selection,” Knowledge-Based Syst., vol. 173, pp. 28–36, Jun. 2019.
[20] B. Weng, L. Lu, X. Wang, F. M. Megahed, and W. Martinez, “Predicting short-term stock prices using ensemble methods and online data sources,” Expert Syst. Appl., vol. 112, pp. 258–273, 2018.
[21] U. Khurana, D. Turaga, H. Samulowitz, and S. Parthasrathy, “Cognito: Automated feature engineering for supervised learning,” in 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), 2016, pp. 1304–1307.
[22] W. Long, Z. Lu, and L. Cui, “Deep learning-based feature engineering for stock price movement prediction,” Knowledge-Based Syst., vol. 164, pp. 163–173, 2019.
[23] P. S. Panigrahy, D. Santra, and P. Chattopadhyay, “Feature engineering in fault diagnosis of induction motor,” in 2017 3rd International Conference on Condition Assessment Techniques in Electrical Systems, CATCON 2017 - Proceedings, 2018, vol. 2018-January, pp. 306–310.
[24] Y. J. Liu, K. L. Lai, G. Dai, and M. M. F. Yuen, “A semantic feature model in concurrent engineering,” IEEE Trans. Autom. Sci. Eng., vol. 7, no. 3, pp. 659–665, Jul. 2010.
[25] R. Punmiya and S. Choe, “Energy theft detection using gradient boosting theft detector with feature engineering-based preprocessing,” IEEE Trans. Smart Grid, vol. 10, no. 2, pp. 2326–2329, Mar. 2019.
[26] J. Huang, X. Wang, S. Yong, and Y. Feng, “A feature engineering framework for short-term earthquake prediction based on AETA data,” in Proceedings of 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference, ITAIC 2019, 2019, pp. 563–566.
[27] Y. Sun and G. Yang, “Feature engineering for search advertising recognition,” in Proceedings of 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference, ITNEC 2019, 2019, pp. 1859–1864.
[28] R. M. Nabi et al., “Ultimate Prediction of Stock Market Price Movement,” J. Comput. Sci., vol. 15, no. 12, pp. 1795–1808, Dec. 2019.
[29] L. Zhou and H. Fujita, “Posterior probability based ensemble strategy using optimizing decision directed acyclic graph for multi-class classification,” Inf. Sci. (Ny)., vol. 400–401, pp. 142–156, 2017.
[30] L. Zhou, Q. Wang, and H. Fujita, “One versus one multi-class classification fusion using optimizing decision directed acyclic graph for predicting listing status of companies,” Inf. Fusion, vol. 36, pp. 80–89, 2017.
[31] J. Sun, H. Fujita, P. Chen, and H. Li, “Dynamic financial distress prediction with concept drift based on time weighting combined with Adaboost support vector machine ensemble,” Knowledge-Based Syst., vol. 120, pp. 4–14, 2017.
[32] H.-F. Yu et al., “Feature engineering and classifier ensemble for KDD cup 2010,” JMLR Work. Conf. Proc., pp. 1–12, 2010.
[33] E. Frank, M. A. Hall, and I. H. Witten, The WEKA Workbench. Online Appendix for “Data Mining: Practical Machine Learning Tools and Techniques.” 2016.
[34] L. Breiman, “Random Forests,” 2001.
[35] Y. Freund, R. E. Schapire, and others, “Experiments with a new boosting algorithm,” in ICML, 1996, vol. 96, pp. 148–156.
[36] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997.
[37] N. Landwehr, M. Hall, and E. Frank, “Logistic model trees,” Mach. Learn., vol. 59, no. 1–2, pp. 161–205, 2005.
[38] M. Swamynathan, Mastering Machine Learning with Python in Six Steps. 2017.
[39] S. J. Russell and P. Norvig, Artificial intelligence: a modern approach. Malaysia; Pearson Education Limited, 2016.
[40] S. Sun, Y. Wei, and S. Wang, “AdaBoost-LSTM Ensemble Learning for Financial Time Series Forecasting,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018.
[41] M. S. Hegde, G. Krishna, and R. Srinath, “An Ensemble Stock Predictor and Recommender System,” in 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2018, pp. 1981–1985.
[42] R. Dash, S. Samal, R. Dash, and R. Rautray, “An integrated TOPSIS crow search based classifier ensemble: In application to stock index price movement prediction,” Appl. Soft Comput. J., vol. 85, p. 105784, Dec. 2019.
[43] S. A. Gyamerah, P. Ngare, and D. Ikpe, “On Stock Market Movement Prediction Via Stacking Ensemble Learning Method,” in CIFEr 2019 - IEEE Conference on Computational Intelligence for Financial Engineering and Economics, 2019, pp. 1–8.
[44] K. S. Gan, K. O. Chin, P. Anthony, and S. V. Chang, “Homogeneous ensemble feedforward neural network in CIMB stock price forecasting,” in Proceedings - 2018 IEEE International Conference on Artificial Intelligence in Engineering and Technology, IICAIET 2018, 2019, pp. 111–116.