Journal of Accounting and Investment                Vol. 24 No. 2, May 2023 

 
Article Type: Research Paper 
  

Ensemble learning with imbalanced data 
handling in the early detection of capital 
markets 
 
Putri Auliana Rifqi Mukhlashin1*, Anwar Fitrianto1, Agus M Soleh1, and 
Wan Zuki Azman Wan Muhamad2 

 
Abstract 
Research aims: This study aims to create an early detection model to predict 
events in the Indonesian capital market. 
Design/Methodology/Approach: This quantitative study compared ensemble
learning models combined with imbalanced data handling for the early detection
of capital market events. Five ensemble learning models (Random Forest,
ExtraTrees, CatBoost, XGBoost, and LightGBM) were used to detect early events in
the Indonesian capital market, together with imbalanced data handling methods:
under sampling (RUS), oversampling (SMOTE, SMOTE-Borderline, ADASYN),
over-under sampling (SMOTE-Tomek, SMOTE-ENN), and weighting (class weight).
The predictors covered global and regional stock markets, commodities, exchange
rates, technical indicators, sectoral indices, JCI leaders, MSCI, net buys of
foreign stocks, national securities, and national share ownership. The binary
response was the event of the lowest Crisis Management Protocol (CMP) return.
Research findings: Hyperparameters and thresholds were tuned to produce the
optimum model, and the best model was the one with the highest G-Mean.
ExtraTrees with SMOTE-ENN handling best predicted one-day events, with a
G-Mean of 96.68%. LightGBM with SMOTE handling best predicted five-day events
with an 89.21% G-Mean. With a G-Mean of 89.49%, CatBoost with SMOTE-Border
handling was the best for 15-day events. In addition, LightGBM with SMOTE-Tomek
handling and a 68.02% G-Mean was best for 30-day events. Further, performance
evaluation scores decreased as the prediction horizon increased.
Theoretical contribution/Originality: This work applies a wider range of
imbalance handling methods, combined with ensemble learning, to early detection
in the capital market.
Practitioner/Policy implication: Capital markets can indicate economic stability. 
Maintaining capital market efficacy and economic value requires a system to 
detect pressure. 
Research limitation/Implication: This study used ensemble learning models to 
predict capital market events 1, 5, 15, and 30 days ahead, assuming Indonesian 
working days. The model's forecast results are expected to be utilized to monitor 
the capital market and take precautions. 
Keywords: Capital Market; Early Detection; Ensemble Learning; Imbalance Class; 
Risk Event 

 
Introduction 
 
Imbalanced data can be handled in various ways, including under
sampling and oversampling. While under sampling is done by reducing the

 
AFFILIATION: 
1 Department of Statistics, Faculty 
of Mathematics and Science, IPB 
University, West Java, Indonesia 
 
2 Institute of Engineering 
Mathematics, Universiti Malaysia 
Perlis, Arau, Malaysia 
 
*CORRESPONDENCE:  
putriaulianarifqi@gmail.com 
 
DOI: 10.18196/jai.v24i2.17970 
 
CITATION: 
Mukhlashin, P. A. R., Fitrianto, A., 
Soleh, A. M., & Muhamad, W. Z. A. 
W. (2023). Ensemble learning with 
imbalanced data handling in the 
early detection of capital markets. 
Journal of Accounting and 
Investment, 24(2), 600-617. 
 
ARTICLE HISTORY 
Received: 
20 Feb 2023 
Revised: 
17 Mar 2023 
Accepted: 
19 Mar 2023 
 

This work is licensed under a Creative 
Commons Attribution-Non-Commercial-
No Derivatives 4.0 International License 
 

JAI Website: 

 
https://journal.umy.ac.id/index.php/ai/article/view/17970


Mukhlashin, Fitrianto, Soleh, & Muhamad 
Ensemble learning with imbalanced data handling … 

 
Journal of Accounting and Investment, 2023 | 601 

majority class randomly (Wang & Liu, 2021), oversampling is accomplished by randomly 
duplicating the minority class (Putri & Dhini, 2019). 
 
Oversampling approaches include SMOTE, SMOTE-Borderline, and ADASYN. The
Synthetic Minority Over-Sampling Technique (SMOTE) creates new minority class
samples by interpolating between a minority sample and its nearest minority class
neighbors. SMOTE-Borderline applies the same interpolation only to minority samples
that lie near the class boundary. Adaptive Synthetic Sampling (ADASYN) weights
minority instances by how difficult they are to learn and generates more synthetic
samples for the harder ones. Faris et al. (2020) showed that SMOTE improves
prediction performance in terms of the geometric mean (G-Mean) and type II errors.
However, both under sampling and oversampling techniques have limitations
(Rahardja et al., 2023): under sampling can remove important parts of the majority
class, while oversampling can cause overfitting. Sir and Soepranoto (2022) found that
SMOTE-ENN (Edited Nearest Neighbor), an over-under sampling approach, was the
best resampling technique. Indrawati (2021) also found that SMOTE-ENN improved
the accuracy of SVM by 2%–23%. Another handling approach is class weight, which
gives more weight to the minority class. This method handles class imbalance
efficiently (Asundi et al., n.d.).
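
The interpolation idea behind SMOTE can be illustrated with a minimal sketch. This is not the implementation used in the study; the function name `smote_sketch`, the toy points, and the choice of `k` are assumptions made only for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> target
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return synthetic


# Hypothetical two-feature minority samples
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class.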
 
Ensemble models combine multiple machine learning models to improve prediction
performance and accuracy (Mishraz et al., 2021). They are typically used in classification
problems and often outperform single models, such as logistic regression, SVM, and
neural networks, because the errors made by one model are not always made by the
others (Lutfiani et al., 2023). Examples of ensemble methods include bagging (bootstrap
aggregating) and boosting. These methods combine predictions from multiple
single models through voting or weighted voting (Mishraz et al., 2021).
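
Voting and weighted voting can be sketched as follows. The labels and weights are hypothetical, and the helper names are assumptions for illustration rather than part of any library.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of base models."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight across base models."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


# Three hypothetical base models voting on one observation
votes = ["Normal", "Pressure", "Normal"]
plain = majority_vote(votes)                      # two of three say "Normal"
weighted = weighted_vote(votes, [0.2, 0.7, 0.1])  # the heavier model says "Pressure"
```

The example shows how the two schemes can disagree: a single base model with a large weight can override a simple majority.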
 
Several models within the bagging method yield good evaluation results, such as
Extremely Randomized Trees (ExtraTrees) and Random Forest (RF). Bagging and boosting
models outperform other models in predicting bank failure, with the RF model having the
highest area under the ROC curve (AUC) at 93% (Liu et al., 2021). Thakkar and
Chaudhari (2021) found that RF produced the best results, with an accuracy of 90.4%.
Islam et al. (2019) reported that ExtraTrees generated comparatively better results,
with a ROC-AUC value of 94.2%. ExtraTrees also exhibited promising precision when
predicting more than one period ahead (Aini et al., 2023). Meanwhile, models that fall
under the boosting method include Categorical Boosting (CatBoost), Extreme Gradient
Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM). Aly et al. (2022),
using a four-period class imbalanced dataset, showed that CatBoost with SMOTE
oversampling predicted bankruptcy in Poland with high accuracy, with an AUC of more
than 90% in each period (Santoso et al., 2022). Carmona et al. (2019) revealed that
XGBoost provided the best result with an accuracy of 95%. Because it helps prevent
overfitting and produces predictions that generalize, XGBoost increases accuracy in bank
failure prediction. In that study, XGBoost outperformed both the conventional method
(Logistic Regression) and the modern machine learning approach (Random Forest). A
study by Wang et al. (2022) also demonstrated that LightGBM was better than Decision



Tree (DT), K-Nearest Neighbor, and RF in predicting bankruptcy with an F1-Score of 
87.63%. 
 
Meanwhile, the capital market is a platform that brings together parties that need
capital and investors. It can serve as an indicator of a country's economic
stability and needs to be monitored to provide optimal benefits. In this regard, early
warning systems can help reduce negative impacts on the capital market. Pressure on the
capital market can be caused by various factors, such as technical indicators,
macroeconomics, portfolio management, regional or global integration, monetary policy,
the global financial crisis, and behavioral economics factors such as public mood, news,
risk aversion, and consumer confidence (Hermawan et al., 2023). In machine learning
modeling, pressure events are rare occurrences, which gives rise to the class imbalance
problem. Imbalanced classification is where the majority class makes up a much larger
proportion of the data than the minority class. The degree of imbalance falls into three
categories: mild when the minority class proportion is 20% to 40%, moderate when it is
1% to 20%, and extreme when it is less than 1% of the total data available (Google
Developers, 2021).
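
The three categories can be written as a small helper. How the exact boundary values (1%, 20%, 40%) are assigned and the label for shares above 40% are assumptions, since the text does not specify them; the example shares are taken from figures reported later in the paper.

```python
def imbalance_degree(minority_share):
    """Classify the degree of imbalance from the minority-class share (0 to 1)."""
    if minority_share < 0.01:
        return "extreme"
    if minority_share < 0.20:
        return "moderate"
    if minority_share <= 0.40:
        return "mild"
    return "not imbalanced"  # assumed label; the text does not name this case


overall = imbalance_degree(44 / 2620)   # overall share of "Pressure" days
single_year = imbalance_degree(0.0041)  # a year with 0.41% "Pressure" days
```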
 
Regarding the Chinese capital market early warning system, the RF algorithm had the
highest accuracy for identifying bond market crises using under sampling and
double-sampling methods, with 96.08% and 90.2%, respectively (Zhang & Chen, 2022). In
developing an early warning system for predicting stock market crises in China based on
market indicators and mixed-frequency investor sentiment, the Artificial Neural Network
(ANN)-based model demonstrated more stable performance, with accuracy ranging from
98% to 99% (Pramono et al., 2022). For other models, such as Support Vector Machine
(SVM), Decision Tree (DT), Random Forest (RF), Gradient Boosting (GBDT), K-Nearest
Neighbor (KNN), and Logistic Regression (LR), the prediction accuracy varied greatly
across stock markets (Lu et al., 2021). On early warning, Wang and Liu (2021) showed
that a long short-term memory (LSTM) network produced satisfying performance, with a
test-set accuracy of 96.4% and an average of 2.8 days of forewarning. Cross-validation,
back-testing, and a reality check demonstrated the model's reliability and practical value
in real-time decision-making.
 
Research using machine learning models to address class imbalance in Early Warning
Systems (EWS), or early detection, for capital markets is still rare, especially in Indonesia.
This study, therefore, aims to use ensemble learning models with imbalanced data
handling for the early detection of events in the Indonesian capital market 1 day, 5 days,
15 days, and 30 days in advance, assuming working days in Indonesia are Monday
through Friday. This research is expected to contribute knowledge about handling
imbalance across multiple data periods with ensemble learning in the early detection of
capital market events in Indonesia or other similar matters. In practice, this research can
be used to monitor and detect events in the Indonesian capital market to assist with
decision-making (Putri et al., 2023).
 
 

Literature Review and Hypotheses Development 
 

The Early Warning System (EWS) for economic crises has been under development
for a long time. Much of the literature uses statistical models as the EWS. However, with
the advancement of technology, newer modeling approaches known as ensemble learning
have emerged (Bintoro et al., 2023). Many studies have investigated whether
these ensemble learning models can accurately predict crises (Candra et al., 2023).
 
Ensemble learning can reduce overfitting and bias because it combines multiple
machine learning models and determines the final prediction by majority voting. In the
case of the early warning system for bank failures, ensemble learning had the greatest
accuracy among the SMOTE-based methods for converting imbalanced data to
balanced data (Shrivastava et al., 2020). Gnip and Drotár (2019) compared several
ensemble machine learning methods applied to a recently acquired dataset of small and
medium-sized enterprises in the Slovak Republic. In certain instances, the highest
prediction performance of the proposed classification models, measured by
geometric mean, was nearly 100% (Hariguna et al., 2022).
 
Bluwstein et al. (2021), using macro-financial data from 17 countries from 1870
to 2016, applied machine learning models to the study of stock market crises and found
that one of the ensemble models, Extremely Randomized Trees (ExtraTrees),
produced the most accurate results. Tölö (2020), predicting systemic financial crises
one to five years ahead with crisis dates and annual macroeconomic series
for 17 countries over the period 1870-2016, found that machine learning models can be
significantly improved by using Long Short-Term Memory (RNN-LSTM) and Gated
Recurrent Unit (RNN-GRU) neural nets (Marlina et al., 2023). In another study, Coffinet
and Kien (2019) focused on detecting rare events in banking crisis cases, using data
collected from 32 European and non-European countries from 2010 to 2017 (Pratama &
Wijaya, 2023). The random forest method was found to be the best approach for
computing an indicator of the probability of a banking crisis (Zanubiya et al., 2023). In the
case of bank insolvencies, with data on non-failed, failed, and assisted entities from the
Federal Deposit Insurance Corporation (FDIC) database from 2008 to 2014, the Random
Forest model provided the best results by considering the flow of data and showed more
stable and consistent performance across all test samples (Petropoulos et al., 2020). The
Random Forest, a modern classification tree ensemble technique, also provided the best
results, with an accuracy of 90.4%, when global credit and real estate variables were
included as predictors (Thakkar & Chaudhari, 2021).
 
Furthermore, Carmona et al. (2019) aimed to anticipate bank failure in the
United States financial system (Mahardika & Irawan, 2022). Extreme Gradient Boosting
(XGBoost) was used for the empirical analysis (Sipahutar et al., 2020). This method
evolved from previous boosting methods such as AdaBoost and boosted classification
trees (Widiastuti et al., 2023). The XGBoost algorithm was able to increase the accuracy
of bank failure prediction (Kosasi et al., 2022). In addition, an early warning system
predicts banking crises by identifying excessive credit growth and aggregate leverage.
Junyu (2020) showed that the accuracy of XGBoost reached approximately 80%, providing a



novel method for predicting financial crises. Lu et al. (2021) also showed that, among
single classifiers, the combination of XGBoost and random under sampling outperforms
the random forest in predicting the probability of healthy enterprises. Through a
straightforward integration of the Random Forest and XGBoost, the probability of
financial crisis enterprises in the t-period is maintained at 92.86%, the misjudgment rate
of normal enterprises is reduced to approximately 12%, and the overall prediction
accuracy is enhanced (Tussa’diah & Kartika, 2023).
 
 
Research Method 
 

Data 
 
This study used data from the Financial Services Authority. The predictor variables
were all numeric and fell into several categories: Global and Regional Stock
Markets, Commodities and Exchange Rates, Technical Indicators, Sector Indices, IHSG
Leaders, Morgan Stanley Capital International (MSCI), Foreign Net Buy/Sell Stocks, and
Government Securities (SBN) and SBN ownership, for a total of 2424 variables. The
response variable was the event of the lowest Crisis Management Protocol
(CMP) return: if the value is less than -5%, the event is "Pressure," while if the value is
greater than or equal to -5%, it is "Normal" (Soesilo and Tinggi, 2021). The data covered
January 2010 to November 2020. A list of variables used can be found at
bit.ly/thesisvariables. This study then compared ensemble learning models with
imbalanced data handling for predicting events in the Indonesian capital market.
Ensemble learning creates a more accurate and robust strong model by combining
weak classifier models. Bagging and boosting are two ways to combine weak classifiers
into a strong one. The bagging method generates base classifiers in parallel, while the
boosting method generates them sequentially, so that each classifier influences the
later ones.
 
Analysis Process  
 
This study took the following steps: data preprocessing, feature selection, data division, 
handling class imbalance, ensemble learning modeling, model improvement, and 
evaluation. Each stage had subprocesses involved. The detailed flow is shown in Figure 1. 
 
Data Preprocessing 
This process includes data cleaning, Exploratory Data Analysis (EDA), and feature
engineering. Data were cleaned to remove or modify irrelevant, duplicate, and
unformatted data. In EDA and feature engineering, each predictor variable that is a daily
price index was converted into a return value $R_t$ using Equation (1), where $P_t$ is
the price at period $t$ and $P_{t-1}$ is the price at the previous period $t-1$.

$$R_t = \frac{P_t - P_{t-1}}{P_{t-1}} \tag{1}$$
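
As a worked illustration of the return formula and the -5% cutoff described in the Data section, the sketch below uses hypothetical prices. Applying the cutoff directly to a single return series is an assumption, since the exact construction of the lowest CMP return is not detailed here.

```python
def simple_returns(prices):
    """Return (P_t - P_{t-1}) / P_{t-1} for each consecutive pair of prices."""
    return [(cur - prev) / prev for prev, cur in zip(prices, prices[1:])]


def label_event(value, cutoff=-0.05):
    """Label a return as "Pressure" if it is below the cutoff, else "Normal"."""
    return "Pressure" if value < cutoff else "Normal"


prices = [100.0, 102.0, 96.0, 96.0]  # hypothetical daily index values
returns = simple_returns(prices)     # roughly +2.0%, -5.9%, 0.0%
labels = [label_event(r) for r in returns]
```

Only the second return falls below -5%, so only that day is labeled "Pressure".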




 
Figure 1 Flow Diagram 

 
In addition, other variables were added by creating moving average (MA) values. For
example, the ihsg_daily variable, the IHSG daily price index, was replaced
with its return value, ihsg_daily_return, and a moving average was added, such as
ihsg_daily_return_movavg_five, the five-day moving average of the daily
return. Moving averages MA(k) were added with k = 5, 10, 15, and 30.
Subsequently, data exploration was performed to identify the data patterns of each
predictor variable before and after a "Pressure" event. These patterns served as the basis
for converting the original variables into time lag variables, such as
ihsg_daily_return_lag_i with i = 1, 2, 3, 4, and 5 for predicting 1 day ahead. The
original variable was replaced with the time lag variables (see Table 1) to avoid
information leakage during machine learning modeling.
 
  

Table 1 Creating a time lag variable

Period   x       x lag 1   x lag 2   …   x lag i   flag
1        -2.86   NA        NA        …   NA        Normal
2        -1.72   -2.86     NA        …   NA        Pressure
3        0.56    -1.72     -2.86     …   NA        Normal

The x column is not used in the modeling process.
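
The lag construction in Table 1 and the moving average features can be sketched in plain Python. Using a trailing window for the moving average is an assumption, and `None` stands in for NA.

```python
def moving_average(values, k):
    """Trailing k-period mean; None until k observations are available."""
    out = []
    for i in range(len(values)):
        if i + 1 < k:
            out.append(None)
        else:
            window = values[i + 1 - k : i + 1]
            out.append(sum(window) / k)
    return out


def lag(values, i):
    """Shift a series down by i periods, padding the start with None (NA)."""
    return ([None] * i + values)[: len(values)]


x = [-2.86, -1.72, 0.56]   # the x column of Table 1
x_lag_1 = lag(x, 1)        # second column of Table 1
x_lag_2 = lag(x, 2)        # third column of Table 1
ma_2 = moving_average([1.0, 2.0, 3.0, 4.0], k=2)  # hypothetical series
```

Because each lagged column only contains values from earlier periods, a model trained on them never sees information from the day it is predicting.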
 
Feature Selection 
 
The large number of initial variables was the basis for the feature selection process.
Feature selection was performed using the Random Forest algorithm to remove weakly
influencing variables and improve accuracy and classification performance (Chen et al.,
2020). Feature selection was performed separately for each scenario: prediction 1 day,
5 days, 15 days, and 30 days ahead.
 
Data Splitting 
 
The data were split using the expanding window time series method (see
Figure 2), also called forward-chaining cross-validation (Vien et al., 2021). The
split was done annually, dividing the data into eight parts. The years 2014 and
2017 were not used as testing data because no "Pressure" events occurred in those years.
 

Figure 2 Expanding Window Time Series 
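
A minimal sketch of an annual expanding window split follows. It assumes that the first year is used for training only and that the skipped test years (2014 and 2017) still remain in later training windows; neither detail is stated explicitly in the text.

```python
def expanding_window_splits(years, first_test_index=1, skip_test=()):
    """Yield (train_years, test_year), training on all years before the test year."""
    for i in range(first_test_index, len(years)):
        if years[i] in skip_test:
            continue
        yield years[:i], years[i]


years = list(range(2010, 2021))  # 2010 through 2020
splits = list(expanding_window_splits(years, skip_test={2014, 2017}))
for train_years, test_year in splits:
    print(f"train {train_years[0]}-{train_years[-1]} -> test {test_year}")
```

Under these assumptions the procedure yields eight train/test pairs, which is consistent with the eight parts mentioned above.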
 
Handling Class Imbalance Issue 
 
Based on Table 2, the overall proportion of "Pressure" events is 0.017, or 44 days out of
2620, which can be categorized as an extreme to moderate imbalance (Google Developers,
2021). Techniques for handling imbalanced classes are divided into four categories:
under sampling (RUS), oversampling (ROS, SMOTE, SMOTE-Borderline, ADASYN),
over-under sampling (SMOTE-Tomek, SMOTE-ENN), and weighting (class weight).
 
Ensemble Learning Modeling 
 
The modeling was carried out on the training data using five models: ExtraTrees (Alfian et
al., 2022), RF (Speiser et al., 2019), CatBoost (Jabeur et al., 2021), XGBoost (Qiu et al.,
2021), and LightGBM (Sun et al., 2020). Each model was fitted with and without class
imbalance handling, and this was repeated for each prediction scenario.



Table 2 The proportion of pressure event labels based on the year 

Year Pressure Event 

2010 1.63% 
2011 4.45% 
2012 0.41% 
2013 3.20% 
2014 0.00% 
2015 0.81% 
2016 0.40% 
2017 0.00% 
2018 1.24% 
2019 0.41% 
2020 6.16% 

 
Hyperparameter Optimization and Threshold 
 
Hyperparameter optimization was performed using Optuna, which has been used
effectively with the XGBoost model (Srinivas & Katarya, 2022). A total of 30 trials were
performed to determine the best hyperparameter values for each model with imbalanced
class handling, such as the number of estimators (the number of decision trees)
and max depth (the maximum depth of each decision tree). Threshold tuning was
also performed using Optuna to obtain the optimal value based on the G-Mean.
 
Model Evaluation 
 
This research aimed to predict the "Pressure" event for prevention and the "Normal"
event for market capitalization monitoring. The evaluation was performed on the test
data using the G-Mean, the geometric mean of the true positive rate and the true
negative rate, which is high only when both are high and relatively balanced. The model
with the highest G-Mean was chosen as the best model.
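
The sketch below computes the G-Mean, assuming its standard definition as the square root of sensitivity times specificity, and picks a classification threshold by scanning a fixed grid. The grid scan is a simple stand-in for the Optuna-based tuning used in the study, and the labels and scores are hypothetical.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (TPR) and specificity (TNR); 1 = "Pressure"."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def best_threshold(y_true, scores, candidates):
    """Return the candidate threshold that gives the highest G-Mean."""
    return max(
        candidates,
        key=lambda th: g_mean(y_true, [int(s >= th) for s in scores]),
    )


y_true = [0, 0, 0, 0, 1, 1]                     # hypothetical labels
scores = [0.10, 0.20, 0.35, 0.60, 0.55, 0.90]   # hypothetical predicted probabilities
candidates = [i / 20 for i in range(1, 20)]     # 0.05, 0.10, ..., 0.95
threshold = best_threshold(y_true, scores, candidates)
score_at_best = g_mean(y_true, [int(s >= threshold) for s in scores])
```

A classifier that predicts only the majority class has a sensitivity of zero and therefore a G-Mean of zero, which is why this metric suits heavily imbalanced data better than plain accuracy.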
 
 
Results and Discussion 
 

Data Exploration 
 
This process examined the general pattern of the predictor variables around pressure
events. As Figure 3 shows, before a pressure event on the next day, the predictor
variables decreased on average over the previous 4 to 6 days. Before pressure events in
the next five days, the predictor variables decreased over the previous 10 to 15 days.
Before pressure events in the next 15 days, the predictor variables decreased over the
previous 22 to 25 days. For the next 30 days, however, no pattern in the predictor
variables differed significantly. Data exploration thus indicated that the predictor
variables could detect pressure events before they happened.



   
Figure 3 (a) Pattern of Global and Regional Market Indices: DJIA and KLSE; (b) Pattern of Commodity Prices: Brent and CPO; and (c) Pattern of MSCI: MXID and MXWD
 
Feature Selection 
 
Feature selection (see Figure 4) was performed to obtain the variables that most strongly
influenced the prediction of pressure events in the capital market. It is also expected to
improve the computational performance of the classification algorithm. Fifty predictor
variables were selected based on the feature importance of the Random Forest model.
 
For predicting pressure events 1 day ahead, the variables with a high impact
on capital market pressure were the IHSG, sectoral indices
(financial, manufacturing), leading stocks (LQ45, Kompas), and the MSCI index (MXID).
The top 50 variables, or 2.1% of all variables, covered 23.23% of the total feature
importance of the RF model. For the next 5 days, the mining sector, indices (PCOMP,
OMX), and exchange rates (YEN, YUAN) were influential, and the selected variables
covered 12.70%. For the next 15 days, foreign investors' SBN holdings, the
exchange rate (YEN), the mining and consumer goods sectors, global/regional stock index
returns (SZCOMP, IBEX, TOPIX), and commodity prices (BRENT) were influential, covering
10.77%. For the next 30 days, foreign investors' SBN holdings, exchange rates (YEN, USD),
leading stocks (Kompas), the IHSG agribusiness and finance sectors, and the MSCI index
(MXID) were influential, covering 16.03%.
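
The "share of importance covered" figures can be reproduced in principle by ranking the feature importances and summing the top k. The sketch below uses four hypothetical features rather than the study's 2424 variables, and the helper name is an assumption.

```python
def top_k_features(importances, k):
    """Return the k most important feature names and their share of total importance.

    importances: dict mapping feature name -> importance score.
    """
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    selected = ranked[:k]
    total = sum(importances.values())
    share = sum(value for _, value in selected) / total
    return [name for name, _ in selected], share


# Hypothetical importance scores
importances = {"f1": 0.40, "f2": 0.10, "f3": 0.30, "f4": 0.20}
names, share = top_k_features(importances, k=2)
```

In the study the same calculation would use k = 50, so a share of 23.23% means the 50 selected variables carry a little under a quarter of the total importance.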
 
Ensemble Model Results 
 
The results of prediction modeling for the next 1 day (see Table 3) showed that the RF
model with SMOTE-ENN handling produced a G-Mean of 0.9628 with a max depth
of 2 and 42 decision trees at a threshold of 0.5393, better than the other RF models.
ExtraTrees with SMOTE-ENN handling produced the best performance, with a G-Mean of
0.9668, a max depth of 4, and 82 decision trees at a threshold of 0.5141. The CatBoost
model with SMOTE-Border handling produced its highest G-Mean of 0.9533 at a max
depth of 4, with 710 decision trees and a threshold of 0.4656. Like CatBoost, the XGBoost
model performed best with SMOTE-Border handling: with a max depth of 4, 86 decision
trees, and an optimal threshold of 0.5852, it produced its highest G-Mean of 0.9324.
For LightGBM, SMOTE-ENN produced good performance, with a G-Mean of 0.9486 at an
optimal threshold of 0.5525, a max depth of 1, and 22 decision trees. In general, the
ExtraTrees model with SMOTE-ENN handling
was the best algorithm in predicting events in the capital market 1 day ahead with a higher



G-Mean value of around 2% - 5% compared to other models with the same handling and
1% - 3% compared to the best algorithms of the other models. Some models produced a
low G-Mean (<50%), i.e., without handling, RUS on the LightGBM model, ROS on the
CatBoost and XGBoost models, and class weight on the CatBoost, XGBoost, and LightGBM
models.
 

Figure 4 Feature Selection Results (a) Prediction for the Next 1-Day, (b) Prediction for the Next 5 Days, (c) Prediction for the Next 15 Days, and (d) Prediction for the Next 30 Days
 

Furthermore, for predicting events for the next 5 days (see Table 3), the RF model with
SMOTE-Border handling produced the best value among the RF models, with a G-Mean
of 0.8529 at a max depth of 4, 562 decision trees, and a threshold of 0.6384. The
ExtraTrees-SMOTE-Border algorithm produced its highest G-Mean of 0.8614 with a max
depth of 4 and 82 decision trees at a threshold of 0.5141. The CatBoost model with
SMOTE handling produced its highest G-Mean of 0.8543 at a max depth of 4, 541
decision trees, and a threshold of 0.6955. The XGBoost model with SMOTE-Tomek
handling produced its highest G-Mean of 0.8812 with the optimal hyperparameter
combination of a max depth of 4 and 881 decision trees at a threshold of 0.6312. The
LightGBM model with SMOTE handling



with a max depth of 4, 991 decision trees, and a threshold of 0.7590
produced the highest G-Mean value of 0.8921. Overall, for predictions in the
next 5 days, LightGBM with SMOTE handling produced the highest G-Mean value.
Compared to other models with SMOTE handling, the LightGBM-SMOTE algorithm
showed a G-Mean about 1% - 14% higher, and it was 1% - 4% higher than the best
algorithms of the other models. Some treatments produced a low G-Mean (0 - 50%):
models without handling produced a G-Mean of less than 30%, RUS and class weight
less than 40%, and ROS less than 50%.
 
The RF-SMOTE-ENN algorithm, with an optimal threshold of 0.5453, a max depth of 4, and
897 decision trees, was the best RF model for predicting events for the next 15
days, with a G-Mean of 0.7889 (see Table 3). The ExtraTrees model with SMOTE
handling produced its highest G-Mean of 0.7573 with a max depth of 4, 60 decision
trees, and a threshold of 0.5285. The CatBoost model with SMOTE-Border handling
produced the highest G-Mean value of 0.8949, optimal at a max depth of 4, 798 decision
trees, and a threshold of 0.3413. The XGBoost-SMOTE-Border algorithm gave its best
result, a G-Mean of 0.8336, after tuning with a threshold of 0.4006, a max depth of 2,
and 490 decision trees. In the LightGBM model, SMOTE-ENN handling produced the best
G-Mean of 0.8245 at the optimal max depth of 4, 895 trees, and a threshold of 0.7426.
In general, the CatBoost-SMOTE-Border model was the best for predicting capital market
events for the next 15 days. Its G-Mean was 6% - 28% higher than that of the other
SMOTE-Border models and 1% - 4% higher than the best algorithms of the other models.
Models that did not handle the imbalanced class or that used other handling methods
(class weight, RUS, ROS) produced low G-Mean values (<30%).
 
Predicting pressure events in the capital market earlier can help in taking more effective
preventive measures. However, the predicted results for the next 30 days (see Table 3)
were not as good as the predictions for the shorter periods. The RF model
with SMOTE-Tomek handling produced its highest G-Mean of 0.5968 at a
threshold of 0.6007, a max depth of 4, and 32 trees. The ExtraTrees-ADASYN
algorithm produced its highest G-Mean of 0.4162 at a threshold of 0.5255, a max
depth of 4, and 45 decision trees. In the CatBoost model, SMOTE-ENN handling
produced the highest G-Mean of 0.6468 at a max depth of 4, 463 decision trees,
and a threshold of 0.4795. The XGBoost model with SMOTE handling produced its
highest G-Mean of 0.6632 at a max depth of 4 with 646 decision trees. The LightGBM
model with SMOTE-Tomek handling, at a threshold of 0.6318 with a max depth of 3 and
224 decision trees, produced the highest G-Mean of 0.6802. Overall, LightGBM achieved a
G-Mean about 5% - 48% higher than the other models with SMOTE-Tomek handling and
2% - 26% higher than the best algorithms of the other models. Meanwhile, models with
class weight, RUS, and ROS handling produced G-Mean values below 50%.
  


Table 3 Evaluation results of prediction model (G-Mean value, as a proportion, by imbalanced class handling method)

Scenario        Model       None    RUS     ROS     ADASYN   SMOTE     SMOTE-Border  SMOTE-Tomek  SMOTE-ENN  Class Weight
1 day ahead     RF          0.6628  0.9051  0.9109  0.9078   0.891     0.9391        0.9026       0.9628a    0.6712
                ExtraTrees  0.7672  0.8518  0.912   0.9411   0.939     0.9551        0.9446       0.9668b*   0.7871
                CatBoost    0.7147  0.9132  0.4171  0.9182   0.8887    0.9533c       0.8846       0.9386     0.4578
                XGBoost     0.4252  0.7537  0.2729  0.9018   0.8938    0.9324d       0.8940       0.9158     0.2667
                LightGBM    0.1925  0.4514  0.7814  0.8942   0.901     0.9445        0.9047       0.9486e    0.3154
5 days ahead    RF          0.2108  0.3824  0.4026  0.8288   0.8124    0.8529a       0.8121       0.8137     0.3522
                ExtraTrees  0.2324  0.1939  0.2236  0.7555   0.7563    0.8614b       0.7543       0.7713     0.325
                CatBoost    0.1096  0.389   0.0961  0.8426   0.8543c   0.8314        0.8009       0.8282     0.0914
                XGBoost     0.2507  0.2213  0.1237  0.8762   0.8795    0.8677        0.8812d      0.8721     0
                LightGBM    0.0625  0.1537  0       0.8456   0.8921e*  0.8811        0.8872       0.8777     0
15 days ahead   RF          0.0375  0.2042  0       0.7583   0.7589    0.6155        0.767        0.7889a    0.2387
                ExtraTrees  0       0.2861  0.1832  0.7024   0.7573b   0.7545        0.7429       0.7279     0.2554
                CatBoost    0.0437  0.1803  0.0452  0.7929   0.822     0.8949c*      0.8128       0.8234     0.0968
                XGBoost     0.104   0.2627  0.0453  0.8017   0.7926    0.8336d       0.7979       0.8023     0
                LightGBM    0       0.2100  0       0.8034   0.8022    0.8228        0.8025       0.8245e    0
30 days ahead   RF          0.1462  0.1783  0       0.3929   0.4912    0.198         0.5968a      0.3657     0.4564
                ExtraTrees  0       0.205   0.1041  0.4162b  0.2546    0.2406        0.2046       0.3057     0.1227
                CatBoost    0.0422  0.4616  0.0388  0.4249   0.5919    0.589         0.5823       0.6484c    0.0985
                XGBoost     0.0707  0.3512  0.0766  0.6316   0.6632d   0.6237        0.6329       0.6263     0
                LightGBM    0.0371  0.2747  0.052   0.5215   0.6363    0.6583        0.6802e*     0.648      0

* The highest evaluation value among all models in the scenario (shown in bold in the original)
a The highest evaluation value for the Random Forest (RF) model
b The highest evaluation value for the ExtraTrees model
c The highest evaluation value for the CatBoost model
d The highest evaluation value for the XGBoost model
e The highest evaluation value for the LightGBM model

 
The best model for each scenario used a different imbalance handling method (see Figure 5).
Each best model not only had a high geometric mean but also scored well on the other
metrics: the F1 score, which indicates that the model correctly identified most of the
pressure events in the dataset while making accurate predictions; recall, the proportion
of actual pressure events that the model correctly identified as pressure events;
precision, which indicates that when the model predicted "pressure," the event usually
was "pressure"; and a high AUC-ROC. For the 1-day prediction, the ExtraTrees model with
SMOTE-ENN handling was the best, with all evaluation metrics greater than 95%. For the
5-day prediction, the LightGBM model with SMOTE handling had the highest rate of
"pressure" event capture (91.02%) and the strongest precision (92.75%). Wang et al. (2022)
likewise obtained their best model with LightGBM, with an average score of about 80% on
imbalanced data, in a similar setting where about 10 years of historical data were used
and cross-validation was applied to avoid overfitting. However, the hyperparameter tuning
procedure in their research employed a grid search, while this study used Optuna. For the
15-day prediction, the CatBoost model with SMOTE-Border handling was the best, with all
metrics still above 80%. Similar to a previous study (Aly et al., 2022), the best class
imbalance treatment employed SMOTE-based oversampling, and both feature selection and
parameter tuning were carried out to obtain the best model. However, the study by Aly et
al. (2022) did not employ cross-validation, while this study used expanding-window
time-series cross-validation. Then, for the 30-day prediction, the LightGBM model with
SMOTE-Tomek handling was the best, with all evaluation metrics greater than 60%. As can
be seen, model performance degrades as the prediction horizon lengthens. Generally, this
study aligns with the results of Indrawati (2021) and Shrivastava et al. (2020), who found
that SMOTE-based oversampling is an effective method for handling class imbalance.
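The expanding-window time-series cross-validation mentioned above can be sketched as
follows. This is only an illustration of the general idea (each fold trains on all
observations before the test block, so no future data leaks into training); the function
name, the number of folds, and the block size are assumptions, not the settings used in
this study.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs with a growing training window.

    The last n_splits * test_size observations are used as consecutive test
    blocks; each training set contains every observation before its test block.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Hypothetical example: 10 trading days, 3 folds, 2 test days per fold.
for train_idx, test_idx in expanding_window_splits(10, 3, 2):
    print(len(train_idx), test_idx)
```

In such a scheme, any resampling step (for example SMOTE) would normally be fitted on the
training indices only, so that synthetic observations never draw on the test block.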
 

Figure 5 Summary of the Best Model 

 
Feature Importance 

 
For the best 1-day algorithm, ExtraTrees-SMOTE-ENN, the variables at time lag 1 had the
most impact; that is, the prediction of a pressure event 1 day ahead was influenced by
the event 1 day before (see Figure 6). The IHSG, the industrial sector, trade,
manufacturing, and the MSCI Indonesia Index influenced the prediction of the 1-day-ahead
event. The LightGBM-SMOTE algorithm showed that the most influential variables for
predicting market events 5 days ahead were those lagged 5-14 days, namely indices and
foreign stock exchanges such as the return of China's yuan 8 days earlier and the
Philippine stock exchange, the Australian stock exchange, and the return of Thailand's
index 6 days earlier, together with SBN and the consumer goods industry sector. For the
best 15-day algorithm, CatBoost-SMOTE-Border, the most influential factors were those
lagged 16-25 days, including SBN on days 16 and 20, the USD index on day 22, Japan TOPIX
on day 25, the XAUD sector (the gold/silver sector), Brent oil on days 19 and 25, and the
trade, industry, utility, and transportation sectors, all with significant impact. For
the best 30-day algorithm, LightGBM-SMOTE-Tomek, the moving average of foreign net
buy/sell shares, the USD index, Shenzhen, the yuan, the moving average of MSCI, sectors
such as basic industry, agriculture, and manufacturing, Brent oil, SBN, and the CSI300
and S&P 500 indices 30 days before were important factors in predicting market events 30
days ahead.
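The lag structure discussed above can be made concrete with a small sketch of how lagged
predictors might be paired with a response observed some days ahead. The function, the
feature naming (lag counted back from the predicted day), and the toy values are
illustrative assumptions; the study's actual feature construction may differ.

```python
def make_lagged_rows(series, target, horizon, n_lags):
    """Pair the target at day t + horizon with the n_lags latest values up to day t.

    A feature named lag_j holds the value observed j days before the predicted
    day, so with horizon=5 and n_lags=10 the features are lag_5 ... lag_14.
    """
    rows = []
    for t in range(n_lags - 1, len(series) - horizon):
        features = {f"lag_{horizon + k}": series[t - k] for k in range(n_lags)}
        rows.append((features, target[t + horizon]))
    return rows


# Hypothetical daily index values and binary pressure-event labels.
index_values = [10, 11, 12, 13, 14, 15, 16, 17]
pressure = [0, 0, 0, 0, 0, 0, 1, 0]
rows = make_lagged_rows(index_values, pressure, horizon=2, n_lags=3)
for features, label in rows:
    print(features, label)
```

Under this naming, only information available at least `horizon` days before the
predicted day enters the feature set, which mirrors why the influential lags reported
above move further back as the prediction horizon grows.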
 


  
Figure 6. Feature Importance Prediction Model (a) 1 Day Ahead; (b) 5 Days Ahead;
(c) 15 Days Ahead; and (d) 30 Days Ahead

 
Conclusion 
 

This study used ensemble learning modeling with and without class imbalance handling to
detect events in the Indonesian capital market. The best algorithm differed for each
scenario. For the 1-day prediction, the ExtraTrees model with SMOTE-ENN handling had the
highest G-Mean value of 0.9668. For the 5-day prediction, the LightGBM algorithm with
SMOTE handling had a G-Mean value of 0.8921. For the 15-day prediction, the CatBoost
algorithm with SMOTE-Border handling had a G-Mean value of 0.8949. For the 30-day
prediction, the LightGBM algorithm with SMOTE-Tomek handling had a G-Mean value of
0.6802. In conclusion, the further ahead the prediction, the weaker the model's
performance. In this study, the effective methods for handling the class imbalance
problem in machine learning models were oversampling techniques (SMOTE, SMOTE-Border)
and over-under sampling (SMOTE-Tomek, SMOTE-ENN). On the other hand, random under
sampling (RUS), random oversampling (ROS), and class weight were less effective methods
for handling the class imbalance problem in machine learning models in this study.

 
This study is expected to contribute knowledge on how to handle imbalanced data with
ensemble learning in the early detection of Indonesian capital market events and in
similar machine learning applications, a topic that has rarely been studied and can be
explored further. It is also expected that this research can be used to monitor and
detect events in the Indonesian capital market to assist decision-making, because the
evaluation metrics show excellent performance in detecting events in the capital market.
In this study, ensemble learning was limited to bagging and boosting models, and there
was no comprehensive description of the ensemble learning model's role in predicting how
events would affect the capital market. This study also has limitations regarding the
newest data. Therefore, in future research, the best existing models can be combined
with other ensemble learning methods, such as stacking, and the feature importance of
the best model can be explained with more advanced interpretation methods, such as LIME
(Local Interpretable Model-agnostic Explanations).
 

References 

 
Aini, Q., Manongga, D., Rahardja, U., Sembiring, I., & Efendy, R. (2023). Innovation and
Key Benefits of Business Models in Blockchain Companies. Blockchain Frontier
Technology, 2(2), 24-35. https://doi.org/10.34306/bfront.v2i2.161

Alfian, G., Syafrudin, M., Fahrurrozi, I., Fitriyani, N. L., Atmaji, F. T. D., Widodo, T., ... & 
Rhee, J. (2022). Predicting breast cancer from risk factors using SVM and extra-trees-
based feature selection method. Computers, 11(9), 136. 
https://doi.org/10.3390/computers11090136 

Aly, S., Alfonse, M., & Salem, A. B. M. (2022). Intelligent Model for Enhancing the 
Bankruptcy Prediction with Imbalanced Data Using Oversampling and 
CatBoost. International Journal of Intelligent Computing and Information Sciences, 22(3), 92-108.
https://doi.org/10.21608/ijicis.2022.105654.1138

Asundi, R. V., Prakash, R., & Kumar, K. (n.d.). Class Weight technique for Handling Class 
Imbalance. 

Bintoro, B. P. K., Lutfiani, N., & Julianingsih, D. (2023). Analysis of the Effect of Service
Quality on Company Reputation on Purchase Decisions for Professional Recruitment
Services. APTISI Transactions on Management (ATM), 7(1), 35-41.
https://doi.org/10.33050/atm.v7i1.1736

Bluwstein, K., Buckmann, M., Joseph, A., Kapadia, S., & Simsek, Ö. (2021). Credit growth, 
the yield curve and financial crisis prediction: Evidence from a machine learning 
approach. ECB Working Paper No. 2021/2614. 
http://dx.doi.org/10.2139/ssrn.3969562 

Candra, O., Chammam, A., Rahardja, U., Ramirez-Coronel, A. A., Al-Jaleel, A. A., Al-
Kharsan, I. H., ... & Rezai, M. M. (2023). Optimal Participation of the Renewable 
Energy in Microgrids with Load Management Strategy. Environmental and Climate 
Technologies, 27(1), 56-66. https://doi.org/10.2478/rtuect-2023-0005 


Carmona, P., Climent, F., & Momparler, A. (2019). Predicting failure in the US banking 
sector: An extreme gradient boosting approach. International Review of Economics & 
Finance, 61, 304-323. https://doi.org/10.1016/j.iref.2018.03.008 

Chen, R. C., Dewi, C., Huang, S. W., & Caraka, R. E. (2020). Selecting critical features for 
data classification based on machine learning methods. Journal of Big Data, 7(1), 52. 
https://doi.org/10.1186/s40537-020-00327-4 

Coffinet, J., & Kien, J. N. (2019). Detection of rare events: A machine learning toolkit with 
an application to banking crises. The Journal of Finance and Data Science, 5(4), 183-207. 
https://doi.org/10.1016/j.jfds.2020.04.001 

Faris, H., Abukhurma, R., Almanaseer, W., Saadeh, M., Mora, A. M., Castillo, P. A., & 
Aljarah, I. (2020). Improving financial bankruptcy prediction in a highly imbalanced 
class distribution using oversampling and ensemble learning: a case from the Spanish 
market. Progress in Artificial Intelligence, 9, 31-53.
https://doi.org/10.1007/s13748-019-00197-9

Gnip, P., & Drotár, P. (2019, September). Ensemble methods for strongly imbalanced data: 
bankruptcy prediction. In 2019 IEEE 17th International Symposium on Intelligent Systems 
and Informatics (SISY), 155-160. IEEE. 

Google Developers. (2021). Machine Learning.
https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data

Hariguna, T., Rahardja, U., & Sarmini. (2022). The Role of E-Government Ambidexterity as 
the Impact of Current Technology and Public Value: An Empirical 
Study. Informatics, 9(3), 67. https://doi.org/10.3390/informatics9030067 

Hermawan, A., Sunaryo, W., & Hardhienata, S. (2023). Optimal Solution for OCB 
Improvement Through Strengthening of Servant Leadership, Creativity, and 
Empowerment. Aptisi Transactions on Technopreneurship (ATT), 5(1Sp), 11-21. 
https://doi.org/10.34306/att.v5i1Sp.307 

Indrawati, A. (2021). Penerapan Teknik Kombinasi Oversampling Dan Undersampling
Untuk Mengatasi Permasalahan Imbalanced Dataset. JIKO (Jurnal Informatika dan
Komputer), 4(1), 38-43. https://doi.org/10.33387/jiko.v4i1.2561

Islam, S. R., Eberle, W., Ghafoor, S. K., Bundy, S. C., Talbert, D. A., & Siraj, A. (2019). 
Investigating bankruptcy prediction models in the presence of extreme class 
imbalance and multiple stages of economy. arXiv preprint arXiv:1911.09858. 

Jabeur, S. B., Gharib, C., Mefteh-Wali, S., & Arfi, W. B. (2021). CatBoost model and artificial 
intelligence techniques for corporate failure prediction. Technological Forecasting and 
Social Change, 166, 120658. https://doi.org/10.1016/j.techfore.2021.120658 

Junyu, H. (2020, August). Prediction of Financial Crisis Based on Machine Learning. In 2020 
The 4th International Conference on Business and Information Management, 71-75.
https://doi.org/10.1145/3418653.3418674 

Kosasi, S., Yuliani, I. D. A. E., & Rahardja, U. (2022, February). Boosting e-service quality of 
online product businesses through it leadership. In 2022 International Conference on 
Science and Technology (ICOSTECH), 1-10. IEEE. 
https://doi.org/10.1109/ICOSTECH54296.2022.9829036

Liu, Q., Wang, C., Zhang, P., & Zheng, K. (2021). Detecting stock market manipulation via 
machine learning: evidence from China Securities Regulatory Commission punishment 
cases. International Review of Financial Analysis, 78, 101887. 
https://doi.org/10.1016/j.irfa.2021.101887 

Lu, S., Liu, C., & Chen, Z. (2021). Predicting stock market crisis via market indicators and
mixed frequency investor sentiments. Expert Systems with Applications, 186, 115844.
https://doi.org/10.1016/j.eswa.2021.115844


Lutfiani, N., Wijono, S., Rahardja, U., Iriani, A., Aini, Q., & Septian, R. A. D. (2023). A 
Bibliometric Study: Recommendation based on Artificial Intelligence for iLearning 
Education. Aptisi Transactions on Technopreneurship (ATT), 5(2), 112-119. 
https://doi.org/10.34306/att.v5i2.279 

Mahardika, R., & Irawan, F. (2022). The Impact Of Thin Capitalization Rules On Tax 
Avoidance In Indonesia. JURNAL PAJAK INDONESIA (Indonesian Tax 
Review), 6(2S), 651-662. https://doi.org/10.31092/jpi.v6i2S.1972 

Marlina, E., Putri, A. A., & Suriyanti, L. H. (2023). Determinants of strategic management
accounting implementation in Higher Education Institutions (HEIs) in Indonesia.
Journal of Accounting and Investment, 24(2), 306-322.
https://doi.org/10.18196/jai.v24i2.16562

Mishraz, N., Ashok, S., & Tandon, D. (2021). Predicting Financial Distress in the Indian 
Banking Sector: A Comparative Study Between the Logistic Regression, LDA and 
ANN Models. Global Business Review, 09721509211026785.
https://doi.org/10.1177/09721509211026785 

Petropoulos, A., Siakoulis, V., Stavroulakis, E., & Vlachogiannakis, N. E. (2020). Predicting 
bank insolvencies using machine learning techniques. International Journal of 
Forecasting, 36(3), 1092-1113. https://doi.org/10.1016/j.ijforecast.2019.11.005 

Pramono, E. S., Rudianto, D., Siboro, F., Abdul Baqi, M. P., & Julianingsih, D. (2022).
Analysis Investor Index Indonesia with Capital Asset Pricing Model (CAPM). Aptisi
Transactions on Technopreneurship (ATT), 4(1), 35-46.
https://doi.org/10.34306/att.v4i1.218

Pratama, A., & Wijaya, A. (2023). Implementasi Sistem Good Corporate Governance Pada 
Perangkat Lunak Berbasis Website PT. Pusaka Bumi Transportasi. Technomedia 
Journal, 7(3), 340-353. https://doi.org/10.33050/tmj.v7i3.1917 

Putri, H. R., & Dhini, A. (2019). Prediction of financial distress: Analyzing the industry
performance in stock exchange market using data mining. In 2019 16th International
Conference on Service Systems and Service Management (ICSSSM), 1-5. IEEE.
https://doi.org/10.1109/ICSSSM.2019.8887824

Putri, R. L., Hidayat, S., Wahyono, E., & Rahmawati, L. (2023). Big Data and Strengthening 
MSMEs After the Covid-19 Pandemic (Development Studies on Batik MSMEs in 
East Java). IAIC Transactions on Sustainable Digital Innovation (ITSDI), 4(2), 83-100. 
https://doi.org/10.34306/itsdi.v4i2.574 

Qiu, Y., Zhou, J., Khandelwal, M., Yang, H., Yang, P., & Li, C. (2021). Performance 
evaluation of hybrid WOA-XGBoost, GWO-XGBoost and BO-XGBoost models to 
predict blast-induced ground vibration. Engineering with Computers, 1-18. 
https://doi.org/10.1007/s00366-021-01393-9 

Rahardja, U. et al. (2023). Implementation of Tensor Flow in Air Quality Monitoring Based
on Artificial Intelligence. International Journal of Artificial Intelligence Research, 6(1).

Santoso, R. E., Prawiyogi, A. G., Rahardja, U., Oganda, F. P., & Khofifah, N. (2022). 
Penggunaan dan Manfaat Big Data dalam Konten Digital. ADI Bisnis Digital 
Interdisiplin Jurnal, 3(2), 88-91. https://doi.org/10.34306/abdi.v3i2.836 

Shrivastava, S., Jeyanthi, P. M., & Singh, S. (2020). Failure prediction of Indian Banks using 
SMOTE, Lasso regression, bagging and boosting. Cogent Economics & Finance, 8(1), 
1729569. https://doi.org/10.1080/23322039.2020.1729569

Sipahutar, R. J. et al. (2020). Drivers and Barriers to IT Service Management Adoption in
Indonesian Start-up Based on the Diffusion of Innovation Theory. In 2020 Fifth
International Conference on Informatics and Computing (ICIC), 1-8. IEEE.
https://doi.org/10.1109/ICIC50835.2020.9288556


Sir, Y. A., & Soepranoto, A. H. H. (2022). Pendekatan Resampling Data Untuk Menangani
Masalah Ketidakseimbangan Kelas. J-ICON: Jurnal Komputer dan Informatika,
10(1), 31-38. https://doi.org/10.35508/jicon.v10i1.6554

Soesilo, T. H., & Tinggi, M. M. P. (2021). Analisis pengembangan sistem informasi gaji pegawai 
(sigap) menggunakan soft system methodology (Studi pada Biro Keuangan Universitas Brawijaya). 
Universitas Brawijaya. 

Speiser, J. L., Miller, M. E., Tooze, J., & Ip, E. (2019). A comparison of random forest 
variable selection methods for classification prediction modeling. Expert systems with 
applications, 134, 93-101. https://doi.org/10.1016/j.eswa.2019.05.028 

Srinivas, P., & Katarya, R. (2022). hyOPTXg: OPTUNA hyper-parameter optimization 
framework for predicting cardiovascular disease using XGBoost. Biomedical Signal 
Processing and Control, 73, 103456. https://doi.org/10.1016/j.bspc.2021.103456 

Sun, X., Liu, M., & Sima, Z. (2020). A novel cryptocurrency price trend forecasting model 
based on LightGBM. Finance Research Letters, 32, 101084. 
https://doi.org/10.1016/j.frl.2018.12.032 

Thakkar, A., & Chaudhari, K. (2021). Fusion in stock market prediction: a decade survey on 
the necessity, recent developments, and potential future directions. Information 
Fusion, 65, 95-107. https://doi.org/10.1016/j.inffus.2020.08.019 

Tölö, E. (2020). Predicting systemic financial crises with recurrent neural networks. Journal of 
Financial Stability, 49, 100746. https://doi.org/10.1016/j.jfs.2020.100746 

Tussa'diah, H., & Kartika, N. Y. (2023). Critical Discourse Analysis on Linguistic Ideology of 
The Netizens Comments. ADI Journal on Recent Innovation, 4(2), 110-121. 
https://doi.org/10.34306/ajri.v4i2.838 

Vien, B. S., Wong, L., Kuen, T., Rose, L. F., & Chiu, W. K. (2021). A Machine Learning 
Approach for Anaerobic Reactor Performance Prediction Using Long Short-Term 
Memory Recurrent Neural Network. Struct. Health Monit. 8apwshm, 18, 61.  

Wang, D. N., Li, L., & Zhao, D. (2022). Corporate finance risk prediction based on 
LightGBM. Information Sciences, 602, 259-268. 
https://doi.org/10.1016/j.ins.2022.04.058 

Wang, H., & Liu, X. (2021). Undersampling bankruptcy prediction: Taiwan bankruptcy 
data. Plos one, 16(7), e0254030. https://doi.org/10.1371/journal.pone.0254030 

Widiastuti, T., Karsa, K., & Juliane, C. (2023). Evaluasi Tingkat Kepuasan Mahasiswa 
Terhadap Pelayanan Akademik Menggunakan Metode Klasifikasi Algoritma C4.5.
Technomedia Journal, 7(3), 364-380. https://doi.org/10.33050/tmj.v7i3.1932

Zanubiya, J., Meria, L., & Juliansah, M. A. D. (2023). Increasing Consumers with Satisfaction 
Application based Digital Marketing Strategies. Startupreneur Bisnis Digital (SABDA 
Journal), 2(1), 12-21. 

Zhang, Z., & Chen, Y. (2022). Tail risk early warning system for capital markets based on
machine learning algorithms. Computational Economics, 60(3), 901-923.
https://doi.org/10.1007/s10614-021-10171-0
