*Corresponding Author

P-ISSN: 2087-1244
E-ISSN: 2476-907X

53

ComTech: Computer, Mathematics and Engineering Applications, 13(1), June 2022, 53−62
DOI: 10.21512/comtech.v13i1.7388

Prediction of Undergraduate Student’s Study Completion 
Status Using MissForest Imputation in Random Forest 

and XGBoost Models
Intan Nirmala1*, Hari Wijayanto2, and Khairil Anwar Notodiputro3 

1-3Department of Statistics, Institut Pertanian Bogor
Jln. Raya Dramaga, Jawa Barat 16680, Indonesia

1inirmalanirmala@apps.ipb.ac.id; 2hari@apps.ipb.ac.id; 3khairil@apps.ipb.ac.id

Received: 31st May 2021/ Revised: 06th August 2021/ Accepted: 09th August 2021 

How to Cite: Nirmala, I., Wijayanto, H., & Notodiputro, K. A. (2022). Prediction of Undergraduate Student’s Study 
Completion Status Using MissForest Imputation in Random Forest and XGBoost Models. ComTech: Computer, 

Mathematics and Engineering Applications, 13(1), 53−62. https://doi.org/10.21512/comtech.v13i1.7388

Abstract - The number of higher education 
graduates in Indonesia is calculated based on their 
completion status. However, many undergraduate 
students have reached the maximum length of study, 
but their completion status is unknown. This condition 
becomes a problem in calculating the actual number 
of graduates as it is used as an indicator of higher 
education evaluation and other policy references. 
Therefore, the unknown completion status of the 
students who have reached the maximum length of 
study must be predicted. The research compared 
the performance of Random Forest and Extreme 
Gradient Boosting (XGBoost) classification models 
in predicting the unknown completion status. 
The research used a dataset containing 13.377 
undergraduate students’ profiles from the Higher 
Education Database (PDDikti), Ministry of Education, 
Culture, Research, and Technology. The dataset 
was incomplete, and the proportion of missing data 
was 20,9% of the total data. Because missing data 
might lead to prediction bias, the research also used 
MissForest imputation to overcome the missing 
data in the classification modelling and compared it 
to Mean/Mode and Median/Mode imputation. The 
results show that MissForest outperforms the other 
two imputations in both classifiers but requires the 
longest computation time. Furthermore, the XGBoost 
model with MissForest is significantly superior to the 
Random Forest model with MissForest. Hence, the 
best model chosen to predict the completion status is 
XGBoost with MissForest imputation.

Keywords:  study completion status; MissForest 
imputation; Random-Forest model; XGBoost model  

I. INTRODUCTION

Every tertiary education institution in Indonesia 
is obliged to submit its higher education data to PDDikti 
which is coordinated by the Ministry of Education, 
Culture, Research, and Technology of the Republic of 
Indonesia. It is a system that collects higher education 
data from all tertiary education institutions, which are 
integrated at a national scale. One of the submitted data 
is about students’ study completion status, whether 
they have graduated or dropped out. The number of 
higher education graduates is calculated based on this 
completion status.

According to the 2020 National Higher 
Education Standards (Menteri Pendidikan 
dan Kebudayaan Republik Indonesia, 2020), 
undergraduate programs' maximum length of study is 
seven years. Those who have reached the maximum 
period shall have a completion status recorded on 
PDDikti. However, many students have reached the 
maximum length of study, but their completion status 
is unknown. There are two possibilities concerning 
this case. First, the students have graduated, but the 
universities do not report their graduation status. 
Second, students do not continue their studies, and 
there is no further information. This condition makes 
it difficult to calculate the actual number of graduates. 
Moreover, the number of graduates is used as an 
indicator of higher education evaluation and other 
policy references. Therefore, the unknown students’ 
study completion status with the maximum length of 
study must be predicted.

Prediction of student completion status can be 
carried out with a classification model. One of them 
is a classification tree. The advantage of this method 
is that it does not depend on certain assumptions, such 


54 ComTech: Computer, Mathematics and Engineering Applications, Vol. 13 No. 1 June 2022, 53−62

as the normality of data distribution. The problem of 
instability and high variance of a single tree can be 
overcome by ensembling the classification tree (Wang 
& Wu, 2018).  

Some previous studies in education have used 
data mining algorithms with a classification tree 
as their base learner. Kurniawan, Anggrawan, and 
Hairani (2020) proposed a graduation prediction 
system for undergraduate students of Bumigora 
University using a classification tree algorithm. 
However, their study was limited to a small dataset 
and one type of classifier. Then, Yuliansyah, Imaniati, 
Wirasto, and Wibowo (2021) used a larger dataset to 
predict students’ graduation on time in the engineering 
faculty of some private universities in Indonesia using 
a classification tree. They also only used one type 
of classifier but compared the validation result on a 
various number of testing data.

In contrast, Hussain, Dahan, Ba-Alwib, and 
Ribata (2018) compared four classification methods 
(J48, PART, Random Forest, and Bayes Network) to 
predict students’ performance from three different 
colleges in India. The result showed that Random 
Forest outperformed the other classifiers. Similarly, 
Baruah, Baruah, and Goswami (2020) predicted 
students’ academic performance in an engineering 
college in India using seven different classifiers (J48, 
Random Forest, Rap Tree, Logistic Model Tree (LMT), 
Naïve Bayes, BayesNet, and PART). They found 
that Random Forest was the most efficient algorithm 
among all the considered algorithms. Then, Yan (2021) 
used some machine learning algorithms to predict 
students’ performance in China. Those algorithms 
were Extreme Gradient Boosting (XGBoost), Random 
Forest, Lasso, Elastic Net, Support Vector Machine, 
and Classification Tree. The XGBoost model achieved 
the best result than five other classic machine learning 
models. 

Based on those previous studies mentioned, 
the research compares two ensemble tree methods: 
Random Forest and XGBoost. It can predict the 
unknown completion status of undergraduate students 
who have reached the maximum length of study. 
Moreover, the previous studies use a complete dataset 
to build a classification model. Meanwhile, the 
research uses data with missing values. In addition, the 
data used are also larger than in the previous studies.

Random Forest is an extended Bagging 
method in which the training data are resampled by 
repeated bootstrap, and some classification trees are 
built based on the bootstrapping result (Breiman, 
2001). The training procedure for Random Forest is 
summarized by Ahmad, Mourshed, and Rezgui (2018) 
in the following steps. First, it performs bootstrap 
sampling from the original dataset. Second, for each 
bootstrap drawn in the first step, it grows an unpruned 
tree by randomly sampling m variables from the 
input variables and selects the best split from among 
those variables. Third, the first and second steps are 
repeated in k times until a forest consisting of k trees 
is formed. Fourth, it predicts new data by aggregating 

the prediction of all trees.
XGBoost is an extended Gradient Boosting 

method with a penalty component on the loss function 
to prevent overfitting. Compared to the traditional 
Gradient Boosting, this method has higher speed and 
performance owing to the parallel nature in which 
trees are built (Aminu, Abdulkarim, Aliyu, Aliyu, & 
Turaki, 2019). If the Random Forest method builds 
trees parallel, XGBoost builds trees sequentially. On 
XGBoost, every new tree is built to reduce the mistake 
of the previous tree (Anwar, Winarno, Hadikurniawati, 
& Novita, 2021). 

As previously mentioned, the data used in the 
research are incomplete, and the proportion of the 
missing data is 20,9% of the total data. Missing data 
may cause bias in the parameter estimates of analysis 
(Blazek, Zwieten, Saglimbene, & Teixeira-Pinto, 
2021). Hence, imputation for the missing value is 
performed in the pre-processing step of classification 
modelling (Khan & Hoque, 2020). 

The simplest imputation replaces missing 
data with mean or median for numerical data and 
mode for categorical data. Various imputation 
methods have been used in many fields, e.g., Hot-
Deck Imputation, Principal Component Analysis 
(PCA), K-Nearest Neighbors (KNN) (Troyanskaya 
et al., 2001), and MissPALasso (Städler, Stekhoven, 
& Bühlmann, 2014). However, these methods only 
work on one type of data, which is only numerical or 
categorical. For mixed data, imputation is conducted 
separately according to the type of data that ignores 
the relationship between numerical and categorical 
variables. 

According to Stekhoven and Bühlmann (2012), 
MissForest imputation can work on mixed data 
simultaneously and have a non-parametric character. 
It does not depend on certain assumptions of data 
distribution. In MissForest, Xs is the s-th variable 
containing the missing value in . 
Furthermore,  is the observed 
value of the Xs , and  is the missing value of the Xs. 
Variables other than Xs with  \  
are denoted by . Moreover, variables other than Xs 
having observation correspond to  are denoted by 

. Figure 1 illustrates the partition of the dataset in 
the MissForest imputation.

MissForest starts by replacing all missing values 
with initial values, which can be Mean/Median/Mode 
or other imputation values. Furthermore, the variables 
of Xs with s = 1, … , p are sorted from small to large 
according to their amount of missing value. The 
missing value is imputed for each Xs by constructing 
a Random Forest model using the  as the response 
and the  as the predictor. Furthermore, the missing 
value  can be predicted by implementing the model 


55Prediction of Undergraduate..... (Intan Nirmala et al.)

to the . The new prediction result replaces the 
previous imputation value. This procedure is conducted 
iteratively until the stopping criterion γ is met if the 
difference between the value of the latest imputation 
and the previous one increases for the first time. 

Figure 1 Dataset Partition in MissForest

According to Stekhoven and Bühlmann (2012), 
MissForest is outperformed by Multivariate Imputation 
by Chained Equation (MICE) and K-Nearest 
Neighbors (KNN). Then, the previous research has 
compared the effect of Mean, Median, KNN, MICE, 
and MissForest to impute the simulated missing data 
for Naïve Bayes and Decision Tree Classifier (Cihan, 
Kalıpsız, & Gökçe, 2019). The most successful 
imputation in both classifiers is MissForest. Then, 
according to Alsaber, Pan, and Al-Hurban (2021), 
several methods can be conducted to impute missing 
data for the air quality monitoring dataset. MissForest, 
Bayesian PCA, Predictive Mean Matching (PMM), 
KNN, and Expectation Maximization imputation are 
compared. It shows that MissForest is the only method 
with a consistent and comparatively lower imputation 
error.

The main objective of the research is to compare 
the performance of MissForest imputation, Mean/
Mode imputation, and Median/Mode imputation to 
overcome missing data in predicting the completion 
status of undergraduate students who have reached 
the maximum length of study. The next objective is to 
compare the Random Forest and XGBoost algorithms 
used as the classifiers. Then, the best model will be 
chosen to predict the unknown completion status 
of undergraduate students who have reached the 
maximum length of study.  The goodness of fit 
employed is accuracy, sensitivity, specificity, G-Mean, 
and Area Under the Curve (AUC) of Receiver 
Operating Characteristic (ROC).

II. METHODS

All data in the research are from the PDDikti 
database from the Ministry of Education, Culture, 
Research, and Technology of the Republic of 

Indonesia. The data consist of 13.377 samples of 
undergraduate students who were admitted in 2013 
from public and private universities. These samples 
are taken by stratified random sampling method 
with proportional allocation from the population of 
936.236 undergraduate students admitted in 2013. 
The determination of strata is based on the type 
of institution working unit, accreditation of study 
programs, and field of study. All the samples have 
reached the maximum length of study in 2020, but 
some of them have unknown completion status. 

The research uses 17 variables, consisting of 
10 numerical predictors, 6 categorical predictors, and 
1 binary response. The numerical predictors include 
the average number of credits in each semester (X1), 
cumulative Grade Point Average (GPA) (X2), average 
semester GPA (X3), number of courses with an index 
score of 0−0,99 (X4), number of courses with an 
index score of 1−1,99 (X5), number of courses with 
an index score of 2−2,99 (X6), number of courses with 
an index score of 3−4 (X7), students’ age when they 
are registered for the first time (X8), number of total 
semesters taken by students (X9), and number of the 
active semester (X10). All those numerical predictors 
contain missing values in some observations.

Then, the average number of credits in each 
semester is assumed to affect the study load for 
students. Therefore, it can have an impact on their 
completion status. Furthermore, the index score is 
used to measure students’ performance in a course. 
The value of this index ranges from 0 to 4. The higher 
the index score is, the better the students’ performance 
will be in a course. Index scores can be converted 
to letter scores. For example, an index score of 4 is 
equivalent to an index letter of A, an index score of 
3 is equivalent to an index score of B, and so forth. 
Cumulative GPA, semester GPA, and index score 
represent students’ academic performance. According 
to Costa, Bispo, and Pereira (2018), there is evidence 
that the higher the students’ academic performance is, 
the lower the risk of dropout will be. 

Next, the number of total semesters and 
active semesters is related to students’ attendance. 
Attendance has a significant indirect effect on academic 
performance (Kim, Shakory, Azad, Popovic, & Park, 
2020). Students’ age is also one of the features that 
most researchers agree on when predicting students’ 
academic achievement in higher education (Alturki, 
Hulpuș, & Stuckenschmidt, 2020).

The categorical variables include accreditation 
of higher education institution (X11), accreditation of 
study program (X12), type of institution working unit 
(X13), the field of study (X14), inactive history (X15), 
and gender (X16). The response (Y) is the students’ 
completion status, whether they have graduated or 
dropped out. All categorical variables are complete, 
except for inactive history and gender, which need to 
be imputed.

Moreover, the accreditation of higher education 
institutions and study programs represents the quality 
of education. They are assumed to have an impact on 


56 ComTech: Computer, Mathematics and Engineering Applications, Vol. 13 No. 1 June 2022, 53−62

students’ performance. Based on both accreditation 
types, the research groups each education institution 
into four categories: not accredited, good, very good, 
and excellent. Meanwhile, the type of institution 
working unit divides education institutions based 
on their finance and human resources management. 
There are four categories of institutions based on it: 
Perguruan Tinggi Negeri (PTN), Perguruan Tinggi 
Negeri Berbadan Hukum (PTN-BH), Perguruan 
Tinggi Swasta (PTS), and Badan Layanan Umum 
(BLU). It is also assumed to have an impact on 
students’ performance.

Furthermore, the field of study is related to 
the level of difficulty experienced by students. The 
inactive history categorizes students into two groups: 
those who have ever taken an inactive semester and 
have never taken an inactive semester. This variable is 
related to students’ attendance, which affects academic 
performance. Meanwhile, like the age variable, gender 
is also one of the features that most researchers agree 
on when predicting students’ academic achievement 
(Alturki et al., 2020). 

The data consist of 13.377 rows of observations 
and 17 columns of variables that produce 227.409 
matrix cells. The proportion of the missing data is 
20,9% of the total cells. Table 1 shows the proportion 
of missing data for each variable. The cumulative GPA, 
the semester GPA, and the average number of semester 
credits are the variables with the highest proportion of 
missing data. In contrast, the accreditation of higher 

education institutions and study programs, the type of 
institution working unit, and the field of study do not 
contain any missing data in their observations.

The entire procedure of the research is 
completed using R software. The research uses some 
packages, such as MissForest, Caret, and Random 
Forest. The procedure involves splitting the data into 
two datasets: data with complete response variables 
(11.875 observations) and data with unknown 
response variables (1.502 observations). The data with 
complete responses are split again into two parts: 80% 
as the training data to build the model and 20% as the 
testing data to evaluate the model. 

Moreover, the missing data should be dealt with 
carefully before analysis. Otherwise, the information 
extracted from the dataset containing missing values 
will lead to the wrong decision-making (Manimekalai 
& Kavitha, 2018). The imputation is also performed 
separately for training and testing data to avoid data 
leakage (Marcinkevics, Reis Wolfertstetter, Wellmann, 
Knorr, & Vogt, 2021). Then, three approaches are made 
to overcome the missing training data: MissForest 
imputation, Mean/Mode imputation, and Median/
Mode imputation. The MissForest imputation uses 
mtry = √the number of predictors = 4 and ntree = 50. 
The same approaches also impute missing data in the 
testing data. From the three imputation approaches, 
there are three completed datasets. Each of them 
consists of completed training and testing data.

Table 1 Proportion of Missing Data

No. Variable Number of Missing Data 
in Each Variable

Missing Data Proportion (%) for Each 
Variable in the Total Data

1 X1 4.311 9,06
2 X2 5.926 12,45
3 X3 4.327 9,09
4 X4 3.939 8,27
5 X5 3.958 8,31
6 X6 3.893 8,18
7 X7 3.996 8,39
8 X8 3.226 6,78
9 X9 3.450 7,25
10 X10 3.365 7,07
11 X11 0 0,00
12 X12 0 0,00
13 X13 0 0,00
14 X14 0 0,00
15 X15 3.051 6,41
16 X16 2.663 5,59
17 Y 1.502 3,15

TOTAL 47.607 100


57Prediction of Undergraduate..... (Intan Nirmala et al.)

Random Forest and XGBoost classification 
models are built from the completed training data 
in the research. The research uses the default values 
for all hyperparameters in both classifiers. Then, the 
classification model and imputation performance 
are evaluated based on the completed testing data. 
The research repeats all processes ten times, from 
the splitting data into training and testing data until 
the evaluation of the model to see the stability of the 
models. The average performance of each classification 
model is calculated from the entire repetition. Then, 
the model with the best performance is chosen to 
predict the unknown completion status of the 1.502 
undergraduate students.

III. RESULTS AND DISCUSSIONS

The research uses 13.377 samples of admitted 
undergraduate students in 2013. About 88,77% of the 
observations are employed for modelling. Then, the 
data are split into training and testing data with a ratio of 
80:20. The research has split data into the training and 
testing data ten times. So, it has produced ten different 
sets of training and testing data. Furthermore, the 
unknown responses of the other 11,23% observations 
are predicted using the classification model that has 
been constructed. 

Missing data can affect the performance of a 
classification model. However, the missing data in the 
research happen randomly. The observed variables or 
the missing values do not influence the occurrence of 
missing data. Therefore, the missing data are assumed 
to be Missing Completely at Random (MCAR) and can 
be completed by imputation technique. Imputation is 
a technique to replace missing data with certain values 
obtained based on information from the available 
dataset (Kokla, Virtanen, Kolehmainen, Paananen, 
& Hanhineva, 2019). In classification modelling, 
imputation is carried out in the pre-processing stage. 

The research separates the imputation of training 
data and testing data separately to avoid information 
leakage. However, the imputation of the testing data 
utilizes the information from the imputation result of 
training data.

The imputation method used in the research is 
MissForest imputation. The first step is to replace all 
the missing data with the Mean or Mode of the observed 
variables. Furthermore, the variables containing 
missing data are sorted based on the amount of 
missing data from small to large. The Random Forest 
model is built using the observed variables for each 
variable. Then, the model predicts the missing data. 
This procedure is repeated iteratively until a stopping 
criterion is found. It is when the difference between the 
new imputed value and the previous value increases 
for the first time in numerical and categorical data. 

As previously mentioned, MissForest 
imputation is an imputation that works based on the 
Random Forest algorithm. In MissForest, each tree 
is built using the sample obtained from the bootstrap 
process. Each bootstrap sample randomly leaves out 
about one-third of the observations. These left-out 
observations for a given tree are called Out of Bag 
(OOB) (Schonlau & Zou, 2020). OOB observations are 
not included in the tree-building process. MissForest 
performance can be measured based on predicted and 
assumed OOB as testing data. Imputation performance 
on numerical data is measured by Normalized Root 
Mean Square Error (NRMSE) and categorical data 
by Proportion of Falsely Classification (PFC). Based 
on Table 2, MissForest imputation in the research 
yields an average NRMSE of 0,451 and PFC of 0,058, 
calculated based on OOB. Imputation performance is 
categorized as good if NRMSE and PFC are close to 
0. On the contrary, it is considered not good if it is 
close to 1 (Stekhoven & Bühlmann, 2012). NRMSE 
and PFC in the research appear to be close to 0, so the 
performance of MissForest is relatively good.

Table 2 Performance of MissForest Based on OOB

n-th train data NRMSE PFC

1 0,452 0,057
2 0,453 0,057
3 0,449 0,057
4 0,450 0,058
5 0,455 0,058
6 0,451 0,058
7 0,451 0,058
8 0,451 0,058
9 0,452 0,059
10 0,450 0,058


58 ComTech: Computer, Mathematics and Engineering Applications, Vol. 13 No. 1 June 2022, 53−62

As a comparison, imputation is also conducted 
using Mean/Mode imputation and Median/Mode 
imputation. In the Mean/Mode imputation, the mean 
of all values within the same attribute is calculated 
and imputed in the missing data cells (Khan, Khan, & 
Singh, 2018). Meanwhile, Median/Mode imputation 
replaces the numerical missing value with the median 
of all values within the same attribute. In both methods, 
mode substitution can be used instead if the attribute is 
categorical (Acuña & Rodriguez, 2004).

Neither Mean/Mode imputation nor Median/
Mode imputation produces OOB observations as in 
MissForest. Consequently, the NRMSE and PFC in 
both methods can only be measured if the complete 
observations are available before. So, they can be 
compared to the imputation result. The data are a 
real case that contains the missing value from the 
beginning. It causes NRMSE and PFC of Mean/
Median/Mode imputation not to be measured. 
Therefore, a performance comparison of MissForest, 
Mean/Mode, and Median/Mode imputation is carried 
out after building the classification model, using the 
goodness of fit of the classification models.

Based on Table 3, MissForest computation 
time is much longer than the other two methods. It 
occurs due to the MissForest algorithm complexity. 
It is influenced by the proportion of missing values, 
number of variables, and number of observations. 
Conversely, the Mean/Mode and Median/Mode 
imputation procedures are less complicated. They 
only replace the missing value with Mean/Median/
Mode without complex algorithms. This procedure 
requires a shorter computation time. However, Mean 
imputation for missing values leads to large errors 
in variance estimates when variables have linear 

relationships (Köse, Özgür, Coşgun, Keskinoğlu, & 
Keskinoğlu, 2020). This condition also applies to the 
use of Median and Mode to impute the missing data.

Tables 4 and 5 present the average performance 
of Random Forest and XGBoost models with the three 
different imputation methods. In Random Forest and 
XGBoost, at the significance level of 5%, models with 
MissForest appear to be significantly better than the 
other two methods. Column P1 in Tables 4 and 5 is 
the p-value of paired t-test between MissForest and 
Mean/Mode imputation performance. Meanwhile, 
column P2 presents the p-value of paired t-test 
between MissForest and Median/Mode imputation 
performance. The alternative hypothesis in these tests 
is that the average of MissForest performance is better 
than the average of the other imputation performances. 
MissForest excels in all measures, regardless of the 
only slight difference from the Mean/Median/Mode 
imputation performance. It is simultaneously affirmed 
by the performance distribution of each model in 
Figure 2. Mean/Mode imputation has a slightly better 
average performance than Median/Mode imputation.

Table 6 shows the average classification 
performance in both models. Regardless of the 
imputation method used, performances of Random 
Forest and XGBoost differ significantly. It is 
indicated by the p-value of paired t-test of both models 
in Table 6. Most of the p-values are significant at a 
significance level of 5%. The alternative hypothesis 
carried out is that the performance of Random Forest is 
different from XGBoost. According to Table 6, Random 
Forest and XGBoost with MissForest imputation are 
the only models with an overall average performance 
of more than 90. Those two models are significantly 
different in all measurements, except for G-Mean.

Table 3 Computation Time of MissForest, Mean/Mode, and Median/Mode Imputation

n-th Train Data
MissForest 
(In Second)

Mean/Mode 
(In Second)

Median/Mode 
(In Second)

1 688,320 0,240 0,017

2 626,890 0,015 0,018

3 651,134 0,022 0,023

4 698,493 0,015 0,017

5 674,990 0,018 0,021

6 781,132 0,024 0,031

7 658,769 0,021 0,024

8 621,354 0,015 0,020

9 614,350 0,020 0,028

10 619,922 0,015 0,018


59Prediction of Undergraduate..... (Intan Nirmala et al.)

Table 4 Comparison of Random Forest Performance with Three Different Imputations

Goodness of Fit
Random Forest

MissForest (1) Mean/Mode (2) Median/Mode (3) P1 P2

Accuracy 93,98 92,33 92,00 2E-08 9E-11

Sensitivity 90,72 87,17 86,54 3E-06 8E-07

Specificity 94,90 93,77 93,52 2E-08 2E-09

G-Mean 92,78 90,41 89,96 2E-07 8E-09

AUC 97,62 96,17 95,88 4E-08 1E-08

Table 5 Comparison of XGBoost Performance with Three Different Imputations

Goodness of Fit
XGBoost

MissForest (1) Mean/Mode (2) Median/Mode (3) P1 P2

Accuracy 94,42 92,18 91,83 5E-09 5E-09

Sensitivity 90,21 86,38 85,78 5E-06 8E-08

Specificity 95,67 93,83 93,54 1E-08 8E-08

G-Mean 92,90 90,02 89,58 2E-07 8E-09

AUC 97,77 95,85 94,00 2E-09 5E-11

Note: 
P1 = p-value of paired t-test of model performance with imputation 1 and 2 (Ha: imputation 1 > imputation 2),
P2 = p-value of paired t-test of model performance with imputation 1 and 3 (Ha: imputation 1 > imputation 3).

Figure 2 Boxplot of Classification Performance with Three Different Imputations


60 ComTech: Computer, Mathematics and Engineering Applications, Vol. 13 No. 1 June 2022, 53−62

Then, Figure 3 shows the distribution of 
Random Forest and XGBoost performances on the 
data imputed by MissForest. The median of XGBoost 
performance tends to be better than Random Forest, 
except for sensitivity. It is also confirmed by paired 
t-test between the models in Table 7. The alternative 
hypothesis used is that the performance of XGBoost 
with MissForest is superior to Random Forest with 
MissForest. At a significance level of 5%, XGBoost 
is significantly superior to Random Forest except for 
G-Mean and sensitivity. Hence, the best model chosen 
is XGBoost with MissForest imputation, with an 
average accuracy of 94,42%, sensitivity of 90,21%, 
specificity of 95,67%, G-Mean of 92,90%, and AUC 
of 97,77%.

The results show that 1.502 undergraduate 
students have reached the maximum length of study 
with unknown completion status. The best model is 
used to predict this unknown completion status. Before 
the prediction, the missing value of the data having 
unknown responses is imputed with MissForest. The 
imputation obtains NRMSE of 0,412 and PFC of 0,060. 
It is categorized as performing well. Furthermore, 
responses of those imputed data are predicted using 
the best model. 

According to the best model, 1.502 students 
with unknown completion status are predicted. About 
62,12% of the students are dropouts, and 37,88% have 
graduated. This prediction can estimate the actual 
number of graduates from undergraduate programs. It 

Table 6 Comparison of Random Forest and XGBoost Classification Models

Imputation 
Method

Goodness of Fit
Random Forest XGBoost

P-Value of 
Paired t-testAverage

Standard 
Deviation

Average
Standard 
Deviation

MissForest

Accuracy 93,98 0,45 94,42 0,42 0,00
Sensitivity 90,72 1,01 90,21 1,21 0,02
Specificity 94,90 0,47 95,67 0,40 0,00
G-Mean 92,78 0,60 92,90 0,66 0,34
AUC 97,62 0,43 97,77 0,32 0,02

Mean/Mode

Accuracy 92,33 0,48 92,18 0,38 0,08
Sensitivity 87,17 1,23 86,38 1,14 0,02
Specificity 93,77 0,44 93,83 0,42 0,50
G-Mean 90,41 0,73 90,02 0,60 0,02
AUC 96,17 0,53 95,85 0,43 0,01

Median/Mode

Accuracy 92,00 0,39 91,83 0,46 0,12
Sensitivity 86,54 0,84 85,78 1,08 0,00
Specificity 93,52 0,53 93,54 0,54 0,87
G-Mean 89,96 0,47 89,58 0,62 0,01
AUC 95,88 0,49 94,00 0,56 0,00

Note: 
P-value = p-value of paired t-test between the average of Random Forest and XGBoost performances 
(Ha: the performance of Random Forest model ≠ the performance of  XGBoost model).

Figure 3 Boxplot of MissForest Performance on Random Forest and XGBoost Classification Models


61Prediction of Undergraduate..... (Intan Nirmala et al.)

also helps the government to evaluate higher education 
performance in Indonesia and make certain decisions.

Table 7 Paired T-Test between MissForest Performance in 
Random Forest and XGBoost Classification Models

Goodness of Fit P-Value

Accuracy 0,00
Sensitivity 0,99
Specificity 0,00
G-Mean 0,17
AUC 0,01

Note: 
P-value = p-value of paired t-test between average 
Random Forest and XGBoost performance on the data 
imputed by MissForest (Ha: performance of XGBoost 
model > performance of Random Forest model).

IV. CONCLUSIONS

Many undergraduate students have reached 
the maximum length of study, but some have 
unknown completion status. The research compares 
the performance of Random Forest and XGBoost 
models in predicting the unknown completion status. 
A dataset containing 13.377 undergraduate students’ 
profiles from the PDDikti is used. However, the 
dataset is incomplete, and the proportion of missing 
data is 20,9% of the total data. The research also 
compares MissForest, Mean/Mode, and Median/Mode 
imputation to cope with missing data.

The results show that MissForest imputation 
on Random Forest and XGBoost models outperform 
Mean/Mode and Median/Mode imputations. 
Meanwhile, the XGBoost model with MissForest is 
significantly superior to the Random Forest model 
with MissForest, except for sensitivity and G-Mean. 
The best model chosen is XGBoost with MissForest 
with an average accuracy of 94,42%, sensitivity of 
90,21%, specificity of 95,67%, G-Mean of 92,90%, 
and AUC of 97,77%. According to the best model, 
1.502 students with unknown completion status are 
predicted. It shows 62,12% of the data are dropouts, 
and 37,88% have graduated. This prediction can 
estimate the actual number of graduates from 
undergraduate programs. It also helps the government 
to evaluate higher education performance in Indonesia 
and to make certain decisions. However, the scope of 
the research is limited to predicting the completion 
status of undergraduate students. It can be conducted 
in more levels of higher education in future research.

Despite its superior performance, MissForest 
imputation has a drawback in computational efficiency. 
For future research, selecting a smaller number of 
trees and mtry can be used to reduce the computation 
time.  It does not significantly reduce the accuracy, but 

it must also be adjusted to the size and complexity of 
the dataset.

REFERENCES

Acuña, E., & Rodriguez, C. (2004). The treatment of 
missing values and its effect on classifier accuracy. 
In Classification, clustering, and data mining 
applications (pp. 639−647). Berlin: Springer. 

Ahmad, M. W., Mourshed, M., & Rezgui, Y. (2018). Tree-
based ensemble methods for predicting PV power 
generation and their comparison with support vector 
regression. Energy, 164, 465–474. https://doi.
org/10.1016/j.energy.2018.08.207

Alsaber, A. R., Pan, J., & Al-Hurban, A. (2021). Handling 
complex missing data using random forest approach 
for an air quality monitoring dataset: A case study 
of Kuwait environmental data (2012 to 2018). 
International Journal of Environmental Research 
and Public Health, 18(3), 1–25. https://doi.
org/10.3390/ijerph18031333

Alturki, S., Hulpuș, I., & Stuckenschmidt, H. (2020). 
Predicting academic outcomes: A survey from 2007 
till 2018. Technology, Knowledge and Learning, 
27, 275–307. https://doi.org/10.1007/s10758-020-
09476-0

Aminu, A. A., Abdulkarim, A., Aliyu, A. Y., Aliyu, M., & 
Turaki, A. M. (2019). Detection of phishing websites 
using Random Forest and XGBoost algorithms. 
International Journal of Pure and Applied Sciences, 
2(3), 1–14.

Anwar, M. T., Winarno, E., Hadikurniawati, W., & Novita, 
M. (2021). Rainfall prediction using Extreme 
Gradient Boosting. Journal of Physics: Conference 
Series, 1869, 1–5. https://doi.org/10.1088/1742-
6596/1869/1/012078

Baruah, E. A., Baruah, S., & Goswami, J. A. (2020). 
Comparative analysis of different classification 
algorithms based on students’ academic performance 
using WEKA. IOSR Journal of Computer 
Engineering (IOSR-JCE), 22(1), 49–56. 

Blazek, K., Zwieten, A. V., Saglimbene, V., & Teixeira-
Pinto, A. (2021). A practical guide to multiple 
imputation of missing data in nephrology. 
Kidney International, 99(1), 68–74. https://doi.
org/10.1016/j.kint.2020.07.035

Breiman, L. (2001). Random Forests. Machine 
Learning, 45(1), 5−32. https://doi.org/10.1023/ 
A:1010933404324

Cihan, P., Kalıpsız, O., & Gökçe, E. (2019). Effect of 
imputation methods in the classifier performance. 
Sakarya University Journal of Science, 23(6), 1225–
1236. 

Costa, F. J. D., Bispo, M. D. S., & Pereira, R. D. C. D. F. 
(2018). Dropout and retention of undergraduate 
students in management: A study at a Brazilian Federal 
University. RAUSP Management Journal, 53(1), 
74–85 https://doi.org/10.1016/j.rauspm.2017.12.007

Hussain, S., Dahan, N. A., Ba-Alwib, F. M., & Ribata, 
N. (2018). Educational data mining and analysis 


62 ComTech: Computer, Mathematics and Engineering Applications, Vol. 13 No. 1 June 2022, 53−62

of students’ academic performance using WEKA. 
Indonesian Journal of Electrical Engineering and 
Computer Science, 9(2), 447–459. 

Khan, F. U. F., Khan, K. U. Z., & Singh, S. K. (2018). Is 
Group Means imputation any better than Mean 
imputation: A study using C5.0 classifier. Journal of 
Physics: Conference Series, 1060, 1‒5. https://doi.
org/10.1088/1742-6596/1060/1/012014

Khan, S. I., & Hoque, A. S. M. L. (2020). SICE: An improved 
missing data imputation technique. Journal of Big 
Data, 7(1), 1–21.

Kim, A. S. N., Shakory, S., Azad, A., Popovic, C., & Park, L. 
(2020). Understanding the impact of attendance and 
participation on academic achievement. Scholarship 
of Teaching and Learning in Psychology, 6(4), 272–
284. https://doi.org/10.1037/STL0000151

Kokla, M., Virtanen, J., Kolehmainen, M., Paananen, 
J., & Hanhineva, K. (2019). Random forest-
based imputation outperforms other methods for 
imputing LC-MS metabolomics data: A comparative 
study. BMC Bioinformatics, 20, 1−10. https://doi.
org/10.1186/s12859-019-3110-0

Köse, T., Özgür, S., Coşgun, E., Keskinoğlu, A., & Keskinoğlu, 
P. (2020). Effect of missing data imputation on deep 
learning prediction performance for vesicoureteral 
reflux and recurrent urinary tract infection clinical 
study. BioMed Research International, 2020, 1‒15. 
https://doi.org/10.1155/2020/1895076

Kurniawan, D., Anggrawan, A., & Hairani. (2020). 
Graduation prediction system on students using 
C4.5 algorithm. MATRIK: Jurnal Manajemen, 
Teknik Informatika dan Rekayasa Komputer, 19(2), 
358‒366. https://doi.org/10.30812/matrik.v19i2.685

Manimekalai, K., & Kavitha, A. (2018). Missing value 
imputation and normalization techniques in 
myocardial infarction. ICTACT Journal on Soft 
Computing, 8(03), 1655‒1662. 

Marcinkevics, R., Reis Wolfertstetter, P., Wellmann, S., 
Knorr, C., & Vogt, J. E. (2021). Using machine 
learning to predict the diagnosis, management 

and severity of pediatric appendicitis. Frontiers 
in Pediatrics, 9, 1‒12. https://doi.org/10.3389/
fped.2021.662183

Menteri Pendidikan dan Kebudayaan Republik Indonesia. 
(2020). Peraturan Menteri Pendidikan dan 
Kebudayaan Republik Indonesia Nomor 3 Tahun 
2020 Tentang Standar Nasional Pendidikan Tinggi. 
Retrieved from https://jdih.kemdikbud.go.id/
arsip/Salinan%20PERMENDIKBUD%203%20
TAHUN%202020%20FIX%20GAB.pdf

Schonlau, M., & Zou, R. Y. (2020). The random 
forest algorithm for statistical learning. 
The Stata Journal, 20(1), 3–29. https://doi.
org/10.1177/1536867X20909688

Städler, N., Stekhoven, D. J., & Bühlmann, P. (2014). Pattern 
alternating maximization algorithm for missing data 
in high-dimensional problems. Journal of Machine 
Learning Research, 15(1), 1903‒1928.

Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—
Non-parametric missing value imputation for mixed-
type data. Bioinformatics, 28(1), 112‒118.

Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., 
Hastie, T., Tibshirani, R., ... & Altman, R. B. 
(2001). Missing value estimation methods for DNA 
microarrays. Bioinformatics, 17(6), 520-525. 

Wang, C., & Wu, H. (2018). A new machine learning 
approach to house price estimation. New Trends in 
Mathematical Sciences, 6(4), 165‒171. https://doi.
org/10.20852/ntmsci.2018.327

Yan, K. (2021). Student performance prediction using 
XGBoost method from a macro perspective. In 2021 
2nd International Conference on Computing and 
Data Science (CDS) (pp. 453–459). IEEE. https://
doi.org/10.1109/CDS52072.2021.00084

Yuliansyah, H., Imaniati, R. A. P., Wirasto, A., & Wibowo, 
M. (2021). Predicting students graduate on time 
using C4. 5 algorithm. Journal of Information 
Systems Engineering and Business Intelligence, 
7(1), 67‒73.