Comparing Several Missing Data Estimation Methods in Linear Regression;Real Data Example and A Simulation Study


CAUCHY –Jurnal Matematika Murni dan Aplikasi 
Volume 7(4) (2023), Pages 641-648 
p-ISSN: 2086-0382; e-ISSN: 2477-3344 

Submitted: February 27, 2023 Reviewed: April 01, 2023 Accepted: April 23, 2023 
DOI: http://dx.doi.org/10.18860/ca.v7i4.20548  

Comparing Several Missing Data Estimation Methods in Linear 
Regression;Real Data Example and A Simulation Study 

Anwar Fitrianto1,2*, Yap Ee Jia3, Budi Susetyo1 , La Ode Abdul Rahman1  

1Department of Statistics, IPB University, Bogor, West Java, Indonesia 
2Institute of Engineering Mathematics, Universiti Malaysia Perlis, Malaysia 

3Department of Mathematics and Statistics, Universiti Putra Malaysia, Serdang, Malaysia  
 

Email: anwarstat@gmail.com 

ABSTRACT 

One of the causes of bias in parameter estimation is incomplete data when analyzed using standard 
statistical procedures. In addition, if the analysis is performed on missing data, the researcher may not have 
sufficient observations necessary for the analysis. For this reason, a method is needed to estimate the 
missing data. Until now, there are many methods for estimating lost data.. The main objective of the study is 
to compare the performance between listwise deletion (LD), mean substitution (MS) and multiple 
imputation (MI) methods in estimating parameters.  The performance will be measured through bias, 
standard error and 95% confidence interval of interested estimates for handling missing data with 10% 
missing observations. A complete empirical data set was used and assumed as population data. Ten percent 
of total observations in the population ere set as missing arbitrarily by generating random numbers from a 
uniform distribution,  . Then, bias of parameter estimates and confidence interval of parameter estimates 
are calculated to compare the three methods. A Monte Carlo simulation was carried out to know the 
properties of missing data estimation methods. Simulation of 1000 sampled data with 20, 50, and 100 
observations and each sample is set to have 10% missing observations. Standard statistical analyses are run 
for each missing data and get the average of parameter estimates to calculate the bias and standard error of 
parameter estimates for every missing data method. The analysis was conducted by using SAS version 9.2. 
It was found that the MI method provided the smallest bias and standard error of parameter estimates and 
a narrower confidence interval compared to the LD and MS methods. Meanwhile, the LD method gives a 
smaller bias of parameter estimates and standard error for small sample size of missing data. And, MS 
method is strongly recommended not to use for handling missing data because it will result in large bias 
and standard error of parameter estimates. 

Copyright © 2023 by Authors, Published by CAUCHY Group. This is an open access article under the CC BY-
SA License (https://creativecommons.org/licenses/by-sa/4.0/) 
 
Keywords: incomplete, mcar, missing, regression, simulation 

INTRODUCTION 

Many researchers try to solve the problems of missing data analysis by using last 
observation carried forward (LOCF), complete case, available case, and single imputation 
method [1]. The LOCF is a method on substituting the last available measurement 
whenever there is a missing value [2]. This method was popular for treating data with 
monotone and non-monotone missing patterns and is valid for longitudinal studies. 
Beside that, many people use complete case methods such as the Listwise Deletion (LD) 

http://dx.doi.org/10.18860/ca.v7i4.20548
mailto:anwarstat@gmail.com
https://creativecommons.org/licenses/by-sa/4.0/


Comparing Several Missing Data Estimation Methods in Linear Regression;Real Data Example and 
A Simulation Study 

Anwar Fitrianto 642 

method to solve the missing data problem, [3]. LD is the simplest and easiest method 
among conventional methods of dealing with missing data, [4]. Although LD is the 
simplest method, the number of deleted cases will increase as the values are missing 
arbitrarily. The major advantage of listwise deletion is that it produces a complete data 
set, which in turn allows for the use of standard analysis techniques, [5]. However, there 
is a loss in power using LD method even though the data is MCAR, especially since a large 
number of subjects have been deleted. 

Furthermore, pairwise deletion (PD) is another case-based method. It uses all 
available information in statistical analysis. At the same time, the mean substitution (MS) 
method is also popular since it is one of the single imputation methods. A study by [6] 
found that overall mean computed from mean substitution (MS) is equal to the complete 
case values but the variance of the same variable is underestimated. Meanwhile, [7] 
defined the MS method as an approach to substitute the missing items with the mean of 
the non-missing items. It could be problematic when applying to categorical data.   
Although the MS method can be used to analyze a complete case, it will reduce the 
variability and weaken covariances and correlation estimates in the data, [8]. 

Another method is Multiple Imputation (MI) proposed by [3]. It is conducted by 
replacing each missing value with a set of plausible values. In the MI method, missing 
values for any variable are predicted using existing values from other variables. The 
predicted values, called “imputes”, are substituted for the missing values, resulting in a 
full data set called an imputed data set, [9]. [10] provided three phases for MI inference. 

Conventional statistical methods and analysis tools presume that all variables in a 
specified model are measured for all cases. The default method for all statistical software 
is simply deleting the cases with missing value on the interesting variables such as LD 
method. [11] supported that if the standard approach analysis is used to analyze 
incomplete data, the estimation of such analysis can be biased. Complete case analysis by 
ignoring the missing data will lead to an inefficient, biased, unreliable result. Missing data 
results in information loss and statistical power, [12]. Meanwhile, for a small data set that 
contains a relatively large number of missing observations, many cases will be simply 
deleted by the default method. Based on that fact, the research aims to compare the 
performance of Listwise Deletion (LD) and Mean Substitution (MS) methods on bias, 
standard error and 95% confidence interval. 

 
METHOD 

Data 

The data set involved in this study is about human resources obtained from Human 
Development Reports of United Nationals Development Programme, [13]. Several 
variables are involved, such as the human development index, life expectancy at birth and 
gross national income variable. Human development enlarges people’s choices. The most 
critical of these wide-ranging choices are to live a long and healthy life, be educated and 
access the resources needed for a decent living standard. The measure includes life 
expectancy, literacy, and a modified measure of income, [14]. 

Meanwhile, life expectancy at birth reflects the overall mortality level of a population. 
It summarizes the mortality pattern across all age groups from children to the elderly,  
[15]. Life expectancy relates to the value people attach to living long, healthy, and well. 
The life expectancy at birth component in the HDI data is calculated using a minimum 
value of 20 years and a maximum value of 85 years. The other variable included in the 
study is gross national income. According to [16], the gross national income is the total 


Comparing Several Missing Data Estimation Methods in Linear Regression;Real Data Example and 
A Simulation Study 

Anwar Fitrianto 643 

domestic and foreign output claimed by residents of a country, consisting of the gross 
domestic product plus factor incomes earned by foreign residents minus income earned 
in the domestic economy by non-residents. In the data, the standard of living dimension 
is measured by gross national income per capita. The human development index is 
expected to be strongly affected by life expectancy at birth and gross national income. 
Thus, human development index is response variable, denoted by  , life expectancy at birth   
and gross national income per capita   are the predictors. 

Simulation Designs 

Monte Carlo simulation was conducted to compare performance between missing data 
estimation methods in estimating missing values. According to [17], the researcher begins 
the simulation by creating a model with known population parameters that the 
researcher sets the values. In the Monte Carlo simulation, B  samples of N size are drawn 
from the population and, for each sample, estimates the interested parameter. Next, a 
sampling distribution is estimated for each population parameter by collecting the 
parameter estimates from all the samples. The properties of that sampling distribution, 
such as its mean and variance, come from this estimated sampling distribution. 

SAS version 9.2 was used for the simulation. There are few steps involved in the 
simulation: 
Step 1: Develop a simple linear regression model with known parameters. 
The simulation is begun with a simple linear regression model 𝑦 = 𝛽0 + 𝛽1𝑥 + 𝜀 with 
arbitrary known parameters 𝛽0=5 and 𝛽1=10. The independent variable 𝑋 is generated as 
2 ×  observation number. In this simulation, the number of observations was set equal to 
20. Then random errors were generated from a normal distribution 𝜀~𝑁(0, 𝜎 = 2), 

Step 2: Generate the missing data set 
Let’s d  is the number of variable observations in the generated data set that are missing 
arbitrarily and 𝑑 < 𝑛. The missing measurements are selected by generating a random 
number from a uniform distribution 𝑈(0,1). The measurements of the variable 𝑋1 with 
the first d smallest random numbers are made to be missing by denoted as dot “.” . Now, 
a missing data set are created, 

Step 3: Repeat B  samples from a population 
Repeat step 2 for 1000 times to generate 1000 missing data sets with 20 observations, 

Step 4: Missing data analysis using LD, MS and MI method 
For each missing data set, statistical analysis are carried out to calculate the parameter 
estimates of the linear regression model by using LD, MS and MI method,  

Step 5: Calculate the standard error (SE) of parameter estimates and confidence intervals 
After obtaining 1000 parameter estimates for the linear regression model based on each 
missing data estimation method, bias, standard errors and confidence intervals of the 
parameters are calculated as follows: 

𝐵𝑖𝑎𝑠 = �̅�𝑖 − 𝛽𝑖 = 𝑖 = 1, 2, ...., 1000. (1) 

𝑆𝐸 =
√∑ �̄�𝑖

2
−

(∑ �̄�𝑖)
2

𝑛

𝑛 − 1
(2)

 
Comparing Several Missing Data Estimation Methods in Linear Regression;Real Data Example and 
A Simulation Study 

Anwar Fitrianto 644 

where 𝛽0 = 5 and 𝛽0 = 10. Meanwhile, the 95% confidence interval of 𝛽𝑖  is calculated as: 

�̄�𝑖 ± 𝑡0.05/2;𝑑𝑓
√

∑ �̄�𝑖
2 −

(∑ �̄�𝑖)
2

𝑛

𝑛 − 1
, (3)

 
Step 6: Repeat Step 1 to step 5 for 50 and 100 observations, respectively. 

Mechanism of Missing Data 

[3] had developed a taxonomy and terminology of missing data mechanisms. The 
missing data mechanisms include missing completely at random (MCAR), missing at 
random (MAR), and missing not at random (MNAR). It is essential to understand the 
mechanism because the problems caused by the missing data and the solutions to these 
problems are different.  

Regarding tests for missing data, [18] proposed an MCAR test that is distributed as a 
𝜒2 under the 𝐻0 and called 𝑑

2. The Little’s MCAR test was conducted using a custom SAS 
macro. Little’s MCAR test is a chi-square test to determine whether data is MCAR, [19]. 
Most researchers use Little’s MCAR test to test MCAR. The 𝑑2 sums the squared 
standardized mean differences across the J  missing data patterns. The 𝑑2 is Mahalanobis 
distances which is written as: 

𝑑2 = ∑ 𝑚𝑗 (�̄�𝑜𝑏𝑠.𝑗 − �̂�𝑜𝑏𝑠.𝑗 )

𝐽

𝑗=1

∑̂𝑜𝑏𝑠.𝑗
−1 (�̄�𝑜𝑏𝑠.𝑗 − �̂�𝑜𝑏𝑠.𝑗 ), (4) 

 
where 𝑚𝑗  is the number of cases in pattern 𝑗, 𝑑𝑓 =∑𝑝𝑗 − 𝑝, where 𝑝𝑗  is the number of 

complete variables for pattern j.  The estimates �̂� and ∑̂ shown in Equation (4) are 
obtained via maximum likelihood estimation using Expectation Maximization (EM) 
algorithm. It is an iterative procedure that produces Maximum Likelihood (ML) estimates 
of a covariance matrix and mean vector, assuming MAR. The data is considered as MCAR 
if the p value from Little’s MCAR test is insignificant. Unfortunately, no standard statistical 
test determines if the missing data is MAR, [20].  

The three common missing data mechanisms are as follows: 

Missing Completely At Random 

According to [11], MCAR mechanism indicates that the probability of a missing value 
is unrelated to any observed and unobserved values.  [21] explained the MCAR by 
denoting a single variable with missing data as 𝑊. Suppose a set of variables is 
represented by vector 𝑍. Now, let 𝑅𝑊be a dummy variable with a value 1 if 𝑊is missing 
and 0 if 𝑊is observed. The expression of MCAR mechanism is written as: 

𝑃𝑟(𝑅𝑊 = 1|𝑊, 𝑍) = 𝑃𝑟(𝑅𝑊 = 1) (5) 

 
The probability that 𝑊 is missing depends neither on the observed variables in 𝑍 
vector nor on the missing values 𝑊itself. Complete case analysis will only give unbiased 
result if the missing data is assumes MCAR. 


Comparing Several Missing Data Estimation Methods in Linear Regression;Real Data Example and 
A Simulation Study 

Anwar Fitrianto 645 

Missing At Random 

MCAR mechanism is a special case of MAR mechanism. For the data to be MAR, the 
probability of 𝑊is missing depends on observed variables but independent on 𝑊itself. 
[22] claimed that the data is MAR when the missingness is correlated with other observed 
variables in the analysis. The MAR mechanism is expressed as follows: 

𝑃𝑟(𝑅𝑊 = 1|𝑊, 𝑍) = 𝑃𝑟(𝑅𝑊 = 1|𝑍) (6) 

Since missing 𝑊is arising from the same distribution as the observed 𝑍, thus missing 
𝑊can be predicted by using the observed 𝑍 distribution, [11]. MCAR or MAR data are 
sometimes considered ignorable missingness, [23]. This is because for those data still can 
produce unbiased estimates without needing a model to explain the missingness. 

Missing Not At Random 

If the data are not MCAR or MAR, then the data would be MNAR, where the probability 
of 𝑊is missing depends on both observed and missing variables. The data for MNAR is a 
case of non-ignorable missingness. The missing data mechanism must be modeled as part 
of the estimation process for a valid estimation. Unfortunately, MCAR mechanism is 
fraught with difficulty. The model for the missing data mechanism must be carefully 
tailored with each situation because every MNAR situation is different, [21]. She also 
stated that there is no information in data to help choose the appropriate model and no 
statistic to tell how well a chosen model fits the data. 

RESULTS AND DISCUSSION  

Bias of Parameter Estimates 

As shown in Table 1, MI results in the least bias in parameter estimates. For 𝑏0, there 
is only 0.0047 bias from the true parameter estimates from the original data. The 𝑏1 can 
be said to be unbiased parameter estimates since the bias value is small enough, which is 
only -0.0001. While 𝑏2 have biased downward with 0.0038. The three-parameter 
estimates using the MI method yield good parameter estimates where their biases are 
nearly zero. 
 

           Table 1. Bias of Parameter Estimates of LD, MS and MI Method 

Methods 
Bias of Parameter Estimates 

𝒃𝟎 𝒃𝟏 𝒃𝟐 

LD -0.0170 0.0002 0.0202 

MS 0.2716 -0.0020 -0.4992 

MI 0.0047 -0.0001 -0.0038 

 
The MI was expected to have the smallest bias compared to the other two methods. 
Although the missing data is MCAR, since the human development index and life 
expectancy at birth are highly correlated, it helps predict the missing value nearest to the 
original value. In the MS method, each missing value is substituted by the mean of the 
observed variable. It means that all missing values are replaced with same value. It causes 
the substituted values are not similar or nearer to the original value since a single mean 
value only substitutes every missing value. That is the reason MS method yields the most 
extensive bias among these three methods.  


Comparing Several Missing Data Estimation Methods in Linear Regression;Real Data Example and 
A Simulation Study 

Anwar Fitrianto 646 

Table 2. 95% Confidence Interval and Corresponding Length of Parameter Estimates of LD, MS and MI 
Methods 

Estimator 𝒃𝟎 𝒃𝟏 𝒃𝟐 

True Value 0.8252 0.00736 -2.5594 

Method LCI UCI Length LCI UCI Length LCI UCI Length 

LD 0.6739 0.9426 0.2687 0.0065 0.0086 0.0021 -2.7980 -2.2802 0.5178 

MS 0.9738 1.2199 0.2461 0.0044 0.0064 0.0020 -3.2985 -2.8186 0.4800 

MI 0.7015 0.9583 0.2567 0.0063 0.0083 0.0020 -2.8077 -2.3187 0.4889 

 
Meanwhile, the performance of each estimation method of missing data about 95% 
confidence interval is displayed in Table 2. The MS method gives the smallest confidence 
interval length based on the table. But, true parameter values from MS method did not fall 
in the interval. Hence, the MS method was not suitable for estimating the true value. But, 
the LD and MI methods can help estimate true value since true value falls in the confidence 
interval. MI method was preferable than the LD method in estimating parameters since it 
can produce a narrower confidence interval. 

Results of The Simulation Study 

When there are 10% missing observations in a small data set (n=20), for after 1000 
simulation runs, the MS method provides the least bias of parameter estimates. However, 
the bias of parameter estimates for MS method increases as the sample size increases. In 
other words, the MS method is not good enough to handle medium or large numbers of 
missing data since it will produce large bias.       
 

Table 3. Bias of Parameter Estimates from 1000 Simulation Runs for LD, MS and MI with 10% 
Missing Data in 20, 50, and 100 Observations, respectively 

Number of 
Simulation  

Number of 
Observations 

Number of 
Missing 

Observations  
Methods 

Bias of  
Parameter Estimates 

𝑏0 𝑏1 

1000 

20 2 LD -1.0013 0.0125 

   MS 0.0969 0.0070 

    MI -1.0950 0.0141 

50 5 LD -0.8805 0.0093 

   MS 2.0675 -0.0024 

    MI -0.7981 0.0071 

100 10 LD -0.8420 0.0083 

   MS 6.3112 -0.0095 

    MI -0.7075 0.0062 

 
Meanwhile, the LD and MI methods are better for handling many missing data. Bias of 
𝑏0 produced by LD method for small number of missing data is negative with -1.0013 and 
𝑏1 is 0.0125. When 𝑛 is increases to 50, bias of parameter estimates reduces. Table 3  
shows that the bias for 𝑏0 and 𝑏1 in LD method decrease as the number of observations 
increases. As we see in Table 3, MI method follows the trends as LD method that the bias 
of parameter estimates is reduced as 𝑛 increases. However, we notice that the bias from 
MI method reduces more drastically compared to LD method as 𝑛 increases.  

The results from the simulation study follow the results from the empirical data where 
the MI method gives the least bias for a large number of missing data. 
 

Comparing Several Missing Data Estimation Methods in Linear Regression;Real Data Example and 
A Simulation Study 

Anwar Fitrianto 647 

Table 4. Standard Error of Parameter Estimates from 1000 Simulation Runs for LD, MS and 
MI with 10% Missing Data in 20, 50, and 100 Observations, respectively. 

Number of 
Simulation  

Number of 
Observations 

Number of 
Missing 

Observations  
Methods 

Standard Error of  
Parameter Estimates 

𝑏0 𝑏1 

1000 

20 2 LD 0.2368 0.0109 

   MS 10.3413 0.0763 

    MI 0.2411 0.0110 

50 5 LD 0.2305 0.0086 

   MS 14.6124 0.0744 

    MI 0.1466 0.0045 

100 10 LD 0.1838 0.0062 

   MS 20.4287 0.0647 

    MI 0.1110 0.0010 

 
Table 4 shows the standard error of parameter estimates for LD, MS and MI method. 
The standard error of 𝑏0 from MS method is increase as 𝑛 increases. And, although the 
standard error of 𝑏1 decreases as 𝑛 increases, but the standard error of 𝑏1 is the largest 
compared to the standard error of parameter estimates from LD and MI methods. Overall, 
the MS method is not suggested to handle missing data because it will result in the largest 
standard error of parameter estimates if we compare to the LD and MI methods. 

On the other hand, the LD and MI method provide a small standard error of parameter 

estimates. The LD method provides the smallest standard error of parameter estimates when n
=20. However, the MS method gives the smallest standard error of parameter estimates, 

increasing to 50 and 100. 

CONCLUSIONS AND FUTURE WORKS 

Missing data may result in biased parameter estimates. However, a different sample 
size of missing data is recommended by various methods. The LD method is more 
appropriate for treating small missing data because it gives negligible bias and standard 
error of parameter estimates. However, the MS method is strongly recommended not to 
use for handling missing data due to large bias and standard error of parameter estimates. 
From the study result, MI method can reduce the bias of parameter estimates of missing 
data. Based on that fact, it is strongly recommended to analyze missing data using MI 
method rather than LD method and MS method because the MI method results in smaller 
bias and standard error of parameter estimates. 

In the future, analyses of data with missing values in two or more variables are 
suggested. In the statistical analysis, other than standard error and bias of parameter 
estimates, analysts can also compare the mean and efficiency of different missing data 
estimation methods at different missing percentages. Furthermore, analyzing missing 
data with different mechanisms or patterns can also be a research topic.   

REFERENCES 

 
[1] S. Ghosh and P. Pahwa, “Assessing bias associated with missing data from Joint 
Canada,” in United States Survey of Health: an application, paper presented at the 
Joint Statistical Meetings, Denver, CO, USA, 2008. 


Comparing Several Missing Data Estimation Methods in Linear Regression;Real Data Example and 
A Simulation Study 

Anwar Fitrianto 648 

[2] G. Molenberghs and G. Verbeke, “Multiple imputation and the expectation-
maximization algorithm,” Models for discrete longitudinal data, pp. 511–529, 2005. 

[3] D. B. Rubin, Multiple imputation for nonresponse in surveys, vol. 81. John Wiley & 
Sons, 2004. 

[4] R. L. Carter, “Solutions for missing data in structural equation modeling.,” Research 
& Practice in Assessment, vol. 1, pp. 4–7, 2006. 

[5] A. N. Baraldi and C. K. Enders, “An introduction to modern missing data analyses,” J 
Sch Psychol, vol. 48, no. 1, pp. 5–37, 2010. 

[6] R. J. A. Little, “Regression with missing X’s: a review,” J Am Stat Assoc, vol. 87, no. 
420, pp. 1227–1237, 1992. 

[7] P. R. de Gil and J. D. Kromrey, “Missing_Items: A SAS® Macro for Missing Data 
Imputation in Summative Response Scales”. 

[8] K. F. Widaman, “Best practices in quantitative methods for developmentalists: III. 
Missing data: What to do with or without them.,” Monogr Soc Res Child Dev, 2006. 

[9] J. C. Wayman, “Multiple imputation for missing data: What is it and how can I use 
it,” in Annual Meeting of the American Educational Research Association, Chicago, IL, 
2003, vol. 2, p. 16. 

[10] Y. C. Yuan, “Multiple imputation for missing data: Concepts and new development 
(Version 9.0),” SAS Institute Inc, Rockville, MD, vol. 49, no. 1–11, p. 12, 2010. 

[11] T. E. Raghunathan, “What do we do with missing data? Some options for analysis of 
incomplete data,” Annu. Rev. Public Health, vol. 25, pp. 99–117, 2004. 

[12] C.-Y. J. Peng, M. Harwell, S.-M. Liou, and L. H. Ehman, “Advances in missing data 
methods and implications for educational research,” Real data analysis, vol. 3178, 
p. 102, 2006. 

[13] UNDP, “United Nations Development Programme: Human Development Reports 
2015.” Oxford University Press Oxford, 2014. 

[14] G. Ranis, F. Stewart, and E. Samman, “Human development: beyond the human 
development index,” Journal of Human Development, vol. 7, no. 3, pp. 323–358, 
2006. 

[15] W. H. Organization, The world health report 2006: working together for health. 
World Health Organization, 2006. 

[16] G. Chamberlin, “Gross domestic product, real income and economic welfare,” 
Economic & Labour Market Review, vol. 5, pp. 5–25, 2011. 

[17] P. Paxton, P. J. Curran, K. A. Bollen, J. Kirby, and F. Chen, “Monte Carlo experiments: 
Design and implementation,” Structural Equation Modeling, vol. 8, no. 2, pp. 287–
312, 2001. 

[18] R. J. A. Little, “A test of missing completely at random for multivariate data with 
missing values,” J Am Stat Assoc, vol. 83, no. 404, pp. 1198–1202, 1988. 

[19] T. G. Morrison, M. A. Morrison, and J. M. McCutcheon, “Best practice 
recommendations for using structural equation modelling in psychological 
research,” Psychology, vol. 8, no. 09, p. 1326, 2017. 

[20] T. Schwartz and R. Zeig-Owens, “Knowledge (of your missing data) is power: 
handling missing values in your SAS dataset,” in SAS Global Forum, 2012, pp. 1–8. 

[21] P. D. Allison, Missing data, vol. 200210, no. 9781412985079.31. Sage Thousand 
Oaks, CA, 2010. 

[22] D. C. Howell, “The treatment of missing data,” The Sage handbook of social science 
methodology, vol. 208, p. 224, 2007. 

[23] T. D. Pigott, “A review of methods for missing data,” Educational research and 
evaluation, vol. 7, no. 4, pp. 353–383, 2001