Restricted Maximum Likelihood Method as An Alternative Parameter Estimation in Heteroscedastic Regression

CAUCHY – Jurnal Matematika Murni dan Aplikasi
Volume 5(3) (2018), Pages 80-87
p-ISSN: 2086-0382; e-ISSN: 2477-3344
Submitted: 18 November 2016; Reviewed: 15 March 2018; Accepted: 30 November 2018
DOI: http://dx.doi.org/10.18860/ca.v5i3.3777

Dwi Masrokhah, Loekito Adi Soehono, Suci Astutik
Department of Statistics, Faculty of Mathematics and Natural Sciences, Brawijaya University, Malang, Indonesia
Email: dwi.masrokhah@gmail.com

ABSTRACT

Students are part of the community who have an income. Student income comes from pocket money, scholarships, part-time jobs, and so forth. Many students try to be trendsetters in their style of dress, and such consumption patterns strongly influence saving behavior. If savings increase, not only public funds but also investment will increase, and if investment increases, economic growth will also increase. The purpose of this research is to estimate multiple regression parameters using the REML method in modeling student's saving in the Faculty of Mathematics and Natural Science, Brawijaya University. The variables used were the student's age, the income of the student's parents, the student's pocket money, the student's additional income, the student's consumption, and the student's saving. The REML method can overcome heteroscedasticity of the error variance and provide an unbiased estimator. The model of student's saving using the REML method is as follows:

\[ \hat{Y}_i = -1609 + 112 X_1 + 0.0088 X_2 + 0.0504 X_3 + 0.4706 X_4 - 0.636 X_5 \]

Student's saving is significantly affected by the student's age ($X_1$), the amount of the student's additional income ($X_4$), and the amount of the student's consumption ($X_5$).
Keywords: Student's Saving, REML, regression, assumptions, Heteroscedasticity

INTRODUCTION

Regression analysis is used to build a functional model of data in order to explain or predict one phenomenon on the basis of other phenomena. Regression analysis was introduced by Sir Francis Galton (1822-1911). Its purpose is prediction based on the relationship between the predictor variables and the response variable [1]. Based on the shape of that relationship, regression analysis can be divided into linear regression and non-linear regression. Linear regression is an approach for modeling the relationship between a dependent variable y and one or more explanatory (independent) variables denoted by X.

The parameter estimation method most often used in multiple linear regression is Ordinary Least Squares (OLS). The OLS method minimizes the sum of squared residuals (errors). It requires several classical assumptions in order to yield an estimator that is the Best Linear Unbiased Estimator (BLUE). These assumptions concern the errors generated by the model: normality of the errors, non-autocorrelation, homoscedasticity, and non-multicollinearity.

Homoscedasticity is one of the important assumptions in regression analysis: the variance of the error term is constant. When it is not constant, the errors are heteroscedastic. Under heteroscedasticity, the estimation of the regression parameters gives too much weight to a small subset of the data, namely the subset where the error variance is largest. Restricted Maximum Likelihood (REML) is known as an unbiased parameter estimation method.
The REML method can be applied to models whose experimental errors are normally distributed, correlated, and of unequal variance. REML variance component estimation can be used even if the data do not meet the assumptions of analysis of variance [2].

Economics is one of the social fields that often uses regression analysis to make decisions. One well-developed economic theory is consumption theory. Consumption theory states that any individual who has an income is assumed to set aside the part of that income that remains after consumption [3]. The consumption pattern therefore significantly affects saving behavior. Indonesian society is known as a consumer society, which could lead to a low motivation to save. The benefits of saving include reducing consumerist patterns, practicing thrift, and building a reserve fund. If savings increase, not only public funds but also investment will increase [4]. Students are part of the community who have an income. The purpose of this research is to estimate multiple regression parameters using the REML method in modeling student's saving at the Faculty of Mathematics and Natural Science, Brawijaya University.

METHODS

The variables used in this study consist of one response variable and five predictor variables. The response variable is student's saving ($Y$). The five predictor variables that affect student's saving and are used in the research are:
$X_1$ = the student's age (years),
$X_2$ = the income of the student's parents (thousand rupiah),
$X_3$ = the student's pocket money (thousand rupiah),
$X_4$ = the student's additional income (thousand rupiah), and
$X_5$ = the student's consumption (thousand rupiah).
Linear regression analysis is a statistical method for modeling the relationship between a response variable and predictor variables. The model derived from regression analysis can be used to describe the phenomenon behind the data and to predict values of the response variable. Prediction is valid only within the range of the predictor values used to build the regression model [5]. The response variable is also called the dependent variable and is denoted by Y. Predictor variables are also called independent variables and are denoted by X.

A multiple linear regression model is a model in which one response variable is expressed as a function of more than one predictor variable:

\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i \tag{1} \]

where:
$i = 1, 2, \ldots, n$
$Y_i$ = response variable
$X_{1i}, X_{2i}, \ldots, X_{pi}$ = predictor variables
$\beta_0, \beta_1, \ldots, \beta_p$ = regression coefficients
$\varepsilon_i$ = error
$p$ = number of predictor variables.

Equation (1) has $(p + 1)$ unknown regression coefficients. The values $\{x_{1i}, \ldots, x_{pi};\ i = 1, \ldots, n\}$ are assumed fixed, and the errors $\{\varepsilon_i\}$ are assumed to be independent and normally distributed with mean 0 and variance $\sigma^2$.

In matrix form, equation (1) can be written as

\[
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix}
1 & x_{11} & \cdots & x_{p1} \\
1 & x_{12} & \cdots & x_{p2} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{1n} & \cdots & x_{pn}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\]

or

\[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \tag{2} \]

where
$\mathbf{Y}$ = response vector of size $(n \times 1)$
$\mathbf{X}$ = predictor matrix of size $(n \times (p + 1))$
$\boldsymbol{\beta}$ = regression coefficient vector of size $((p + 1) \times 1)$
$\boldsymbol{\varepsilon}$ = error vector of size $(n \times 1)$

The steps of the data analysis are as follows:

1. Estimate the parameters using OLS. Ordinary Least Squares is a parameter estimation method in regression analysis that minimizes the sum of squared errors. The OLS estimator of the parameter $\boldsymbol{\beta}$ is denoted by $\hat{\boldsymbol{\beta}}$.
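Based on model (2), the error vector is

\[ \boldsymbol{\varepsilon} = \mathbf{Y} - \mathbf{X}\boldsymbol{\beta} \tag{3} \]

so that

\[
\begin{aligned}
S(\boldsymbol{\beta}) = \sum_{i=1}^{n} \varepsilon_i^2 = \boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon}
&= (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) \\
&= \mathbf{Y}^T\mathbf{Y} - \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{Y} - \mathbf{Y}^T\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} \\
&= \mathbf{Y}^T\mathbf{Y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{Y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}
\end{aligned}
\]

The last step uses the properties of the transpose: $\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{Y}$ is a scalar, so it equals its own transpose $\mathbf{Y}^T\mathbf{X}\boldsymbol{\beta}$. The least squares estimator must then satisfy

\[ \left.\frac{\partial S}{\partial \boldsymbol{\beta}}\right|_{\hat{\boldsymbol{\beta}}} = -2\mathbf{X}^T\mathbf{Y} + 2\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0} \]

which simplifies to

\[ \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{Y} \tag{4} \]

Multiplying both sides of equation (4) by $(\mathbf{X}^T\mathbf{X})^{-1}$ gives the least squares estimator of $\boldsymbol{\beta}$:

\[
\begin{aligned}
(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \\
\mathbf{I}\hat{\boldsymbol{\beta}} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}
\end{aligned}
\]

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \tag{5} \]

2. Test the classical assumptions of multiple linear regression analysis. The model derived from multiple regression analysis must meet the classical assumptions: normally distributed errors, homoscedasticity of the error variance, non-autocorrelation, and non-multicollinearity.

The normality assumption states that the errors ($\varepsilon_i$) obtained from the regression model should follow the normal distribution. One method to test the normality of the errors is the Shapiro-Wilk test [6]. The hypotheses tested are:
$H_0$: $\varepsilon_i$ is normally distributed
$H_1$: $\varepsilon_i$ is not normally distributed

If $H_0$ is true, the Shapiro-Wilk test statistic is

\[ G = b_n + c_n \ln\left[\frac{T_3 - d_n}{1 - T_3}\right] \sim N(0, 1) \tag{6} \]

where

\[ T_3 = \frac{1}{D}\left[\sum_{i=1}^{n} a_i \left(X_{(n-i+1)} - X_{(i)}\right)\right]^2, \qquad D = \sum_{i=1}^{n} (\varepsilon_i - \bar{\varepsilon})^2 \]

The value of G can be compared with the standard normal (Z) distribution. The values $a_i$ are the Shapiro-Wilk coefficients for a given n. The values $b_n$, $c_n$, and $d_n$ are conversion constants that bring the Shapiro-Wilk statistic to an approximately normal distribution for n observations. If G is less than the critical value of the Z distribution, $H_0$ is accepted, which means that the experimental errors are normally distributed [7].

One of the assumptions of the classical regression model is homoscedasticity [8].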
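Returning briefly to step 1, the estimator in equation (5) can be illustrated numerically. The following is a minimal sketch in Python with NumPy, not the computation used in this study: the data are synthetic, and the function name `ols_normal_equations` and all variable names are hypothetical. It solves the normal equations (4) directly instead of forming the inverse, which gives the same $\hat{\boldsymbol{\beta}}$ when $\mathbf{X}^T\mathbf{X}$ is invertible.

```python
import numpy as np


def ols_normal_equations(X, y):
    """Return beta_hat solving the normal equations X^T X beta = X^T y (eq. 4)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Solving the linear system avoids explicitly inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)


# Synthetic illustration data (assumed values, not the survey data).
rng = np.random.default_rng(0)
n, p = 29, 2
predictors = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), predictors])  # column of ones for beta_0
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta_hat = ols_normal_equations(X, y)
residuals = y - X @ beta_hat  # e_i, the input to the diagnostic tests
```

By construction the residuals are orthogonal to every column of `X`, which is exactly what equation (4) states.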
If the error variance is not constant, the errors are said to be heteroscedastic. One method to detect heteroscedasticity is the Glejser test. After obtaining the residuals $e_i$ from the OLS regression, Glejser suggests regressing the absolute residuals $|e_i|$, as the response, on the predictor variables, based on the hypotheses:
$H_0$: $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_j^2 = \sigma^2$
$H_1$: at least one $j$ where $\sigma_j^2 \neq \sigma^2$

If $H_0$ is true, the test statistic is

\[ \frac{MS_{regression}}{MS_{error}} \sim F_{(p,\ n-(p+1))} \tag{7} \]

where

\[ MS_{regression} = \frac{\hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{Y} - n\bar{y}^2}{p}, \qquad MS_{error} = \frac{\mathbf{Y}^T\mathbf{Y} - \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{Y}}{n - (p + 1)} \]

If the test statistic is less than the critical point $F_{(p,\ n-p-1)}$, $H_0$ is accepted, which means that the error variance is homogeneous [8].

Autocorrelation is the correlation between members of a series of observations ordered in time (time series) or in space (cross-sectional data). To detect autocorrelation, the Durbin-Watson test is used, based on:
$H_0$: $\rho = 0$ (errors are independent)
$H_1$: $\rho \neq 0$ (errors are not independent)

The test statistic is

\[ d = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2} \tag{8} \]

where:
$d$ = Durbin-Watson statistic
$e_i$ = the $i$-th error value
$e_{i-1}$ = the $(i-1)$-th error value

$H_0$ is rejected if $d < d_L$ or $d > 4 - d_L$
$H_0$ is accepted if $d_U < d < 4 - d_U$
No decision if $d_L \le d \le d_U$ or $4 - d_U \le d \le 4 - d_L$

[...] $> \alpha$ and the test statistic G is less than the critical point of Z (1.96), so $H_0$ is accepted and it is concluded that the errors are normally distributed at the 95% confidence level.

Autocorrelation is detected using the Durbin-Watson test [10]. The hypotheses tested are:
$H_0$: $\rho = 0$ (errors are independent)
$H_1$: $\rho \neq 0$ (errors are not independent)

The test statistic is $d = 1.928$. From the Durbin-Watson table, $d_L = 1.464$ and $d_U = 1.768$. Because $d$ lies between $d_U$ and $4 - d_U$ (1.768 < 1.928 < 2.232), $H_0$ is accepted and it can be concluded that the non-autocorrelation assumption is met.

The Variance Inflation Factor (VIF) is one of the values used to detect the presence of multicollinearity [9].
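For reference, the VIF of the $j$-th predictor is commonly computed as $1 / (1 - R_j^2)$, where $R_j^2$ is the coefficient of determination from regressing that predictor on all the other predictors. The sketch below illustrates this definition with NumPy. It is an illustration under assumptions, not the software used in this study: the function name `vif` is hypothetical and the data are synthetic.

```python
import numpy as np


def vif(predictors):
    """Return VIF_j = 1 / (1 - R_j^2) for each column of `predictors`.

    `predictors` holds the predictor columns only (no intercept column).
    """
    Z = np.asarray(predictors, dtype=float)
    n, p = Z.shape
    out = np.empty(p)
    for j in range(p):
        target = Z[:, j]
        # Regress predictor j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(Z, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out


# Synthetic example (assumed data): c is almost a copy of a.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)

vif_independent = vif(np.column_stack([a, b]))    # unrelated predictors
vif_collinear = vif(np.column_stack([a, b, c]))   # a and c nearly collinear
```

In this synthetic setup the unrelated predictors give VIF values close to 1, while the nearly duplicated pair gives large values, which is the pattern a rule-of-thumb cutoff is meant to flag.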
The hypotheses tested are:
$H_0$: no multicollinearity
$H_1$: multicollinearity

Table 1. VIF value of each predictor variable

Predictor   VIF    Decision
$X_1$       1.04   $H_0$ accepted
$X_2$       1.13   $H_0$ accepted
$X_3$       1.05   $H_0$ accepted
$X_4$       1.15   $H_0$ accepted
$X_5$       1.06   $H_0$ accepted

Table 1 shows that the VIF of each predictor variable is less than 10, so $H_0$ is accepted and the non-multicollinearity assumption is met.

The homoscedasticity assumption for the error variance is tested using the Glejser test [8]. After obtaining $e_i$ from the OLS method, Glejser suggests regressing the absolute errors on the predictor variables. The hypotheses tested are:
$H_0$: $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_j^2 = \sigma^2$
$H_1$: at least one $j$ where $\sigma_j^2 \neq \sigma^2$

The p-value of the Glejser test is 0.006, so $H_0$ is rejected and it is concluded that the error variance is not homogeneous.

Estimation of Regression Parameter Using REML

Restricted Maximum Likelihood (REML) is an alternative method for estimating variance parameters, derived from the Maximum Likelihood Method (MLM) [2]. REML divides the parameters to be estimated into two parts: the fixed-effects parameters $\boldsymbol{\beta}$ and the variance parameter $\sigma^2$. The model obtained using the REML method is:

\[ \hat{Y}_i = -1609 + 112 X_1 + 0.0088 X_2 + 0.0504 X_3 + 0.4706 X_4 - 0.636 X_5 \]

The results of testing each parameter using the Wald test are shown in Table 2.

Table 2. The result of the partial test using the REML method

Parameter   p-value   Decision
$X_1$       0.001     $H_0$ rejected
$X_2$       0.615     $H_0$ accepted
$X_3$       0.218     $H_0$ accepted
$X_4$       0.017     $H_0$ rejected
$X_5$       0.017     $H_0$ rejected

Based on Table 2, the variables that affect student's saving are the student's age (years), the amount of the student's additional income (thousand rupiah), and the amount of the student's consumption (thousand rupiah).

Model Validation

To find out whether the model obtained in the study agrees with the actual conditions in the field, model validation is performed.
The predicted values from the REML method are then compared with the actual observed values using a paired t test. The summary of the paired t test is presented in Table 3.

Table 3. Paired t test of the predicted values from the REML method and the actual values

Data            N    Mean     Standard deviation
Y observation   29   408.62   364.76
Y prediction    29   399.19   184.22
Difference      29   9.43     182.10
p-value = 0.904

Based on the paired t test in Table 3, $H_0$ is accepted because the p-value is larger than 0.05, which leads to the conclusion that the model from the REML method can be used to predict student's saving.

CONCLUSION

The model of student's saving using the OLS method is as follows:

\[ \hat{Y}_i = -1855 + 121.5 X_1 + 0.0098 X_2 + 0.0524 X_3 + 0.559 X_4 - 0.585 X_5 \]

Student's saving is significantly affected by the student's age ($X_1$), the amount of the student's additional income ($X_4$), and the amount of the student's consumption ($X_5$).

The model of student's saving using the REML method is as follows:

\[ \hat{Y}_i = -1609 + 112 X_1 + 0.0088 X_2 + 0.0504 X_3 + 0.4706 X_4 - 0.636 X_5 \]

Student's saving is significantly affected by the student's age ($X_1$), the amount of the student's additional income ($X_4$), and the amount of the student's consumption ($X_5$).

Based on the results and the discussion, it can be concluded that the REML method can overcome heteroscedasticity of the error variance and produce an unbiased estimator.

REFERENCES

[1] Chatterjee, S., & Hadi, A. S. (2006). Regression Analysis by Example, Fourth Edition. New York: John Wiley and Sons, Inc.
[2] O'Neill, M. (2010). ANOVA & REML: A Guide to Linear Mixed Models in an Experimental Design Context. Statistical Advisory & Training Service Pty Ltd.
[3] Browning, M., & Lusardi, A. (1996). Household Saving: Micro Theories and Micro Facts. Journal of Economic Literature.
[4] Dupas, P., & Robinson, J. (2009). Savings Constraints and Microenterprise Development: Evidence from a Field Experiment in Kenya. National Bureau of Economic Research Working Paper.
[5] Draper, N. R., & Smith, H. (1998). Applied Regression Analysis, Third Edition. New York: John Wiley and Sons.
[6] Shapiro, S. S., & Wilk, M. B. (1965). An Analysis of Variance Test for Normality. Biometrika.
[7] Razali, N., & Wah, Y. B. (2011). Power Comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling Tests. Journal of Statistical Modeling and Analytics.
[8] Gujarati, D. (2004). Basic Econometrics, Fourth Edition. New York: McGraw-Hill.
[9] Alauddin, M., & Nghiem, H. S. (2010). Do Instructional Attributes Pose Multicollinearity Problems? An Empirical Exploration. Economic Analysis and Policy.
[10] Gujarati, D., & Porter, D. C. (2012). Dasar-dasar Ekonometrika (Jilid 2), Terjemahan Raden Carlos Mangunsong. Jakarta: Salemba Empat.