Ratio Mathematica Vol. 32, 2017, pp. 63-75. ISSN: 1592-7415, eISSN: 2282-8214

Teaching Least Squares in Matrix Notation

Guglielmo Monaco¹, Aniello Fedullo²
¹ Department of Chemistry and Biology "A. Zambelli", University of Salerno, Italy; gmonaco@unisa.it
² Department of Physics "E. R. Caianiello", University of Salerno, Italy; afedullo@unisa.it

Received on: 30-05-2017. Accepted on: 27-06-2017. Published on: 30-06-2017. doi: 10.23755/rm.v32i0.335
© G. Monaco and A. Fedullo

Abstract

Material for teaching least squares at the undergraduate level in matrix notation is reported. The weighted least squares equations are first derived in matrix form; their equivalence with the results obtained by standard algebra is then shown for the weighted average and for the simplest linear regression. Indicators of goodness of fit are introduced and interpreted. Finally, a basic equation for resampling is derived.

Keywords: coefficient of determination, weighted sample mean, resampling, undergraduate education.

2010 AMS subject classifications: 62J05.

1 Introduction

Statistics is a topic never missing from first-degree courses of scientific programs. Very soon, often in the second undergraduate year, the basic knowledge of random variables and distributions is complemented by simple linear regression, a necessary tool for the interpretation of experimental data gathered in the laboratories. Indeed, the critical practice of linear regression often forms students' basic awareness of data analysis. The advent of powerful and handy software has, on the one hand, reduced the effort required of students to accomplish the needed calculations; on the other hand, it has given them the possibility to easily perform more advanced statistical analyses [1, 2], which they cannot really understand on the grounds of the course alone. One of the simplest of such more advanced analyses is the consideration of more regressors, the starting point of multivariate data analysis [3]. Although a specific course in the last undergraduate or first graduate year can be very profitable, we have found that, provided the students have a basic knowledge of linear algebra, generalized least squares can be taught in the second undergraduate year with reasonable appreciation from the class. Reference textbooks on the matter, seemingly more widespread in the community of econometrics [5] than in that of the experimental sciences [6], are not lacking. However, we needed to compact some fundamental concepts and equations, and still convince the students that the more general matrix form of least squares allows one to easily retrieve the results obtainable with standard algebra. Thus, we prepared the following material, and we presented it effectively in a 12-hour module together with numerical exercises. Although our lessons obviously have a significant overlap with reference textbooks, the revised simple linear regression and the introduction of the (adjusted) weighted coefficient of determination are not easily retrieved from any of the textbooks known to us.

2 Matrix Form of the Weighted Least Squares

We consider $n$ measures $\{y_1, y_2, \ldots, y_n\}$ and, for each of them, say the $i$-th one, the regressors $\{x_{i1}, x_{i2}, \ldots, x_{ip}\}$, here assumed constant, which generally come from different associated measurements. We will assume that for each measure the first regressor equals one, $x_{i1} = 1$, in order to take into account the so-called intercept.
The linear regression model connects the above quantities by
$$y_i = \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i, \qquad i = 1, 2, \ldots, n, \qquad (1)$$
where $\beta_1, \beta_2, \ldots, \beta_p$ are the parameters to be estimated and $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are random errors, assumed independent and possibly normally distributed, with mean 0 and standard deviations $\sigma_1, \sigma_2, \ldots, \sigma_n$. Ordinary least squares (OLS) and weighted least squares (WLS), also called homoskedastic and heteroskedastic regressions, are the names used to distinguish the special case of equal values for all standard deviations from the case of different values. The equations for WLS of course also apply to the special OLS case.

Dividing eq. 1 by $\sigma_i$, i.e. setting $z_i := y_i/\sigma_i$, $q_{ij} := x_{ij}/\sigma_i$, $\varsigma_i := \varepsilon_i/\sigma_i$, and using matrix notation, the model is written as
$$z = Q\beta + \varsigma, \qquad (2)$$
or, equivalently, $W^{1/2}y = W^{1/2}X\beta + W^{1/2}\varepsilon$, where $W$ is a diagonal matrix whose elements $W_{ii} := w_i = \sigma_i^{-2}$ are known as statistical weights, $z$ and $\beta$ are column matrices of $n$ and $p$ elements, respectively, and $Q$ is a matrix of dimension $n \times p$. It should be noticed that $Q\beta$ is the expectation value of $z$, i.e. $Q\beta = \langle z \rangle$.

Under these hypotheses the least squares method gives an estimate of the model parameters by the minimization with respect to $\beta$ of the functional
$$SS := \varsigma^T\varsigma = (z - Q\beta)^T(z - Q\beta) \qquad (3)$$
$$= z^Tz - 2\beta^TQ^Tz + \beta^TQ^TQ\beta, \qquad (4)$$
where it has been considered that $\beta^TQ^Tz = z^TQ\beta$.

The estimates of the parameters by the least squares method are the solutions of the equations $\partial SS/\partial\beta_i = 0$, for $i = 1, 2, \ldots, p$, one for each model parameter. The computation of the derivative with respect to the vector of the parameters gives
$$-Q^Tz + Q^TQ\beta = 0, \qquad (5)$$
whose solution
$$\hat{\beta} = VQ^Tz \qquad (6)$$
is, by definition, the least squares estimator of $\beta$, where $V := C^{-1}$ and $C := Q^TQ$, which we will assume always invertible.

We note that $\hat{\beta}$ is an unbiased estimator of $\beta$; indeed, from eqs. 2 and 6 we have
$$\langle\hat{\beta}\rangle = VQ^T\langle z\rangle = VQ^TQ\beta = VC\beta = \beta. \qquad (7)$$

An unbiased behavior also characterizes the weighted sample mean. Indeed, eq. 5 for $\beta = \hat{\beta}$ gives $Q^Tz = Q^T\hat{z}$ which, rewritten in the original variables, is $X^TWy = X^TW\hat{y}$. From this and from the initial hypothesis $x_{i1} = 1$ for any $i$, one gets $\sum_i w_iy_i = \sum_i w_i\hat{y}_i$, which, divided by $\sum_i w_i$, shows that the weighted sample mean of the fitted values equals the weighted sample mean of the measures:
$$\bar{y}_w = \bar{\hat{y}}_w. \qquad (8)$$

Given $\delta := \hat{\beta} - \beta$, from eqs. 6 and 7 one gets
$$\delta = VQ^T\varsigma, \qquad (9)$$
which allows one to easily compute the covariance matrix of the parameters, showing that it coincides with $V$:
$$\langle\delta\delta^T\rangle = VQ^T\langle\varsigma\varsigma^T\rangle QV = VQ^TIQV = V,$$
where $I$ denotes the identity matrix. The standard deviations of the estimators of the parameters are given by the square roots of the diagonal elements of $V$.

Using the fitted values, one can write $z = \hat{z} + (z - \hat{z}) = Q\hat{\beta} + e$, where $e$ is known as the vector of residuals, whose analysis is the object of much attention in the literature. The fitted values are often written as
$$\hat{z} = Q\hat{\beta} = QVQ^Tz =: Hz, \qquad (10)$$
where we have introduced the symmetric matrix $H$, known as the hat matrix, as it 'puts the hat on $z$'. This matrix is readily verified to be idempotent, $H^2 = QVQ^TQVQ^T = H$, a feature which readily allows one to demonstrate the useful property of orthogonality of residuals and fitted values: $(z - \hat{z})^T\hat{z} = z^T(I - H)Hz = 0$.
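The results obtained so far can be illustrated numerically with a few lines of linear-algebra code. The following minimal sketch (Python with NumPy; the synthetic data, the true parameters 2 and 0.5 and the noise model are illustrative assumptions, not taken from the text) builds $Q$ and $z$, solves the normal equations of eq. 5, and checks the orthogonality of residuals and fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic heteroskedastic data for the model y = 2 + 0.5*x + error (assumed values)
n = 50
x = np.linspace(0.0, 10.0, n)
sigma = 0.2 + 0.05 * x                       # known standard deviations
y = 2.0 + 0.5 * x + rng.normal(0.0, sigma)   # measures

# Design matrix X (first column of ones for the intercept) and relative regressors Q = W^{1/2} X
X = np.column_stack([np.ones(n), x])
w = sigma ** -2                              # statistical weights, W_ii = sigma_i^{-2}
Q = X * np.sqrt(w)[:, None]
z = y * np.sqrt(w)

# Normal equations (eq. 5): C beta = Q^T z, with C = Q^T Q
C = Q.T @ Q
beta_hat = np.linalg.solve(C, Q.T @ z)       # eq. 6
V = np.linalg.inv(C)                         # covariance matrix of the estimators
param_sd = np.sqrt(np.diag(V))               # standard deviations of the estimators

# Hat matrix and orthogonality of residuals and fitted values
H = Q @ V @ Q.T                              # eq. 10
z_hat = H @ z
print("beta_hat:", beta_hat, " sd:", param_sd)
print("residuals orthogonal to fitted values:", np.isclose((z - z_hat) @ z_hat, 0.0))
```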
Given
$$SSE := \min_\beta SS = e^Te,$$
the expansion 4, with $\varsigma$ in place of $z$ and $\delta$ in place of $\beta$, can be rewritten as
$$SSE = (\varsigma - Q\delta)^T(\varsigma - Q\delta) = \varsigma^T\varsigma - 2\delta^TQ^T\varsigma + \delta^TC\delta = \varsigma^T\varsigma - \delta^TC\delta,$$
where we have considered that $Q^T\varsigma = C\delta$ thanks to eq. 9. Given $SSR := \delta^TC\delta$, which, like $SS$ and $SSE$, is non-negative, the preceding equation becomes
$$SSE = SS - SSR,$$
whose interpretation is that the error in the estimation of the parameters, yielding a nonzero $SSR$, reduces the sum of squares $SS$ which could have been computed with the expectation value $\langle z\rangle = Q\beta$. The average of $SSE$ can be easily computed considering that $\delta^TC\delta = \mathrm{Tr}[\delta\delta^TC]$, and then
$$\langle SSE\rangle = \langle\varsigma^T\varsigma\rangle - \mathrm{Tr}[\langle\delta\delta^T\rangle C] = n - \mathrm{Tr}(VC) = n - p =: \nu,$$
known as the number of degrees of freedom.

Notation. In the following, $s_{xx,w}$, $S_{xx,w}$ and $s_{xy,w}$ indicate respectively the weighted sample variance, the weighted sum of squares and the weighted covariance, defined from the weighted sample mean $\bar{y}_w := \sum_i w_iy_i / \sum_i w_i$ in a manner analogous to the corresponding unweighted quantities. We recall that their expressions are $s_{xx,w} = \overline{x^2}_w - \bar{x}_w^2$, $S_{xx,w} = s_{xx,w}\sum_i w_i$ and $s_{xy,w} = \overline{xy}_w - \bar{x}_w\bar{y}_w$, where $xy := (x_1y_1, \ldots, x_ny_n)$.

3 Indicators for the Goodness of Fit

Besides reporting the best-fit parameters and the resulting fitted values, it is customary to give compact indicators of the goodness of fit. A method widely used in the analysis of experimental data consists in the chi-squared test: the hypothesis that the model is correct is not rejected, at the appropriate level of significance, if $SSE$ assumes values close to $\langle SSE\rangle$, i.e., for any number of parameters, if $\chi^2_r = SSE/\nu$ is close to 1. Values of $\chi^2_r$ larger or smaller than 1 are then considered as indicators of a poor fit or of overfitting, respectively.

A different approach considers weighted sample means. Defining the weighted coefficient of determination $R^2_w$ as the square of the weighted sample correlation coefficient $s_{y\hat{y},w}/\sqrt{s_{yy,w}s_{\hat{y}\hat{y},w}}$ between the data $y$ and the fitted values $\hat{y} = X\hat{\beta}$, and thus limited by $0 \le R^2_w \le 1$, one has that
$$1 - R^2_w = \frac{SSE}{S_{yy,w}} = \frac{s_{ee}}{s_{yy,w}}, \qquad (11)$$
showing that $R^2_w = 1$ iff $SSE = 0$, i.e. iff all residuals are zero. Therefore the greater the value of $R^2_w$, the better the agreement. Eq. 11 can be proven thanks to the orthogonality relation discussed above. The vector $\bar{y}_w w^{1/2}$, where $w^{1/2}$ is a column vector of elements $w_i^{1/2}$, is orthogonal to the vector of residuals $z - \hat{z}$, by virtue of eq. 8. Therefore the orthogonality of residuals and fitted values, derived at the end of Section 2, still holds if the fitted values are translated by $\bar{y}_w w^{1/2}$. The vector relationship
$$z - \bar{y}_w w^{1/2} = (\hat{z} - \bar{y}_w w^{1/2}) + (z - \hat{z}), \qquad (12)$$
graphically sketched in Figure 1, allows one to assess that
$$S_{yy,w} = S_{\hat{y}\hat{y},w} + SSE, \qquad (13)$$
whose interpretation is that $S_{\hat{y}\hat{y},w}/S_{yy,w}$ is the fraction of variability of the data explained by the knowledge of $Q$, i.e. by the regression, and $SSE/S_{yy,w}$ is the unexplained one, i.e. that coming from errors. Still from eq. 12 one gets
$$S_{\hat{y}y,w} = (\hat{z} - \bar{y}_w w^{1/2})^T(z - \bar{y}_w w^{1/2}) = (\hat{z} - \bar{y}_w w^{1/2})^T(\hat{z} - \bar{y}_w w^{1/2}) = S_{\hat{y}\hat{y},w} \qquad (14)$$
and then
$$R^2_w = \frac{S^2_{\hat{y}y,w}}{S_{\hat{y}\hat{y},w}S_{yy,w}} = \frac{S_{\hat{y}\hat{y},w}}{S_{yy,w}}. \qquad (15)$$
Insertion of eq. 15 in eq. 13 readily gives eq. 11.

In order to discourage the introduction of models too complicated for the data examined, the adjusted coefficient of determination
$$R^2_a = 1 - (1 - R^2_w)\frac{n-1}{n-p}$$
has been introduced, obtained by substituting the unbiased variances in the rhs of eq. 11.
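As a teaching aid, the indicators above can be evaluated in a few lines. The sketch below (Python with NumPy; the data values and the helper names wls_fit and goodness_of_fit are illustrative assumptions) computes $\chi^2_r$, $R^2_w$ from eq. 11 and the adjusted $R^2_a$.

```python
import numpy as np

def wls_fit(X, y, w):
    """Weighted least squares estimate beta_hat = (Q^T Q)^{-1} Q^T z (eq. 6)."""
    Q = X * np.sqrt(w)[:, None]
    z = y * np.sqrt(w)
    return np.linalg.solve(Q.T @ Q, Q.T @ z)

def goodness_of_fit(X, y, w, beta_hat):
    """Reduced chi-squared chi2_r = SSE/nu, weighted R^2 (eq. 11) and adjusted R^2."""
    n, p = X.shape
    y_hat = X @ beta_hat
    sse = np.sum(w * (y - y_hat) ** 2)            # SSE expressed in the original variables
    chi2_r = sse / (n - p)                        # close to 1 for a good fit
    y_bar_w = np.sum(w * y) / np.sum(w)           # weighted sample mean
    S_yy_w = np.sum(w * (y - y_bar_w) ** 2)       # weighted total sum of squares
    R2_w = 1.0 - sse / S_yy_w                     # eq. 11
    R2_a = 1.0 - (1.0 - R2_w) * (n - 1) / (n - p) # adjusted coefficient of determination
    return chi2_r, R2_w, R2_a

# Small illustrative data set (values are arbitrary assumptions)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
sigma = np.array([0.2, 0.2, 0.3, 0.3, 0.4])
w = sigma ** -2
X = np.column_stack([np.ones_like(x), x])

beta_hat = wls_fit(X, y, w)
print(goodness_of_fit(X, y, w, beta_hat))
```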
It often happens that the standard deviations of experimental data are only approximately known. A common assumption is that the standard deviations $\sigma_i$ are known but for a factor $k$: $\sigma_i = k\tilde{\sigma}_i$, with the $\tilde{\sigma}_i$ known a priori. If the adjustment of $k$ leads to a good fit of the model, $SSE$ should be close to its expectation $\nu$. Setting $SSE = \nu$, one gets
$$\nu = \sum_{i=1}^{n}\frac{(y_i - \hat{y}_i)^2}{k^2\tilde{\sigma}_i^2},$$
and a trial value for $k$ is obtained as
$$k = \sqrt{\frac{1}{\nu}\sum_i\frac{(y_i - \hat{y}_i)^2}{\tilde{\sigma}_i^2}}.$$

[Figure 1: The residuals $z - \hat{z}$ are orthogonal to both the estimates $\hat{z}$ and the vector $\bar{y}_w w^{1/2}$.]

4 Basic Applications

4.1 (Weighted) mean

The model $y = \beta_1 + \varepsilon$ has an $n \times 1$ matrix of relative regressors, whose $i$-th element is $q_{i1} = w_i^{1/2}$. Application of eq. 6 immediately gives as the best-fit parameter the weighted mean
$$\hat{\beta} = VQ^Tz = \frac{\sum_i w_iy_i}{\sum_i w_i} = \bar{y}_w,$$
and its variance is the reciprocal of the sum of the weights: $\sigma^2_{\hat{\beta}} = V_{11} = \left(\sum_i w_i\right)^{-1}$.

4.2 WLS for a straight line

The standard linear regression considers the model $y = a + bx$. In the above notation $a = \beta_1$, $b = \beta_2$ and the regressor matrix is
$$X = \begin{bmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_n \end{bmatrix}^T.$$
The matrix of relative regressors will then be
$$Q = \begin{bmatrix} \sqrt{w_1} & \sqrt{w_2} & \ldots & \sqrt{w_n} \\ \sqrt{w_1}x_1 & \sqrt{w_2}x_2 & \ldots & \sqrt{w_n}x_n \end{bmatrix}^T,$$
the vector of relative data
$$z = \begin{bmatrix} \sqrt{w_1}y_1 & \sqrt{w_2}y_2 & \ldots & \sqrt{w_n}y_n \end{bmatrix}^T,$$
and
$$C = \sum_i w_i\begin{bmatrix} 1 & \bar{x}_w \\ \bar{x}_w & \overline{x^2}_w \end{bmatrix},$$
whose inverse gives the covariance matrix of the parameters
$$V = \frac{1}{S_{xx,w}}\begin{bmatrix} \overline{x^2}_w & -\bar{x}_w \\ -\bar{x}_w & 1 \end{bmatrix}.$$
The standard deviations of the estimators of the parameters will then be
$$\begin{bmatrix} \sigma_{\hat{a}} \\ \sigma_{\hat{b}} \end{bmatrix} = \begin{bmatrix} \sqrt{V_{11}} \\ \sqrt{V_{22}} \end{bmatrix} = \frac{1}{\sqrt{S_{xx,w}}}\begin{bmatrix} \sqrt{\overline{x^2}_w} \\ 1 \end{bmatrix},$$
and the estimated parameters will be
$$\begin{bmatrix} \hat{a} \\ \hat{b} \end{bmatrix} = VQ^Tz = \frac{1}{s_{xx,w}}\begin{bmatrix} \overline{x^2}_w & -\bar{x}_w \\ -\bar{x}_w & 1 \end{bmatrix}\begin{bmatrix} \bar{y}_w \\ \overline{xy}_w \end{bmatrix} = \begin{bmatrix} \bar{y}_w - \frac{s_{xy,w}}{s_{xx,w}}\bar{x}_w \\ \frac{s_{xy,w}}{s_{xx,w}} \end{bmatrix},$$
which in the case of all equal weights (homoskedastic regression) have the simpler expression $\begin{bmatrix} \bar{y} - \frac{s_{xy}}{s_{xx}}\bar{x} & \frac{s_{xy}}{s_{xx}} \end{bmatrix}^T$.

4.3 Revised simple linear regression

We now give a simplified approach for the bivariate weighted linear regression: given $\mathbf{1} := [1\ 1\ \ldots\ 1]^T$, we subtract $\bar{y}_w\mathbf{1}$ from the data and from the fitted data and, considering that $\bar{y}_w = \bar{\hat{y}}_w = a + b\bar{x}_w$, we obtain
$$y - \bar{y}_w\mathbf{1} = b(x - \bar{x}_w\mathbf{1}) + \varepsilon, \qquad (16)$$
which, with $z := W^{1/2}(y - \bar{y}_w\mathbf{1})$, $q := W^{1/2}(x - \bar{x}_w\mathbf{1})$ and $\varsigma := W^{1/2}\varepsilon$, can be written as in eq. 2, $z = bq + \varsigma$, but here there is the single parameter $b$ to be determined, as in the example of the weighted mean. This means that the matrix $C$ is the scalar $S_{xx,w}$, readily invertible, and then $V = C^{-1} = 1/S_{xx,w}$. On the other hand, since
$$q^Tz = (x - \bar{x}_w\mathbf{1})^TW(y - \bar{y}_w\mathbf{1}) = S_{xy,w} - \bar{y}_w\sum_i w_i(x_i - \bar{x}_w) = S_{xy,w},$$
from eq. 6 one gets again $\hat{b} = s_{xy,w}/s_{xx,w}$. Writing now the model as $y - bx = a\mathbf{1} + \varepsilon$, the example in 4.1 gives for the intercept $\bar{y}_w - b\bar{x}_w$, from which, replacing $b$ with its estimator (implicit use is made of the functional invariance of the estimator $\hat{b}$), one finally gets $\hat{a} = \bar{y}_w - \hat{b}\bar{x}_w$, as in 4.2. It is to be considered, however, that this simplification leads to losing the information on the covariance of the parameters $a$ and $b$, which should then be recovered ex post (see the Appendix).
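The equivalence between the matrix solution of eq. 6 and the closed-form weighted formulas of Sections 4.2 and 4.3 can be checked numerically. A minimal sketch in Python/NumPy follows; the data-generating values are arbitrary assumptions used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative heteroskedastic data for y = a + b*x (true a, b and noise model are assumptions)
n = 30
x = rng.uniform(0.0, 5.0, n)
sigma = rng.uniform(0.1, 0.5, n)
y = 1.5 + 0.8 * x + rng.normal(0.0, sigma)
w = sigma ** -2

# Matrix solution, eq. 6
X = np.column_stack([np.ones(n), x])
Q = X * np.sqrt(w)[:, None]
z = y * np.sqrt(w)
a_hat, b_hat = np.linalg.solve(Q.T @ Q, Q.T @ z)

# Closed-form weighted formulas of Section 4.2
def wmean(v, w):
    return np.sum(w * v) / np.sum(w)

x_bar, y_bar = wmean(x, w), wmean(y, w)
s_xx = wmean(x * x, w) - x_bar ** 2            # weighted sample variance
s_xy = wmean(x * y, w) - x_bar * y_bar         # weighted covariance
b_closed = s_xy / s_xx
a_closed = y_bar - b_closed * x_bar

print(np.isclose(a_hat, a_closed), np.isclose(b_hat, b_closed))  # both True
```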
4.4 Resampling and the Best-fit Parameters

A remarkable representation of the $p$ best-fit parameters can be obtained if one tries to determine them from the $\binom{n}{p}$ $p$-element subsets of the original set of $n$ measures [4]. Let $S_{(s)}$ be a $p \times n$ matrix obtained from the $n \times n$ identity matrix upon selecting the $p$ rows whose indices form subset $s$, with $s = 1, \ldots, \binom{n}{p}$. Let also $M^{[k|v]}$ be the matrix obtained from matrix $M$ upon replacing its $k$-th column with the vector $v$.

For any $p$-element subset $s$, the data needed for the WLS are stored in the vector $z_{(s)} = S_{(s)}z$ and the square matrix $Q_{(s)} = S_{(s)}Q$; the best-fit parameters are
$$\hat{\beta}_{(s)} = Q_{(s)}^{-1}z_{(s)} = X_{(s)}^{-1}W_{(s)}^{-1/2}W_{(s)}^{1/2}y_{(s)} = X_{(s)}^{-1}y_{(s)}, \qquad (17)$$
which shows that, for $p$ measures, WLS and OLS give the same results. Use of Cramer's rule on eqs. 5 and 17 gives
$$\hat{\beta}_k = \frac{\det\big(Q^TQ^{[k|z]}\big)}{\det\big(Q^TQ\big)} \qquad (18)$$
and
$$\hat{\beta}_{(s)k} = \frac{\det Q_{(s)}^{[k|z_{(s)}]}}{\det Q_{(s)}} = \frac{\det X_{(s)}^{[k|y_{(s)}]}}{\det X_{(s)}}. \qquad (19)$$
Use of the Cauchy-Binet theorem to expand the determinants of eq. 18 leads to
$$\hat{\beta}_k = \frac{\sum_s \det Q_{(s)}\det Q_{(s)}^{[k|z_{(s)}]}}{\sum_s \det Q_{(s)}\det Q_{(s)}} = \frac{\sum_s w_s\hat{\beta}_{(s)k}}{\sum_s w_s}, \qquad (20)$$
which is the equation for a weighted average of the OLS results $\hat{\beta}_{(s)k}$ with weights
$$w_s = \big(\det Q_{(s)}\big)^2. \qquad (21)$$
The above representation of the best-fit parameters is the starting point for robust modifications of WLS, where the basic idea is to exclude from the mean the more extreme values of $\hat{\beta}_{(s)k}$ [7].
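Eq. 20 lends itself to a direct numerical check: the full WLS estimate should equal the weighted average of the estimates obtained from all $p$-element subsets. The following sketch (Python with NumPy and itertools; the small data set is an illustrative assumption) enumerates the subsets explicitly.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

# Small data set so that all C(n, p) subsets can be enumerated (values are assumptions)
n, p = 6, 2
x = np.array([0.5, 1.0, 2.0, 3.0, 4.5, 6.0])
sigma = np.array([0.2, 0.3, 0.2, 0.4, 0.3, 0.5])
y = 1.0 + 0.7 * x + rng.normal(0.0, sigma)
w = sigma ** -2

X = np.column_stack([np.ones(n), x])
Q = X * np.sqrt(w)[:, None]
z = y * np.sqrt(w)

# Full WLS solution (eq. 6)
beta_full = np.linalg.solve(Q.T @ Q, Q.T @ z)

# Weighted average of the subset estimators (eqs. 17, 20, 21)
num = np.zeros(p)
den = 0.0
for s in combinations(range(n), p):
    Qs, Xs, ys = Q[list(s)], X[list(s)], y[list(s)]
    ws = np.linalg.det(Qs) ** 2            # weight of subset s, eq. 21
    beta_s = np.linalg.solve(Xs, ys)       # eq. 17: OLS = WLS for p measures
    num += ws * beta_s
    den += ws

print(np.allclose(beta_full, num / den))   # True: eq. 20 holds
```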
5 Conclusions

The least squares method, a fundamental piece of knowledge for students of all scientific tracks, is often introduced by considering the simple linear regression with only two parameters to be determined. However, the availability of ever larger data sets prompts even undergraduate students toward a sounder and wider knowledge of linear regression. Here, we have used the linear algebra formalism to compact the main results of the least squares method, encompassing ordinary and weighted least squares, goodness-of-fit indicators, and finally a basic equation of resampling, which could be used to stimulate interested students toward an even broader knowledge of data analysis. The compactness of the equations reported above allows their introduction at the undergraduate level, provided that basic linear algebra has been previously introduced.

Acknowledgements

Financial support from the MIUR (FARB2015) is gratefully acknowledged.

References

[1] R. J. Carroll and D. Ruppert. Transformation and Weighting in Regression. Chapman and Hall, 1988.
[2] G. Casella and R. L. Berger. Statistical Inference. Cengage Learning, 2001.
[3] A. J. Dobson and A. G. Barnett. An Introduction to Generalised Linear Models. Chapman and Hall, 2008.
[4] R. W. Farebrother. Relations among subset estimators: A bibliographical note. Technometrics, 27(1):85-86, 1985.
[5] W. H. Greene. Econometric Analysis. Prentice Hall, 2012.
[6] J. R. Taylor. An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements. University Science Books, 1997.
[7] C. F. J. Wu. Jackknife, bootstrap and other resampling methods in regression analysis. The Annals of Statistics, 14(4):1261-1295, 1986.

Appendix

Moments of $\hat{a}$ and $\hat{b}$

Averages. Since $\hat{b} = S_{xy,w}/S_{xx,w}$ with $S_{xy,w} = (x - \bar{x}_w\mathbf{1})^TW(y - \bar{y}_w\mathbf{1})$, we have
$$\langle\hat{b}\rangle = \frac{(x - \bar{x}_w\mathbf{1})^TW\langle y - \bar{y}_w\mathbf{1}\rangle}{S_{xx,w}} = \frac{b\,(x - \bar{x}_w\mathbf{1})^TW(x - \bar{x}_w\mathbf{1})}{S_{xx,w}} = b,$$
where we used $\langle y - \bar{y}_w\mathbf{1}\rangle = b(x - \bar{x}_w\mathbf{1})$;
$$\langle\hat{a}\rangle = \langle\bar{y}_w\rangle - \bar{x}_w\langle\hat{b}\rangle = a + b\bar{x}_w - \bar{x}_wb = a.$$
The estimators are then unbiased.

Variances. We shall use the following auxiliary results:
i) $\mathrm{Cov}(y_i, y_j) = \delta_{ij}w_i^{-1}$;
ii) $\mathrm{Var}(\bar{y}_w) = \mathrm{Tr}(W)^{-1}$;
iii) $\mathrm{Cov}(\bar{y}_w, \hat{b}) = 0$.
Given $d := W(x - \bar{x}_w\mathbf{1})$, we have that $d^T\mathbf{1} = \sum_i d_i = 0$ and then $S_{xy,w} = d^T(y - \bar{y}_w\mathbf{1}) = d^Ty$. Then
$$\mathrm{Var}(S_{xy,w}) = \sum_{ij}d_id_j\,\mathrm{Cov}(y_i, y_j) = \sum_i d_i^2w_i^{-1} = \sum_i w_i(x_i - \bar{x}_w)^2 = S_{xx,w},$$
from which
$$\mathrm{Var}(\hat{b}) = \frac{\mathrm{Var}(S_{xy,w})}{S_{xx,w}^2} = \frac{1}{S_{xx,w}},$$
$$\mathrm{Var}(\hat{a}) = \mathrm{Var}(\bar{y}_w) - 2\bar{x}_w\,\mathrm{Cov}(\bar{y}_w, \hat{b}) + \bar{x}_w^2\,\mathrm{Var}(\hat{b}) = \mathrm{Tr}(W)^{-1} + \frac{\bar{x}_w^2}{S_{xx,w}} = \frac{\overline{x^2}_w}{S_{xx,w}},$$
where the last equality follows from $S_{xx,w} = (\overline{x^2}_w - \bar{x}_w^2)\,\mathrm{Tr}(W)$. Finally,
$$\mathrm{Cov}(\hat{a}, \hat{b}) = \mathrm{Cov}(\bar{y}_w - \bar{x}_w\hat{b},\, \hat{b}) = \mathrm{Cov}(\bar{y}_w, \hat{b}) - \bar{x}_w\,\mathrm{Var}(\hat{b}) = -\frac{\bar{x}_w}{S_{xx,w}}.$$

Proof of the auxiliary results.
i) $\mathrm{Cov}(y_i, y_j) = \mathrm{Cov}(a + bx_i + \varepsilon_i,\, a + bx_j + \varepsilon_j) = \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = \delta_{ij}\sigma_i^2 = \delta_{ij}w_i^{-1}$.
ii) $\mathrm{Var}(\bar{y}_w) = \mathrm{Tr}(W)^{-2}\sum_{ij}w_iw_j\,\mathrm{Cov}(y_i, y_j) = \mathrm{Tr}(W)^{-2}\sum_i w_i = \mathrm{Tr}(W)^{-1}$.
iii) $\mathrm{Tr}(W)\,\mathrm{Cov}(\bar{y}_w, S_{xy,w}) = \sum_{ij}w_id_j\,\mathrm{Cov}(y_i, y_j) = \sum_i d_i = 0$, and then $\mathrm{Cov}(\bar{y}_w, \hat{b}) = 0$.
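The moments derived in the Appendix can also be checked by simulation. The following Monte Carlo sketch in Python/NumPy is an illustration only: the grid $x$, the $\sigma_i$ and the true parameters are assumed values, and the empirical variances and covariance should approach $\overline{x^2}_w/S_{xx,w}$, $1/S_{xx,w}$ and $-\bar{x}_w/S_{xx,w}$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check of the Appendix formulas (data grid, sigmas and true parameters are assumptions)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
sigma = np.array([0.3, 0.2, 0.4, 0.3, 0.5, 0.2])
w = sigma ** -2
a_true, b_true = 2.0, -0.4

def wmean(v):
    return np.sum(w * v) / np.sum(w)

x_bar = wmean(x)
S_xx = np.sum(w * (x - x_bar) ** 2)

# Theoretical moments: Var(a_hat), Var(b_hat) and Cov(a_hat, b_hat)
var_a_th = wmean(x * x) / S_xx
var_b_th = 1.0 / S_xx
cov_ab_th = -x_bar / S_xx

# Repeated WLS fits on simulated data, using the closed-form estimators of Section 4.2
n_rep = 50_000
a_hats, b_hats = np.empty(n_rep), np.empty(n_rep)
for r in range(n_rep):
    y = a_true + b_true * x + rng.normal(0.0, sigma)
    y_bar = wmean(y)
    b_hats[r] = np.sum(w * (x - x_bar) * (y - y_bar)) / S_xx
    a_hats[r] = y_bar - b_hats[r] * x_bar

print(np.var(b_hats), var_b_th)
print(np.var(a_hats), var_a_th)
print(np.cov(a_hats, b_hats)[0, 1], cov_ab_th)
```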