1.2 Least squares estimation
The regression coefficients \(\beta_0, \ldots, \beta_p\) describe the pattern by which the response depends on the explanatory variables. We use the observed data \(y_1, \ldots, y_n\) to estimate this pattern of dependence.
In least squares estimation, roughly speaking, we choose \(\hat \beta,\) the estimates of \(\beta,\) to make the estimated means \(\hat{E}(Y)=X\hat \beta\) as close as possible to the observed values \(y\), i.e. \(\hat \beta\) minimises the sum of squares \[\begin{align*} \sum_{i=1}^n [y_i-E(Y_i)]^2&=\sum_{i=1}^n \left(y_i-x_i^T\beta\right)^2 \\ &=\sum_{i=1}^n \left(y_i-\sum_{j=0}^p x_{ij}\beta_j\right)^2 \end{align*}\] as a function of \(\beta_0, \ldots, \beta_p\).
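For example, the sum of squares can be evaluated directly as a function of candidate coefficient vectors. The sketch below does this on a small simulated data set (Python/numpy, with purely illustrative choices of \(X\), \(\beta\) and the error standard deviation; none of these values come from the notes); the minimiser found by `np.linalg.lstsq` gives a smaller value than any other candidate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulated data: n = 50 observations, p = 2 explanatory
# variables plus an intercept column, so X is n x (p + 1).
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

def sum_of_squares(beta):
    """S(beta) = sum_i (y_i - x_i^T beta)^2."""
    resid = y - X @ beta
    return resid @ resid

# The least squares estimate minimises S(beta); any other beta gives a larger value.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(sum_of_squares(beta_hat))            # minimum value of the sum of squares
print(sum_of_squares(np.zeros(p + 1)))     # an arbitrary poor candidate: much larger
```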
Exercise 1: By differentiating the sum of squares above w.r.t. \(\beta_k,\;k=0,\ldots ,p\), show that \[X^TX\hat{\beta}=X^Ty \]
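One possible route, sketched with the same notation as above (the exercise asks for the details): differentiating term by term gives \[ \frac{\partial}{\partial \beta_k}\sum_{i=1}^n \left(y_i-\sum_{j=0}^p x_{ij}\beta_j\right)^2 = -2\sum_{i=1}^n x_{ik}\left(y_i-\sum_{j=0}^p x_{ij}\beta_j\right), \qquad k = 0, \ldots, p, \] and setting all \(p+1\) derivatives to zero at \(\beta=\hat\beta\) gives \(\sum_{i=1}^n x_{ik}(y_i-x_i^T\hat\beta)=0\) for each \(k\), i.e. \(X^T(y-X\hat\beta)=0\), which rearranges to the normal equations.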
The least squares estimates \(\hat\beta\) are the solutions to this set of \(p+1\) simultaneous linear equations, which are known as the normal equations. If \(X^TX\) is invertible (as it usually is) then the least squares estimates are given by \[ \hat{\beta}=(X^TX)^{-1}X^Ty. \]

The corresponding fitted values are \[\begin{align*} &\qquad\hat{y}=X\hat{\beta}=X(X^TX)^{-1}X^Ty \\ \Rightarrow &\qquad \hat{y}_i=x_i^T\hat{\beta}\qquad i = 1, \ldots, n. \end{align*}\] We define the hat matrix by \(H=X(X^TX)^{-1}X^T\), so \(\hat{y}=Hy\).

The residuals are \[\begin{align*} &\qquad{e}=y-\hat{y}=y-X\hat{\beta}=(I_n-H)y \\ \Rightarrow &\qquad e_i=y_i-x_i^T\hat{\beta}\qquad i = 1, \ldots, n. \end{align*}\] The residuals describe the variability in the observed responses \(y_1, \ldots, y_n\) which has not been explained by the linear model.

The residual sum of squares or deviance for a linear model is defined to be \[ D=\sum_{i=1}^n e_i^2 =\sum_{i=1}^n \left(y_i-x_i^T\hat{\beta}\right)^2. \]
The deviance is the minimum value attained by the sum of squares in least squares estimation.
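These quantities can all be computed directly. The sketch below does so on simulated data (Python/numpy again, with an illustrative design that is not taken from the notes); the normal equations are solved with `np.linalg.solve`, and the hat matrix is formed explicitly only to check that \(\hat y = Hy\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative simulated design and response (assumed values, not from the notes).
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Least squares estimate from the normal equations X^T X beta_hat = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix, fitted values, residuals and deviance.
H = X @ np.linalg.inv(X.T @ X) @ X.T      # H = X (X^T X)^{-1} X^T
y_hat = H @ y                             # fitted values, also X @ beta_hat
e = y - y_hat                             # residuals, also (I_n - H) @ y
D = np.sum(e**2)                          # residual sum of squares / deviance

print(np.allclose(y_hat, X @ beta_hat))   # True
print(D)
```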
Properties of the least squares estimator
Exercise 2: Show that \(\hat{\beta}\) is multivariate normal with \(E(\hat{\beta})=\beta\) and \(Var(\hat{\beta})=\sigma^2(X^T X)^{-1}\).
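This result can be checked empirically: for a fixed design \(X\) and repeated draws of i.i.d. N\((0,\sigma^2)\) errors, the empirical mean and covariance of \(\hat\beta\) should be close to \(\beta\) and \(\sigma^2(X^TX)^{-1}\). A minimal simulation sketch (illustrative values of \(X\), \(\beta\) and \(\sigma\) only):

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed design with an intercept (illustrative values, not from the notes).
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
sigma = 0.5

# Repeatedly simulate y = X beta + eps and re-estimate beta by least squares.
reps = 20000
estimates = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

# Empirical mean and covariance should be close to beta and sigma^2 (X^T X)^{-1}.
print(estimates.mean(axis=0))              # approx. beta
print(np.cov(estimates, rowvar=False))     # approx. sigma^2 * inv(X^T X)
print(sigma**2 * np.linalg.inv(X.T @ X))
```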
Assuming that \(\epsilon_1, \ldots, \epsilon_n\) are i.i.d. N\((0,\sigma^2)\), the least squares estimate \(\hat{\beta}\) is also the maximum likelihood estimate. This follows from the likelihood for a linear model \[\begin{equation} f_{Y}(y;\beta,\sigma^2)=\left(2\pi\sigma^2\right)^{-{n\over 2}} \exp\left(-{1\over{2\sigma^2}} \sum_{i=1}^n (y_i-x_i^T\beta)^2\right), \tag{1.4} \end{equation}\] since for any fixed \(\sigma^2\) the likelihood depends on \(\beta\) only through the sum of squares \(\sum_{i=1}^n (y_i-x_i^T\beta)^2\), which appears with a negative sign in the exponent; maximising the likelihood over \(\beta\) is therefore equivalent to minimising the sum of squares.
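As a numerical check, maximising the log of (1.4) over \(\beta\) (with \(\sigma^2\) held fixed) recovers the least squares solution. A sketch using `scipy.optimize.minimize` on simulated data (illustrative values only; the fixed \(\sigma^2\) is arbitrary since it does not affect the maximising \(\beta\)):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Simulated data (illustrative only).
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.7, size=n)

sigma2 = 0.7**2  # held fixed: the maximising beta does not depend on it

def neg_log_lik(beta):
    """Negative log of (1.4) as a function of beta, for fixed sigma^2."""
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

beta_ml = minimize(neg_log_lik, x0=np.zeros(2)).x
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_ml, beta_ls, atol=1e-4))   # True: the two estimates agree
```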