\( \newcommand{\bm}[1]{\boldsymbol{\mathbf{#1}}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\corr}{corr} \newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\nindep}{\perp\!\!\!\perp\!\!\!\!\!\!/\;\;} \)

1.7 Model checking

Confidence intervals and hypothesis tests for linear models may be unreliable if the model assumptions are not all satisfied. In particular, we have made four assumptions about the distribution of \(Y_1, \ldots, Y_n\):

  1. The model correctly describes the relationship between \(E(Y_i)\) and the explanatory variables.
  2. \(Y_1, \ldots, Y_n\) are normally distributed.
  3. \(\var(Y_1)=\var(Y_2)=\cdots =\var(Y_n)\).
  4. \(Y_1, \ldots, Y_n\) are independent random variables.

These assumptions can be checked using plots of raw or standardised residuals.
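Before turning to the individual checks, here is a minimal sketch of how such residuals might be computed, using NumPy and simulated data (the data, sample size, and coefficient values are purely illustrative). The standardised residuals divide each raw residual by its estimated standard deviation, \(r_i = e_i/(s\sqrt{1-h_{ii}})\), where \(h_{ii}\) is the leverage discussed at the end of this section.

```python
import numpy as np

# Hypothetical data: n observations, design matrix X (n x p) with an
# intercept column, and response vector y.
rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([2.0, 1.0, -0.5])
y = X @ beta + rng.normal(scale=0.5, size=n)

# Least squares fit and raw residuals e_i = y_i - x_i^T beta_hat.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
e = y - fitted

# Hat matrix H = X (X^T X)^{-1} X^T; its diagonal gives the leverages h_ii.
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# Residual variance estimate s^2 and standardised residuals
# r_i = e_i / (s * sqrt(1 - h_ii)).
s2 = e @ e / (n - p)
r = e / np.sqrt(s2 * (1 - h))
```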

  1. If a plot of the residuals against the values of a potential explanatory variable reveals a pattern, then this suggests that the explanatory variable, or perhaps some function of it, should be included in the model.

  2. A simple check for non-normality is obtained using a normal probability plot of the ordered residuals. The points should lie close to a straight line; systematic curvature suggests a departure from normality.

  3. A simple check for non-constant variance is obtained by plotting the residuals \(r_1, \ldots, r_n\) against the corresponding fitted values \(x_i^T\hat{\beta},\quad i = 1, \ldots, n\). The plot should look like a random scatter. In particular, check for any behaviour which suggests that the error variance increases as a function of the mean (‘funnelling’ in the residual plot).

  4. In general, independence is difficult to validate, but where observations have been collected in serial order, serial correlation may be detected by a lagged scatterplot or a correlogram of the residuals. (All four checks are illustrated in the sketch below.)
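The four checks above can be produced with standard plotting tools. The following sketch is one possible layout using Matplotlib and SciPy; the function name and figure arrangement are illustrative, not prescriptive.

```python
import matplotlib.pyplot as plt
from scipy import stats

def diagnostic_plots(x, fitted, r):
    """Four residual checks: x is one explanatory variable, fitted the
    fitted values x_i^T beta_hat, and r the standardised residuals,
    all as computed in the earlier sketch."""
    fig, axes = plt.subplots(2, 2, figsize=(8, 8))

    # 1. Residuals against an explanatory variable: look for a pattern.
    axes[0, 0].scatter(x, r)
    axes[0, 0].set(xlabel="explanatory variable", ylabel="standardised residual")

    # 2. Normal probability plot: points should lie close to a straight line.
    stats.probplot(r, dist="norm", plot=axes[0, 1])

    # 3. Residuals against fitted values: look for funnelling.
    axes[1, 0].scatter(fitted, r)
    axes[1, 0].set(xlabel="fitted value", ylabel="standardised residual")

    # 4. Lagged scatterplot (for serially ordered data): look for correlation.
    axes[1, 1].scatter(r[:-1], r[1:])
    axes[1, 1].set(xlabel="residual i", ylabel="residual i + 1")

    plt.show()
```

With the arrays from the earlier sketch, this could be called as, for instance, `diagnostic_plots(X[:, 1], fitted, r)`.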

Another place where residual diagnostics are useful is in assessing influence. An observation is influential if deleting it would substantially change the estimates of the model parameters. Cook’s distance \(C_j\) is a measure of the change in \(\hat\beta\) when observation \(j\) is omitted from the data set, \[ C_j=\frac{\sum_{i=1}^n \left(\hat{y}^{(j)}_i-\hat{y}_i\right)^2}{ps^2}, \] where \(\hat{y}^{(j)}_i\) is the fitted value for observation \(i\), calculated using the least squares estimates obtained from the modified data set with the \(j\)th observation deleted. A rule of thumb is that values of \(C_j\) greater than \(8/(n-2p)\) indicate influential points.

It can be shown that \[ C_j=\frac{r_j^2h_{jj}}{p(1-h_{jj})}, \] where \(r_j\) is the standardised residual and \(h_{jj}\) is the \(j\)th diagonal element of the hat matrix \(H = X(X^TX)^{-1}X^T\). Influential points therefore have either a large standardised residual (an unusual \(Y\) value) or a large \(h_{jj}\). The quantity \(h_{jj}\) is called the leverage and measures how unusual the explanatory data for the \(j\)th observation are, relative to the other values in the data set.
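As a closing sketch, Cook’s distances and leverages can be computed directly from the closed-form expression above. The function below and the rule-of-thumb comparison are illustrative (NumPy assumed, with `X` including an intercept column).

```python
import numpy as np

def cooks_distances(X, y):
    """Cook's distances via the closed form C_j = r_j^2 h_jj / (p (1 - h_jj)),
    for a design matrix X (n x p) and response vector y."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat
    h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))   # leverages h_jj
    s2 = e @ e / (n - p)                             # residual variance estimate
    r = e / np.sqrt(s2 * (1 - h))                    # standardised residuals
    C = r**2 * h / (p * (1 - h))
    flagged = np.flatnonzero(C > 8 / (n - 2 * p))    # rule-of-thumb threshold
    return C, flagged
```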