3  Linear Models

3.1 The linear model

In practical applications, we often distinguish between a response variable and a group of explanatory variables. The aim is to determine the pattern of dependence of the response variable on the explanatory variables. We denote the \(n\) observations of the response variable by \(\boldsymbol{y}=(y_1,y_2,\ldots ,y_n)^T\). These are assumed to be observations of random variables \(\boldsymbol Y=(Y_1,Y_2,\ldots ,Y_n)^T\). Associated with each \(y_i\) is a vector \(\boldsymbol{x}_i=(x_{i1},x_{i2},\ldots ,x_{ip})^T\) of values of \(p\) explanatory variables.

In a linear model, we assume that \[ \begin{aligned} Y_i&= \beta_1 x_{i1} +\beta_2 x_{i2} +\ldots + \beta_p x_{ip} + \epsilon_i \cr &= \sum_{j=1}^p x_{ij} \beta_j + \epsilon_i \cr &= \boldsymbol{x}_i^T\boldsymbol{\beta} + \epsilon_i \cr &= [\boldsymbol{X}\boldsymbol{\beta}]_i + \epsilon_i,\qquad i=1,\ldots ,n \end{aligned} \tag{3.1}\]

where \(\epsilon_i \sim N(0, \sigma^2)\) independently, \[\boldsymbol{X}= \begin{pmatrix} \boldsymbol{x}_1^T \\ \vdots \\ \boldsymbol{x}_n^T \end{pmatrix} =\begin{pmatrix} x_{11} & \cdots &x_{1p}\cr \vdots & \ddots &\vdots\cr x_{n1} &\cdots &x_{np} \end{pmatrix}\] and \(\boldsymbol{\beta}=(\beta_1,\ldots ,\beta_p)^T\) is a vector of fixed but unknown parameters describing the dependence of \(Y_i\) on \(\boldsymbol{x}_i\). The four ways of describing the linear model in Equation 3.1 are equivalent, but the most economical is the matrix form

\[ \boldsymbol{Y}=\boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \tag{3.2}\]

where \(\boldsymbol{\epsilon}=(\epsilon_1,\epsilon_2,\ldots ,\epsilon_n)^T\).

The \(n\times p\) matrix \(\boldsymbol{X}\) consists of known (observed) constants and is called the design matrix. The \(i\)th row of \(\boldsymbol{X}\) is \(\boldsymbol{x}_i^T\), the explanatory data corresponding to the \(i\)th observation of the response. The \(j\)th column of \(\boldsymbol{X}\) contains the \(n\) observations of the \(j\)th explanatory variable.

The error vector \(\boldsymbol{\epsilon}\) has a multivariate normal distribution with mean vector \({\boldsymbol 0}\) and variance-covariance matrix \(\sigma^2\boldsymbol{I}\), since \(\text{Var}(\epsilon_i)=\sigma^2\) and \(\text{Cov}(\epsilon_i,\epsilon_j)=0\) for \(i \neq j\), as \(\epsilon_1, \ldots, \epsilon_n\) are independent of one another. It follows from Equation 3.2 that the distribution of \(\boldsymbol Y\) is multivariate normal with mean vector \(\boldsymbol{X}\boldsymbol{\beta}\) and variance-covariance matrix \(\sigma^2\boldsymbol{I}\), i.e. \(\boldsymbol Y\sim N(\boldsymbol{X}\boldsymbol{\beta},\sigma^2\boldsymbol{I})\).
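
As a concrete illustration, here is a minimal R sketch that simulates data from the model in Equation 3.2. The choices of \(n\), \(\boldsymbol\beta\) and \(\sigma\) are arbitrary and purely illustrative.

```r
set.seed(1)                              # for reproducibility
n <- 100
X <- cbind(1, rnorm(n), runif(n))        # design matrix: intercept plus two explanatory variables
beta <- c(2, -1, 0.5)                    # 'true' coefficients (illustrative values)
sigma <- 0.3                             # error standard deviation
eps <- rnorm(n, mean = 0, sd = sigma)    # epsilon_i ~ N(0, sigma^2), independently
y <- as.vector(X %*% beta + eps)         # Y = X beta + epsilon
```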

3.2 Examples of linear model structure

Example 3.1 (The null model) If we do not include any variables \(x_i\) in the model, we have \[ Y_i=\beta_0 + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad i = 1, \ldots, n, \] so \[ \boldsymbol{X}=\begin{pmatrix} 1\cr 1\cr \vdots \cr 1 \end{pmatrix}, \qquad \boldsymbol{\beta}=(\beta_0). \] There is one explanatory variable: the dummy variable, which is equal to one for every observation. In practice, this dummy variable is present in all models.

Example 3.2 (Simple linear regression) If we include a single variable \(x_i\) in the model, we might have \[ Y_i=\beta_0+\beta_1 x_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad i = 1, \ldots, n, \] so \[ \boldsymbol{X}=\begin{pmatrix} 1&x_1\cr 1&x_2\cr \vdots&\vdots\cr 1&x_n \end{pmatrix}, \qquad \boldsymbol{\beta}=\begin{pmatrix} \beta_0\cr\beta_1 \end{pmatrix}. \] There are two explanatory variables: the dummy variable and one ‘real’ variable.
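
In R, this model can be fitted with lm(), which includes the intercept (dummy) column automatically; model.matrix() returns the corresponding design matrix. A minimal sketch with simulated data (all numerical values are illustrative):

```r
set.seed(2)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50, sd = 0.5)   # simulate a response with beta_0 = 1, beta_1 = 2

fit <- lm(y ~ x)                       # intercept included by default
coef(fit)                              # estimates of beta_0 and beta_1
head(model.matrix(fit))                # design matrix: a column of ones and the x column
```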

Example 3.3 (Polynomial regression) If we want to allow for a non-linear impact of \(x_i\) on the mean of \(Y_i\), we might model \[ Y_i=\beta_0+\beta_1 x_i+\beta_2 x_i^2 +\ldots +\beta_{p-1} x_i^{p-1} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad i = 1, \ldots, n, \] so \[ \boldsymbol{X}= \begin{pmatrix} 1&x_1&x_1^2&\cdots&x_1^{p-1}\cr 1&x_2&x_2^2&\cdots&x_2^{p-1}\cr \vdots&\vdots&\vdots&\ddots&\vdots\cr 1&x_n&x_n^2&\cdots&x_n^{p-1} \end{pmatrix}, \qquad \boldsymbol{\beta}= \begin{pmatrix} \beta_0\cr\beta_1\cr\vdots\cr\beta_{p-1} \end{pmatrix}. \] There are \(p\) explanatory variables: the dummy variable and one ‘real’ variable, transformed to \(p-1\) variables.
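
A quadratic instance of this structure in R (with illustrative data) can be specified either with I() or with poly():

```r
set.seed(3)
x <- runif(40, -2, 2)
y <- 1 + x - 0.5 * x^2 + rnorm(40, sd = 0.3)     # quadratic mean (illustrative coefficients)

fit_quad <- lm(y ~ x + I(x^2))                   # design matrix columns: 1, x, x^2
# poly() with raw = TRUE gives the same parameterisation; without raw = TRUE it
# uses orthogonal polynomials (same fitted values, different coefficients)
fit_quad2 <- lm(y ~ poly(x, degree = 2, raw = TRUE))
head(model.matrix(fit_quad))
```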

Example 3.4 (Multiple regression) To include multiple explanatory variables, we might model \[ Y_i=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2} +\ldots +\beta_{p-1} x_{i\,p-1} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad i = 1, \ldots, n, \] so \[ \boldsymbol{X}= \begin{pmatrix} 1&x_{11}&x_{12}&\cdots&x_{1\,p-1}\cr 1&x_{21}&x_{22}&\cdots&x_{2\,p-1}\cr \vdots&\vdots&\vdots&\ddots&\vdots\cr 1&x_{n1}&x_{n2}&\cdots&x_{n\,p-1} \end{pmatrix}, \qquad \boldsymbol{\beta}=\begin{pmatrix} \beta_0\cr\beta_1\cr\vdots\cr\beta_{p-1} \end{pmatrix}. \] There are \(p\) explanatory variables: the dummy variable and \(p-1\) ‘real’ variables.
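
The corresponding R call simply lists the explanatory variables in the model formula; for example, with three ‘real’ variables (so \(p = 4\), using made-up data):

```r
set.seed(4)
x1 <- rnorm(60); x2 <- rnorm(60); x3 <- runif(60)
y <- 1 + 0.5 * x1 - 2 * x2 + x3 + rnorm(60, sd = 0.4)   # illustrative coefficients

fit_mult <- lm(y ~ x1 + x2 + x3)
coef(fit_mult)                  # beta_0, beta_1, beta_2, beta_3
dim(model.matrix(fit_mult))     # n x p, here 60 x 4
```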

Example 3.5 (One categorical explanatory variable) Suppose \(x_i\) is a categorical variable, taking values in a set of \(k\) possible categories. For simplicity of notation, we will give each category a number, and write \(x_i \in \{1, \ldots, k\}\). We wish to model \[Y_i = \mu_{x_i} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad i = 1, \ldots, n,\] so that the mean of \(Y_i\) is the same for all observations in the same category, but differs for different categories.

We could rewrite this model to include an intercept, as \[Y_i = \beta_0 + \beta_{x_i} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad i = 1, \ldots, n,\] so that \(\mu_j = \beta_0 + \beta_{j}\), for \(j = 1, \ldots, k\). It is not possible to estimate all of the \(\boldsymbol \beta\) parameters separately, as they only affect the distribution through the combination \(\beta_0 + \beta_{j}\). Instead, we choose a reference category \(l\), and set \(\beta_l = 0\). The intercept term \(\beta_0\) then gives the mean for the reference category, with \(\beta_j\) giving the difference in mean between category \(j\) and the reference category. In R, categorical variables are called factors, and by default the reference category will be the first category when the names of the categories (the levels of the factor) are sorted alphabetically.
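
A small R sketch of this default behaviour (the data here are arbitrary); relevel() can be used to choose a different reference category:

```r
set.seed(5)
x <- factor(sample(c("a", "b", "c"), size = 30, replace = TRUE))
y <- rnorm(30)

fit <- lm(y ~ x)
coef(fit)   # (Intercept) is the mean for category 'a' (first alphabetically);
            # xb and xc are the differences of categories 'b' and 'c' from 'a'

x_relevelled <- relevel(x, ref = "c")   # make 'c' the reference category instead
coef(lm(y ~ x_relevelled))
```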

We can rewrite the model as a form of multiple regression by first defining a new explanatory variable \(\boldsymbol{z}_i\) \[\boldsymbol{z}_i = (z_{i1}, \ldots, z_{ik})^T,\] where

\[z_{ij} = \begin{cases} 1 & \text{if $x_i= j$,} \\ 0 & \text{otherwise.}\end{cases}\]

\(\boldsymbol{z}_i\) is sometimes called the one-hot encoding of \(x_i\), as it contains precisely one \(1\) (corresponding to the category \(x_i\)), and is \(0\) everywhere else. We then have

\[Y_i=\beta_0+\beta_1 z_{i1}+\beta_2 z_{i2} +\ldots +\beta_{k} z_{ik} + \epsilon_i,\]

so

\[\boldsymbol{X}= \begin{pmatrix} 1 & z_{11} & z_{12} &\cdots & z_{1k}\cr 1 & z_{21} & z_{22} &\cdots & z_{2k}\cr \vdots&\vdots&\vdots&\ddots&\vdots\cr 1 & z_{n1} & z_{n2} &\cdots & z_{nk} \end{pmatrix}, \qquad \boldsymbol{\beta}= \begin{pmatrix} \beta_0\cr\beta_1\cr\vdots\cr\beta_{k} \end{pmatrix},\]

where each row of \(\boldsymbol X\) will have two ones, and the remaining entries will be zero. Note that the \(k\) indicator columns sum to the intercept column, so this \(\boldsymbol X\) is not of full rank; imposing the reference-category constraint \(\beta_l = 0\) effectively removes the corresponding column, leaving a full-rank design matrix with \(k\) columns.
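
In R, model.matrix() makes the relationship between the full one-hot encoding and the default reference-category coding explicit. A small sketch with an arbitrary factor:

```r
x <- factor(c("a", "b", "c", "a", "b"))

model.matrix(~ x - 1)   # full one-hot encoding: one indicator column per category, no intercept
model.matrix(~ x)       # default coding: intercept plus k - 1 indicator columns;
                        # the reference category ('a') is absorbed into the intercept
```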

Example 3.6 (Two categorical explanatory variables) Suppose we have two categorical variables \(x_{i1} \in \{1, \ldots, k_1\}\) and \(x_{i2} \in \{1, \ldots, k_2\}.\) We might consider a model \[Y_i=\beta_0+\beta^{(1)}_{x_{i1}} + \beta^{(2)}_{x_{i2}} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad i = 1, \ldots, n,\] where \[\boldsymbol \beta = \left(\beta_0, \beta^{(1)}_1, \ldots, \beta^{(1)}_{k_1}, \beta^{(2)}_1, \ldots, \beta^{(2)}_{k_2}\right)^T.\] As in Example 3.5, we choose reference categories \(l_1\) and \(l_2\) for the two categorical variables, and set \(\beta^{(1)}_{l_1} = \beta^{(2)}_{l_2} = 0\). The terms \(\beta^{(1)}_j\) are called the main effects for the categorical variable \(x_{i1}\), and \(\beta^{(2)}_j\) are the main effects for \(x_{i2}\).

We might also want to allow an interaction between \(x_{i1}\) and \(x_{i2}\), letting \[Y_i=\beta_0+\beta^{(1)}_{x_{i1}} + \beta^{(2)}_{x_{i2}} + \beta^{(1, 2)}_{x_{i1}, x_{i2}} + \epsilon_i,\] where \[\boldsymbol \beta = \left(\beta_0, \beta^{(1)}_1, \ldots, \beta^{(1)}_{k_1}, \beta^{(2)}_1, \ldots, \beta^{(2)}_{k_2}, \beta^{(1, 2)}_{1, 1}, \beta^{(1, 2)}_{1, 2}, \ldots, \beta^{(1, 2)}_{k_1, k_2}\right)^T.\] The terms \(\beta^{(1, 2)}_{j_1, j_2}\) are called the interaction effects. This model is equivalent to \[Y_i= \mu_{x_{i1}, x_{i2}} + \epsilon_i,\] allowing a different mean for each possible combination of categories. To allow us to estimate the parameters, given reference categories \(l_1\) and \(l_2\), we set \[\beta^{(1)}_{l_1} = \beta^{(2)}_{l_2} = 0; \qquad \beta^{(1, 2)}_{l_1, j} = 0, \quad j = 1, \ldots, k_2; \qquad \beta^{(1, 2)}_{j, l_2} = 0, \quad j = 1, \ldots, k_1.\] As in Example 3.5, it is possible to rewrite the model with a design matrix \(\boldsymbol X\), by using one-hot encoding of \(x_{i1}\) and \(x_{i2}\).
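
In R, the main-effects model and the interaction model correspond to the formulae y ~ f1 + f2 and y ~ f1 * f2 respectively (f1 * f2 expands to f1 + f2 + f1:f2). A minimal sketch with arbitrary data:

```r
set.seed(6)
f1 <- factor(sample(c("low", "high"), size = 40, replace = TRUE))
f2 <- factor(sample(c("A", "B", "C"), size = 40, replace = TRUE))
y  <- rnorm(40)

fit_main <- lm(y ~ f1 + f2)   # main effects only
fit_int  <- lm(y ~ f1 * f2)   # main effects plus interaction effects
coef(fit_int)                 # reference categories chosen by R as in Example 3.5
```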

3.3 Maximum likelihood estimation

The regression coefficients \(\beta_1, \ldots, \beta_p\) describe the pattern by which the response depends on the explanatory variables. We use the observed data \(y_1, \ldots, y_n\) to estimate this pattern of dependence.

The likelihood for a linear model is \[ L(\boldsymbol{\beta},\sigma^2)=\left(2\pi\sigma^2\right)^{-{n\over 2}} \exp\left(-{1\over{2\sigma^2}} \sum_{i=1}^n (y_i-\boldsymbol{x}_i^T\boldsymbol{\beta})^2\right). \tag{3.3}\]

This is maximised with respect to \((\boldsymbol{\beta},\sigma^2)\) at \[ \hat{\boldsymbol{\beta}}=(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y} \] and \[ \hat \sigma^2 = \frac{1}{n} \sum_{i=1}^n \left(y_i-\boldsymbol{x}_i^T\hat{\boldsymbol{\beta}}\right)^2. \]
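
These closed-form estimates are easy to compute directly. A minimal R sketch with simulated data and arbitrary parameter values, compared against lm(); the formula y ~ X - 1 suppresses R's own intercept because this X already contains a column of ones:

```r
set.seed(7)
n <- 50
X <- cbind(1, rnorm(n), runif(n))                    # design matrix: intercept plus two variables
beta <- c(1, -0.5, 2)                                # 'true' coefficients (illustrative)
y <- as.vector(X %*% beta + rnorm(n, sd = 0.3))

beta_hat   <- solve(crossprod(X), crossprod(X, y))   # (X^T X)^{-1} X^T y
sigma2_hat <- sum((y - X %*% beta_hat)^2) / n        # maximum likelihood estimate of sigma^2

coef(lm(y ~ X - 1))                                  # lm() gives the same least-squares solution
```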

The corresponding fitted values are \[\hat{\boldsymbol{y}}=\boldsymbol{X}\hat{\boldsymbol{\beta}}=\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}\] or \[\hat{y}_i=\boldsymbol{x}_i^T\hat{\boldsymbol{\beta}}, \quad i = 1, \ldots, n.\]

The residuals \(\boldsymbol{r} = (r_1, \ldots, r_n)^T\) are \(\boldsymbol{r}=\boldsymbol{y}-\boldsymbol{\hat y}\), or \(r_i=y_i-\boldsymbol{x}_i^T\hat{\boldsymbol{\beta}}\) for \(i = 1, \ldots, n\). These residuals describe the variability in the observed responses \(y_1, \ldots, y_n\) which has not been explained by the linear model. We call \[ D= \sum_{i=1}^n r_i^2 = \sum_{i=1}^n \left(y_i-\boldsymbol{x}_i^T\hat{\boldsymbol{\beta}}\right)^2\] the residual sum of squares or deviance for the linear model.
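
In R, the residuals and the deviance of a fitted linear model can be extracted directly; for lm objects, deviance() returns the residual sum of squares. A small sketch with made-up data:

```r
x <- c(1.2, 0.8, 2.5, 3.1, 1.9)     # small illustrative data set
y <- c(2.0, 1.5, 4.1, 5.2, 3.0)
fit <- lm(y ~ x)

r <- resid(fit)                     # residuals r_i = y_i - x_i^T beta_hat
D <- sum(r^2)                       # residual sum of squares
all.equal(D, deviance(fit))         # deviance() returns the same quantity for lm fits
```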

3.4 Properties of the MLE

As \(\boldsymbol Y\) is normally distributed, and \(\hat{\boldsymbol{\beta}}= (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y}\) is a linear function of \(\boldsymbol Y\), \(\hat{\boldsymbol{\beta}}\) must also be normally distributed. Its mean is \[E(\hat{\boldsymbol{\beta}})=(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T E(\boldsymbol Y)=(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta}=\boldsymbol{\beta},\] so \(\hat{\boldsymbol{\beta}}\) is unbiased, and its variance-covariance matrix is \[\text{Var}(\hat{\boldsymbol{\beta}})=(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\,\text{Var}(\boldsymbol Y)\,\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1}=\sigma^2(\boldsymbol{X}^T\boldsymbol{X})^{-1}.\] Therefore \[\hat{\boldsymbol{\beta}} \sim N(\boldsymbol{\beta}, \sigma^2(\boldsymbol{X}^T\boldsymbol{X})^{-1}).\]

It is possible to prove (although we shall not do so here) that \[ {D\over\sigma^2}\sim\chi^2_{n-p}, \] which implies that \[ E(\hat \sigma^2)={{n-p}\over n}\sigma^2, \] so the maximum likelihood estimator is biased for \(\sigma^2\) (although still asymptotically unbiased, since \({{n-p}\over n}\to 1\) as \(n\to\infty\)). We often use the unbiased estimator of \(\sigma^2\) \[ \tilde \sigma^2={D\over {n-p}}={1\over {n-p}}\sum_{i=1}^n r_i^2. \] The denominator \(n-p\), the number of observations minus the number of linear coefficients in the model, is called the degrees of freedom of the model. Therefore, we estimate the residual variance by the deviance divided by the degrees of freedom.
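
In R, summary() of an lm fit reports the residual standard error, which is the square root of the unbiased estimator \(\tilde\sigma^2 = D/(n-p)\), and df.residual() gives the degrees of freedom \(n-p\). A brief check with simulated (illustrative) data:

```r
set.seed(8)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30, sd = 0.7)
fit <- lm(y ~ x)                          # n = 30 observations, p = 2 coefficients

n <- length(y)
p <- length(coef(fit))
deviance(fit) / (n - p)                   # unbiased estimator D / (n - p)
df.residual(fit)                          # degrees of freedom, n - p = 28
summary(fit)$sigma^2                      # the same value, as reported by lm
```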

3.5 Comparing linear models

If we have a set of competing linear models which might explain the dependence of the response on the explanatory variables, we will want to determine which of the models is most appropriate.

As described previously, we proceed by comparing models pairwise using a likelihood ratio test. For linear models this kind of comparison is restricted to situations where one of the models, \(H_0\), is nested in the other, \(H_1\). This usually means that the explanatory variables present in \(H_0\) are a subset of those present in \(H_1\). In this case model \(H_0\) is a special case of model \(H_1\), where certain coefficients are set equal to zero. We let \(\boldsymbol \theta\) represent the collection of linear parameters for model \(H_1\), together with the residual variance \(\sigma^2\), and let \(\Theta^{(1)}\) be the unrestricted parameter space for \(\boldsymbol \theta\). Then \(\Theta^{(0)}\) is the parameter space corresponding to model \(H_0\), i.e. with the appropriate coefficients constrained to zero.

We will assume that model \(H_1\) contains \(p\) linear parameters and model \(H_0\) a subset of \(q<p\) of these. Without loss of generality, we can think of \(H_1\) as the model

\[ Y_i=\sum_{j=1}^p x_{ij} \beta_j + \epsilon_i, \quad i = 1, \ldots, n \]

and \(H_0\) being the same model with

\[\beta_{q+1}=\beta_{q+2}=\cdots=\beta_p=0. \]

Now, a likelihood ratio test of \(H_0\) against \(H_1\) has a critical region of the form

\[ C=\left\{ \boldsymbol{y}:{{\max_{(\boldsymbol{\beta},\sigma^2)\in \Theta^{(1)}}L(\boldsymbol{\beta},\sigma^2)}\over {\max_{(\boldsymbol{\beta},\sigma^2)\in \Theta^{(0)}}L(\boldsymbol{\beta},\sigma^2)}} >k\right\} \]

where \(k\) is determined by \(\alpha\), the size of the test, so \[ \max_{\boldsymbol \theta\in\Theta^{(0)}}P(\boldsymbol{Y}\in C;\boldsymbol\theta)=\alpha. \] For a linear model, \[ L(\boldsymbol{\beta},\sigma^2)=\left(2\pi\sigma^2\right)^{-{n\over 2}} \exp\left(-{1\over{2\sigma^2}} \sum_{i=1}^n (y_i-\boldsymbol{x}_i^T\boldsymbol{\beta})^2\right). \] This is maximised with respect to \((\boldsymbol{\beta},\sigma^2)\) at \(\boldsymbol{\beta}=\hat{\boldsymbol{\beta}}\) and \(\sigma^2=\hat \sigma^2=D/n\). Therefore \[\begin{align*} \max_{\boldsymbol{\beta},\sigma^2} L(\boldsymbol{\beta},\sigma^2)&=(2\pi D/n)^{-{n\over 2}} \exp\left(-{n\over{2D}} \sum_{i=1}^n (y_i-\boldsymbol{x}_i^T\hat{\boldsymbol{\beta}})^2\right)\cr &=(2\pi D/n)^{-{n\over 2}} \exp\left(-{n\over2}\right). \end{align*}\] The same form applies whether we maximise over \(\Theta^{(0)}\) or \(\Theta^{(1)}\); only the fitted model, and hence the deviance, changes. Let the deviances under models \(H_0\) and \(H_1\) be denoted by \(D_0\) and \(D_1\) respectively. Then the critical region for the likelihood ratio test is of the form \[{{(2\pi D_1/n)^{-{n\over 2}}}\over{(2\pi D_0/n)^{-{n\over 2}}}}>k\] so \[\left({{D_0}\over{D_1}}\right)^{n\over 2}>k,\] which is equivalent to \[{{D_0}\over{D_1}}-1>k^{2/n}-1,\] and hence, multiplying both sides by the constant \({{n-p}\over{p-q}}\), to \[\left({{D_0}\over{D_1}}-1\right){{n-p}\over{p-q}}>k'\] for some constant \(k'\). Rearranging, \[{{(D_0-D_1)/(p-q)}\over{D_1/(n-p)}}>k'.\] We refer to the left-hand side of this inequality as the \(F\)-statistic. We reject the simpler model \(H_0\) in favour of the more complex model \(H_1\) if \(F\) is ‘too large’.

As we have required \(H_0\) to be nested in \(H_1\), \(F \sim F_{p-q, \, n-p}\) when \(H_0\) is true. To see this, note that \[ {{D_0}\over\sigma^2}={{D_0-D_1}\over\sigma^2}+{{D_1}\over\sigma^2}. \] Furthermore, under \(H_0\), \(D_1/\sigma^2 \sim \chi^2_{n-p}\) and \(D_0/\sigma^2 \sim \chi^2_{n-q}\). It is possible to show (although we will not do so here) that under \(H_0\), \((D_0-D_1)/\sigma^2\) and \(D_1/\sigma^2\) are independent. Therefore, from the properties of the chi-squared distribution, it follows that under \(H_0\), \((D_0-D_1)/\sigma^2 \sim \chi^2_{p-q}\), and hence \(F \sim F_{p-q,\,n-p}\).

Therefore, the precise critical region can be evaluated given the size, \(\alpha\), of the test. We reject \(H_0\) in favour of \(H_1\) when \[ {{(D_0-D_1)/(p-q)}\over{D_1/(n-p)}}>k \] where \(k\) is the \(100(1-\alpha)\%\) point of the \(F_{p-q,\,n-p}\) distribution.
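
In R, this comparison is carried out by anova() applied to two nested lm fits; the same F-statistic can also be computed directly from the two deviances. A minimal sketch, with data simulated under \(H_0\) using arbitrary values:

```r
set.seed(9)
n <- 40
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.8 * x1 + rnorm(n, sd = 0.5)   # simulated with the coefficient of x2 equal to zero

fit0 <- lm(y ~ x1)          # H0: q = 2 linear parameters
fit1 <- lm(y ~ x1 + x2)     # H1: p = 3 linear parameters

D0 <- deviance(fit0); D1 <- deviance(fit1)
p <- 3; q <- 2
F_stat <- ((D0 - D1) / (p - q)) / (D1 / (n - p))

F_stat > qf(0.95, df1 = p - q, df2 = n - p)   # reject H0 at size alpha = 0.05?
anova(fit0, fit1)                             # anova() reports the same F-statistic and p-value
```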