5  Generalised Linear Models

5.1 Regression models for non-normal data

The linear model of Chapter 3 assumes each response \(Y_i \sim N(\mu_i, \sigma^2)\), where the mean \(\mu_i\) depends on explanatory variables through \(\mu_i = \boldsymbol x_i^T \boldsymbol \beta\). For many types of data, this assumption of normality of the response may not be justified. For instance, we might have

  • a binary response (\(Y_i \in \{0, 1\}\)), e.g., representing whether or not a patient recovers from a disease. A natural model is that \(Y_i \sim \text{Bernoulli}(p_i)\), and we might want to model how the ‘success’ probability \(p_i\) depends on explanatory variables \(\boldsymbol x_i\).
  • a count response (\(Y_i \in \{0, 1, 2, 3, \ldots\}\)), e.g., representing the number of customers arriving at a shop. A natural model is that \(Y_i \sim \text{Poisson}(\lambda_i)\), and we might want to model how the rate \(\lambda_i\) depends on explanatory variables.

In Section 5.2, we define the exponential family, which includes the Bernoulli and Poisson distributions as special cases. In a generalised linear model, the response distribution is assumed to be a member of the exponential family.

To complete the specification of a generalised linear model, we will need to model how the parameters of the response distribution (e.g. the success probability \(p_i\) or the rate \(\lambda_i\)) depend on explanatory variables \(\boldsymbol x_i\). We need to do this in a way which respects constraints on the possible values which these parameters may take; for instance, we should not model \(p_i = \boldsymbol x_i^T \boldsymbol \beta\) directly, as we need to enforce \(p_i \in [0, 1]\).

5.2 The exponential family

A probability distribution is said to be a member of the exponential family if its probability density function (or probability function, if discrete) can be written in the form

\[ f_Y(y;\theta,\phi)=\exp\left({{y\theta-b(\theta)}\over{a(\phi)}} +c(y,\phi)\right). \tag{5.1}\]

The parameter \(\theta\) is called the natural or canonical parameter. The parameter \(\phi\) is usually assumed known; if it is unknown, it is often called a nuisance parameter.

The density Equation 5.1 can be thought of as a likelihood resulting from a single observation \(y\). Then the log-likelihood is

\[\ell(\theta,\phi)={{y\theta-b(\theta)}\over{a(\phi)}} +c(y,\phi)\]

and the score is

\[u(\theta)=\frac{\partial}{\partial \theta}\ell(\theta,\phi) ={{y-\frac{\partial}{\partial \theta} b(\theta)}\over{a(\phi)}} ={{y- b'(\theta)}\over{a(\phi)}}.\]

The Hessian is

\[H(\theta)=\frac{\partial^2}{\partial \theta^2}\ell(\theta,\phi) =-{{\frac{\partial^2}{\partial \theta^2} b(\theta)}\over{a(\phi)}} =-{{b''(\theta)}\over{a(\phi)}}\]

so the expected information is

\[{\cal I}(\theta)=E[-H(\theta)]={{b''(\theta)}\over{a(\phi)}}.\]

From the properties of the score function in Section 2.4, we know that \(E[U(\theta)]=0\). Therefore

\[E\left[{{Y- b'(\theta)}\over{a(\phi)}}\right]=0,\]

so \(E[Y]=b'(\theta)\). We often denote the mean by \(\mu\), so \(\mu=b'(\theta)\).

Furthermore, \[ \text{Var}[U(\theta)]= \text{Var}\left[{{Y- b'(\theta)}\over{a(\phi)}}\right]= {{\text{Var}[Y]}\over{a(\phi)^2}}, \]

as \(b'(\theta)\) and \(a(\phi)\) are constants (not random variables). We also know from Section 2.5 that \(\text{Var}[U(\theta)]={\cal I}(\theta)\). Therefore,

\[ \text{Var}[Y]=a(\phi)^2\text{Var}[U(\theta)]=a(\phi)^2 {\cal I}(\theta) = a(\phi)b''(\theta). \]

Hence the mean and variance of a random variable with probability density function (or probability function) of the form Equation 5.1 are \(b'(\theta)\) and \(a(\phi)b''(\theta)\) respectively.

The variance is the product of two functions: \(b''(\theta)\) depends on the canonical parameter \(\theta\) (and hence on \(\mu\)) only and is called the variance function, written \(V(\mu)\equiv b''(\theta)\); \(a(\phi)\) is sometimes of the form \(a(\phi)=\sigma^2/w\), where \(w\) is a known weight and \(\sigma^2\) is called the dispersion parameter or scale parameter.

Example 5.1 (Normal distribution) Suppose \(Y\sim N(\mu, \, \sigma^2)\). Then \[\begin{align*} f_Y(y;\mu,\sigma^2)&= {1\over{\sqrt{2\pi\sigma^2}}} \exp\left(-{1\over{2\sigma^2}}(y-\mu)^2\right)\quad\;\; y\in\mathbb{R};\;\;\mu\in\mathbb{R}\cr &= \exp\left({{y\mu-{1\over 2}\mu^2}\over \sigma^2}-{1\over 2}\left[ {{y^2}\over\sigma^2}+\log(2\pi\sigma^2)\right]\right). \end{align*}\] This is in the form Equation 5.1, with \(\theta=\mu\), \(b(\theta)={1\over 2}\theta^2\), \(a(\phi)=\sigma^2\) and \[c(y,\phi)=-{1\over 2}\left[ {{y^2}\over{a(\phi)}}+\log(2\pi a[\phi])\right]. \] Therefore \[E(Y)=b'(\theta)=\theta=\mu,\] \[\text{Var}(Y)=a(\phi)b''(\theta)=\sigma^2\] and the variance function is \[V(\mu)=1.\]

Example 5.2 (Poisson distribution) Suppose \(Y\sim \text{Poisson}(\lambda)\). Then \[\begin{align*} f_Y(y;\lambda)&= {{\exp(-\lambda)\lambda^y}\over{y!}} \qquad y\in\{0,1,\ldots\};\quad\lambda\in\mathbb{R}_+\cr &= \exp\left(y\log\lambda-\lambda-\log y!\right). \end{align*}\] This is in the form Equation 5.1, with \(\theta=\log\lambda\), \(b(\theta)=\exp\theta\), \(a(\phi)=1\) and \(c(y,\phi)=-\log y!\). Therefore \[E(Y)=b'(\theta)=\exp\theta=\lambda,\] \[\text{Var}(Y)=a(\phi)b''(\theta)=\exp\theta=\lambda\] and the variance function is \[V(\mu)=\mu.\]

Example 5.3 (Bernoulli distribution) Suppose \(Y\sim \text{Bernoulli}(p)\). Then \[\begin{align*} f_Y(y;p)&= p^y(1-p)^{1-y}\qquad y\in\{0,1\};\quad p\in(0,1)\cr &= \exp\left(y\log{p\over{1-p}}+\log(1-p)\right). \end{align*}\] This is in the form Equation 5.1, with \(\theta=\log{p\over{1-p}}\), \(b(\theta)=\log(1+\exp\theta)\), \(a(\phi)=1\) and \(c(y,\phi)=0\). Therefore \[E(Y)=b'(\theta)={{\exp\theta}\over{1+\exp\theta}}=p,\] \[\text{Var}(Y)=a(\phi)b''(\theta)={{\exp\theta}\over{(1+\exp\theta)^2}}=p(1-p)\] and the variance function is \[V(\mu)=\mu(1-\mu).\]

Example 5.4 (Binomial distribution) Suppose \(Y^*\sim \text{Binomial}(n,p)\). Here, \(n\) is assumed known (as usual) and the random variable \(Y= Y^*/n\) is taken as the proportion of successes, so \[\begin{align*} f_Y(y;p)&=\left({n\atop{ny}}\right) p^{ny} (1-p)^{n(1-y)}\qquad y\in\left\{0,{1\over n},{2\over n},\ldots ,1\right\}; \quad p\in(0,1)\cr &= \exp\left({{y\log{p\over{1-p}}+\log(1-p)}\over{1\over n}} +\log\!\left({n\atop{ny}}\right)\right). \end{align*}\] This is in the form Equation 5.1, with \(\theta=\log{p\over{1-p}}\), \(b(\theta)=\log(1+\exp\theta)\), \(a(\phi)={1\over n}\) and \(c(y,\phi)=\log\!\left({n\atop{ny}}\right)\). Therefore \[E(Y)=b'(\theta)={{\exp\theta}\over{1+\exp\theta}}=p,\] \[\text{Var}(Y)=a(\phi)b''(\theta)={1\over n}{{\exp\theta}\over{(1+\exp\theta)^2}}= {{p(1-p)}\over n}\] and the variance function is \[V(\mu)=\mu(1-\mu).\] Here, we can write \(a(\phi)\equiv \sigma^2/w\) where the scale parameter \(\sigma^2=1\) and the weight \(w\) is \(n\), the binomial denominator.
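
As a quick numerical check of these identities, the following sketch (not part of the notes; it assumes only numpy is available) simulates Poisson data and compares the empirical mean and variance with \(b'(\theta)\) and \(a(\phi)b''(\theta)\) from Example 5.2.

```python
# A numerical check (not part of the notes; assumes numpy) that for the
# Poisson distribution with theta = log(lambda), b(theta) = exp(theta) and
# a(phi) = 1, we recover E[Y] = b'(theta) and Var[Y] = a(phi) b''(theta).
import numpy as np

rng = np.random.default_rng(1)
lam = 3.5
theta = np.log(lam)
b_prime = np.exp(theta)          # b'(theta)  = exp(theta) = lambda
b_double_prime = np.exp(theta)   # b''(theta) = exp(theta) = lambda

y = rng.poisson(lam, size=200_000)
print(y.mean(), b_prime)               # both approximately 3.5
print(y.var(), 1.0 * b_double_prime)   # both approximately 3.5
```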

5.3 Components of a generalised linear model

5.3.1 The random component

As in a linear model, the aim is to determine the pattern of dependence of a response variable on explanatory variables. We denote the \(n\) observations of the response by \(\boldsymbol{y}=(y_1,y_2,\ldots ,y_n)^T\). In a generalised linear model (GLM), these are assumed to be observations of independent random variables \(\boldsymbol{Y}=(Y_1,Y_2,\ldots ,Y_n)^T\), each having a distribution from the same exponential family. In other words, the functions \(a\), \(b\) and \(c\), and usually the scale parameter \(\phi\), are the same for all observations, but the canonical parameter \(\theta_i\) may differ between observations. Therefore, we write

\[ f_{Y_i}(y_i;\theta_i,\phi_i)= \exp\left({{y_i\theta_i-b(\theta_i)}\over{a(\phi_i)}} +c(y_i,\phi_i)\right) \]

and the joint density for \(\boldsymbol{Y}=(Y_1,Y_2,\ldots ,Y_n)^T\) is

\[ \begin{aligned} f_{\boldsymbol{Y}}(\boldsymbol{y};\boldsymbol{\theta},\boldsymbol{\phi}) &= \prod_{i=1}^n f_{Y_i}(y_i;\theta_i,\phi_i) \cr &= \exp\left(\sum_{i=1}^n{{y_i\theta_i-b(\theta_i)}\over{a(\phi_i)}} +\sum_{i=1}^nc(y_i,\phi_i)\right) \end{aligned} \tag{5.2}\]

where \(\boldsymbol{\theta}=(\theta_1,\ldots ,\theta_n)^T\) is the collection of canonical parameters and \(\boldsymbol{\phi}=(\phi_1,\ldots ,\phi_n)^T\) is the collection of nuisance parameters (where they exist).

Note that for a particular sample of observed responses, \(\boldsymbol{y}=(y_1,y_2,\ldots ,y_n)^T\), Equation 5.2 is the likelihood function \(L(\boldsymbol{\theta}, \boldsymbol{\phi})\) for \(\boldsymbol{\theta}\) and \(\boldsymbol{\phi}\).

5.3.2 The systematic (or structural) component

Associated with each \(y_i\) is a vector \(\boldsymbol{x}_i=(x_{i1},x_{i2},\ldots ,x_{ip})^T\) of \(p\) explanatory variables. In a generalised linear model, the distribution of the response variable \(Y_i\) depends on \(\boldsymbol{x}_i\) through the linear predictor \(\eta_i\) where

\[ \begin{aligned} \eta_i &=\beta_1 x_{i1} +\beta_2 x_{i2} +\ldots + \beta_p x_{ip} \notag \\ &= \sum_{j=1}^p x_{ij} \beta_j \notag \\ &= \boldsymbol{x}_i^T \boldsymbol{\beta} \notag \\ &= [\boldsymbol{X}\boldsymbol{\beta}]_i,\qquad i=1,\ldots ,n, \end{aligned} \tag{5.3}\] where, as with a linear model, \[ \boldsymbol{X}=\begin{pmatrix} \boldsymbol{x}_1^T\cr\vdots\cr \boldsymbol{x}_n^T \end{pmatrix} =\begin{pmatrix} x_{11}&\cdots&x_{1p}\cr\vdots&\ddots&\vdots\cr x_{n1}&\cdots&x_{np}\end{pmatrix} \]

and \(\boldsymbol{\beta}=(\beta_1,\ldots ,\beta_p)^T\) is a vector of fixed but unknown parameters describing the dependence of \(Y_i\) on \(\boldsymbol{x}_i\). The four ways of describing the linear predictor in Equation 5.3 are equivalent, but the most economical is the matrix form

\[ \boldsymbol{\eta}=\boldsymbol{X}\boldsymbol{\beta}. \tag{5.4}\]

Again, we call the \(n\times p\) matrix \(\boldsymbol{X}\) the design matrix. The \(i\)th row of \(\boldsymbol{X}\) is \(\boldsymbol{x}_i^T\), the explanatory data corresponding to the \(i\)th observation of the response. The \(j\)th column of \(\boldsymbol{X}\) contains the \(n\) observations of the \(j\)th explanatory variable.

As for the linear model in Section 3.2, this structure allows quite general dependence of the linear predictor on explanatory variables. For instance, we can allow non-linear dependence of \(\eta_i\) on a variable \(x_i\) through polynomial regression (as in Example 3.3), or include categorical explanatory variables (as in Example 3.5 and Example 3.6).

5.4 Examples of generalised linear models

5.4.1 The linear model

The linear model considered in Chapter 3 is also a generalised linear model. We assume \({Y_1,\ldots ,Y_n}\) are independent normally distributed random variables, so that \(Y_i \sim N(\mu_i, \sigma^2)\). We have seen in Example 5.1 that the normal distribution is a member of the exponential family.

The explanatory variables enter a linear model through the linear predictor \[ \eta_i=\boldsymbol{x}_i^T\boldsymbol{\beta}, \quad i = 1, \ldots, n. \]

The link between \(E(\boldsymbol{Y})=\boldsymbol{\mu}\) and the linear predictor \(\boldsymbol{\eta}\) is through the (canonical) identity link function \[ \mu_i=\eta_i, \quad i = 1, \ldots, n. \]

5.4.2 Models for binary data

In binary regression, we assume either \(Y_i \sim \text{Bernoulli}(p_i)\), or \(Y_i \sim \text{binomial}(n_i, p_i)\), where \(n_i\) are known. The objective is to model the success probability \(p_i\) as a function of the explanatory variables \(\boldsymbol x_i\). We have seen in Example 5.3 and Example 5.4 that the Bernoulli and binomial distributions are members of the exponential family.

When the canonical (logit) link is used, we have \[\text{logit}(p_i) = \log \frac{p_i}{1-p_i} = \eta_i = \boldsymbol{x}_i^T\boldsymbol{\beta}.\] This implies \[p_i = \frac{ \exp(\eta_i) }{1+ \exp(\eta_i)} = \frac{1}{1+ \exp(-\eta_i)}.\] The function \(F(\eta) = \frac{1}{1+ \exp(-\eta)}\) is the cumulative distribution function (cdf) of a distribution called the logistic distribution.

The cumulative distribution functions of other distributions are also commonly used to generate link functions for binary regression. For example, if we let \[p_i = \Phi(\boldsymbol{x}_i^T \boldsymbol{\beta}) = \Phi(\eta_i),\] where \(\Phi(\cdot)\) is the cdf of the standard normal distribution, then we get the link function \[g(\mu) = g(p) = \Phi^{-1}(\mu) = \eta,\] which is called the probit link.
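
The two links can be compared directly. The sketch below (illustrative only; it assumes numpy and scipy are available) evaluates the inverse logit and inverse probit functions on a grid of linear predictor values; both map the whole real line into \((0,1)\).

```python
# An illustrative comparison (not from the notes) of the inverse logit and
# inverse probit functions: both map a linear predictor eta in R to a
# probability in (0, 1).
import numpy as np
from scipy.stats import norm

def inv_logit(eta):
    return 1.0 / (1.0 + np.exp(-eta))   # p = exp(eta) / (1 + exp(eta))

def inv_probit(eta):
    return norm.cdf(eta)                # p = Phi(eta)

eta = np.linspace(-4.0, 4.0, 9)
print(np.round(inv_logit(eta), 3))
print(np.round(inv_probit(eta), 3))
```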

5.4.3 Models for count data

If \(Y_i\) represent counts of the number of times an event occurs in a fixed time (or a fixed region of space), we might model \(Y_i \sim \text{Poisson}(\lambda_i)\). We have seen in Example 5.2 that the Poisson distribution is a member of the exponential family.

With the canonical (log) link, we have \[\log \lambda_i = \eta_i = \boldsymbol{x}_i^T\boldsymbol{\beta},\] or \[\lambda_i = \exp\{\eta_i\} = \exp\{\boldsymbol{x}_i^T\boldsymbol{\beta}\}.\] This model is often called a log-linear model.

Now suppose that \(Y_i\) represents a count of the number of events which occur in a given region \(i\), for instance the number of times a particular drug is prescribed on a given day in district \(i\) of a country. We might want to model \(\lambda_i^*\), the prescription rate per patient in the district. Let \(N_i\) be the number of patients registered in district \(i\), often called the exposure of observation \(i\). We model \(Y_i \sim \text{Poisson}(N_i \lambda_i^*)\), where \[\log \lambda_i^* = \boldsymbol{x}_i^T\boldsymbol{\beta}.\] Equivalently, we may write the model as \(Y_i \sim \text{Poisson}(\lambda_i)\), where \[\log \lambda_i = \log N_i + \boldsymbol{x}_i^T\boldsymbol{\beta}\] (since \(\lambda_i = N_i \lambda_i^*\), so \(\log \lambda_i = \log N_i + \log \lambda_i^*\)). The log-exposure \(\log N_i\) appears as a fixed term in the linear predictor, without any associated parameter. Such a fixed term is called an offset.
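
The following sketch (with made-up coefficients and exposures, not taken from the notes) shows how an offset enters the linear predictor of a Poisson log-linear model: the log-exposure is added with a coefficient fixed at one, so the regression coefficients describe the rate per unit of exposure.

```python
# A sketch (made-up coefficients and exposures) of a Poisson model with an
# offset: the log-exposure log(N_i) enters the linear predictor with a fixed
# coefficient of 1, so beta describes the rate per patient.
import numpy as np

beta = np.array([-2.0, 0.3])            # hypothetical coefficients
X = np.array([[1.0, 0.0],               # intercept + one covariate
              [1.0, 1.0],
              [1.0, 2.0]])
N = np.array([500.0, 1200.0, 800.0])    # exposures (registered patients)

rate = np.exp(X @ beta)                 # lambda_i^*, rate per patient
lam = np.exp(np.log(N) + X @ beta)      # lambda_i = N_i * lambda_i^*, expected count
print(rate, lam)
```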

5.5 Maximum likelihood estimation

The regression coefficients \({\beta_1,\ldots ,\beta_p}\) describe the pattern by which the response depends on the explanatory variables. We use the observed data \({y_1,\ldots ,y_n}\) to estimate this pattern of dependence.

As usual, we maximise the log-likelihood function, which, from Equation 5.2, can be written \[ \ell(\boldsymbol{\beta},\boldsymbol{\phi})= \sum_{i=1}^n{{y_i\theta_i-b(\theta_i)}\over{a(\phi_i)}} +\sum_{i=1}^nc(y_i,\phi_i) \tag{5.6}\] and depends on \(\boldsymbol{\beta}\) through \[\begin{align*} \theta_i &= (b')^{-1}(\mu_i), \cr \mu_i&= g^{-1}(\eta_i), \cr \eta_i&=\boldsymbol{x}_i^T\boldsymbol{\beta}=\sum_{j=1}^p x_{ij} \beta_j, \quad i = 1, \ldots, n, \end{align*}\] where \(g\) is the link function, \(g(\mu_i)=\eta_i\). To find \(\hat{\boldsymbol{\beta}}\), we consider the scores \[ u_k(\boldsymbol{\beta})={\partial\over{\partial\beta_k}} \ell(\boldsymbol{\beta},\boldsymbol{\phi})\qquad k=1,\ldots ,p \] and then find \(\hat{\boldsymbol{\beta}}\) to solve \(u_k(\hat{\boldsymbol{\beta}})=0\) for \(k=1,\ldots ,p.\)

From Equation 5.6 \[\begin{align*} u_k(\boldsymbol{\beta})&= {\partial\over{\partial\beta_k}}\ell(\boldsymbol{\beta},\boldsymbol{\phi})\cr &= {\partial\over{\partial\beta_k}}\sum_{i=1}^n{{y_i\theta_i-b(\theta_i)}\over{a(\phi_i)}} +{\partial\over{\partial\beta_k}}\sum_{i=1}^nc(y_i,\phi_i)\cr &= \sum_{i=1}^n{\partial\over{\partial\beta_k}} \left[{{y_i\theta_i-b(\theta_i)}\over{a(\phi_i)}}\right]\cr &=\sum_{i=1}^n{\partial\over{\partial\theta_i}}\left[{{y_i\theta_i-b(\theta_i)} \over{a(\phi_i)}}\right]{{\partial\theta_i}\over{\partial\mu_i}} {{\partial\mu_i}\over{\partial\eta_i}}{{\partial\eta_i}\over{\partial\beta_k}}\cr &= \sum_{i=1}^n{{y_i-b'(\theta_i)} \over{a(\phi_i)}}{{\partial\theta_i}\over{\partial\mu_i}} {{\partial\mu_i}\over{\partial\eta_i}}{{\partial\eta_i}\over{\partial\beta_k}}, \quad{k=1,\ldots ,p},\cr \end{align*}\] where \[\begin{align*} {{\partial\theta_i}\over{\partial\mu_i}}&=\left[{{\partial\mu_i}\over{\partial\theta_i}}\right]^{-1} ={1\over{b''(\theta_i)}}\cr {{\partial\mu_i}\over{\partial\eta_i}}&=\left[{{\partial\eta_i}\over{\partial\mu_i}}\right]^{-1} ={1\over{g'(\mu_i)}}\cr {{\partial\eta_i}\over{\partial\beta_k}}&= {\partial\over{\partial\beta_k}}\sum_{j=1}^p x_{ij}\beta_j=x_{ik}. \end{align*}\] Therefore

\[ u_k(\boldsymbol{\beta})= \sum_{i=1}^n{{y_i-b'(\theta_i)}\over{a(\phi_i)}} {{x_{ik}}\over{b''(\theta_i)g'(\mu_i)}} =\sum_{i=1}^n{{y_i-\mu_i}\over{\text{Var}(Y_i)}} {{x_{ik}}\over{g'(\mu_i)}},\quad{k=1,\ldots ,p}, \tag{5.7}\]

which depends on \(\boldsymbol{\beta}\) through \(\mu_i\equiv E(Y_i)\) and \(\text{Var}(Y_i),\) \(i = 1, \ldots, n\).

In theory, we solve the \(p\) simultaneous equations \(u_k(\hat{\boldsymbol{\beta}})=0,\;{k=1,\ldots ,p}\) to evaluate \(\hat{\boldsymbol{\beta}}\). In practice, these equations are usually non-linear and have no analytic solution. Therefore, we rely on numerical methods to solve them.

First, we note that the Hessian and Fisher information matrices can be derived directly from Equation 5.7. \[ [\boldsymbol{H}(\boldsymbol{\beta})]_{jk}={{\partial^2}\over{\partial\beta_j\partial\beta_k}}\ell(\boldsymbol{\beta},\boldsymbol{\phi}) ={\partial\over{\partial\beta_j}}u_k(\boldsymbol{\beta}). \] Therefore \[\begin{align*} [\boldsymbol{H}(\boldsymbol{\beta})]_{jk} &={\partial\over{\partial\beta_j}}\sum_{i=1}^n{{y_i-\mu_i}\over{\text{Var}(Y_i)}} {{x_{ik}}\over{g'(\mu_i)}}\cr &=\sum_{i=1}^n{{-{{\partial\mu_i}\over{\partial\beta_j}}}\over{\text{Var}(Y_i)}} {{x_{ik}}\over{g'(\mu_i)}} +\sum_{i=1}^n(y_i-\mu_i){\partial\over{\partial\beta_j}} \left[{{x_{ik}}\over{\text{Var}(Y_i) g'(\mu_i)}}\right] \end{align*}\] and \[\begin{align*} [{\cal I}(\boldsymbol{\beta})]_{jk} &=\sum_{i=1}^n{{{{\partial\mu_i}\over{\partial\beta_j}}}\over{\text{Var}(Y_i)}} {{x_{ik}}\over{g'(\mu_i)}} -\sum_{i=1}^n(E[Y_i]-\mu_i){\partial\over{\partial\beta_j}} \left[{{x_{ik}}\over{\text{Var}(Y_i) g'(\mu_i)}}\right]\cr &=\sum_{i=1}^n{{{{\partial\mu_i}\over{\partial\beta_j}}}\over{\text{Var}(Y_i)}} {{x_{ik}}\over{g'(\mu_i)}}\cr &=\sum_{i=1}^n{{x_{ij}x_{ik}}\over{\text{Var}(Y_i)g'(\mu_i)^2}}. \end{align*}\]

Hence we can write \[ {\cal I}(\boldsymbol{\beta})=\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X} \tag{5.8}\]

where \[ \boldsymbol{X}=\begin{pmatrix} \boldsymbol{x}_1^T\cr\vdots\cr \boldsymbol{x}_n^T \end{pmatrix} =\begin{pmatrix} x_{11}&\cdots&x_{1p}\cr\vdots&\ddots&\vdots\cr x_{n1}&\cdots&x_{np} \end{pmatrix}, \] \[ \boldsymbol{W}=\text{diag}(\boldsymbol{w})= \begin{pmatrix} w_1&0&\cdots&0\cr 0&w_2&&\vdots\cr \vdots&&\ddots&0\cr 0&\cdots&0&w_n \end{pmatrix} \] and \[ w_i={1\over{\text{Var}(Y_i)g'(\mu_i)^2}},\quad i = 1, \ldots, n. \] The Fisher information matrix \(\mathcal{I}(\boldsymbol{\beta})\) depends on \(\boldsymbol{\beta}\) through \(\boldsymbol{\mu}\) and \(\text{Var}(Y_i),\;i = 1, \ldots, n\).

We notice that the score in Equation 5.7 may now be written as \[u_k(\boldsymbol{\beta})=\sum_{i=1}^n(y_i-\mu_i)x_{ik}w_ig'(\mu_i) =\sum_{i=1}^n x_{ik}w_iz_i,\quad{k=1,\ldots ,p},\] where \[ z_i=(y_i-\mu_i)g'(\mu_i),\quad i = 1, \ldots, n. \] Therefore

\[ \boldsymbol{u}(\boldsymbol{\beta})=\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{z}. \tag{5.9}\]

One possible method to solve the \(p\) simultaneous equations \({\boldsymbol{u}}(\hat{\boldsymbol{\beta}})={\boldsymbol 0}\), which yield \(\hat{\boldsymbol{\beta}}\), is the (multivariate) Newton-Raphson method.

If \(\boldsymbol{\beta}^{(m)}\) is the current estimate of \(\hat{\boldsymbol{\beta}}\) then the next estimate is

\[ \boldsymbol{\beta}^{(m+1)}=\boldsymbol{\beta}^{(m)}-\boldsymbol{H}(\boldsymbol{\beta}^{(m)})^{-1}\boldsymbol{u}(\boldsymbol{\beta}^{(m)}). \tag{5.10}\]

In practice, an alternative to Newton-Raphson replaces \(\boldsymbol{H}(\boldsymbol{\beta})\) in Equation 5.10 with \(E[\boldsymbol{H}(\boldsymbol{\beta})]\equiv-\mathcal{I}(\boldsymbol{\beta})\). Therefore, if \(\boldsymbol{\beta}^{(m)}\) is the current estimate of \(\hat{\boldsymbol{\beta}}\) then the next estimate is

\[ \boldsymbol{\beta}^{(m+1)}=\boldsymbol{\beta}^{(m)}+{\cal I}(\boldsymbol{\beta}^{(m)})^{-1}\boldsymbol{u}(\boldsymbol{\beta}^{(m)}). \tag{5.11}\]

The resulting iterative algorithm is called Fisher scoring. Notice that if we substitute Equation 5.8 and Equation 5.9 into Equation 5.11 we get \[\begin{align*} \boldsymbol{\beta}^{(m+1)}&=\boldsymbol{\beta}^{(m)}+[\boldsymbol{X}^T\boldsymbol{W}^{(m)}\boldsymbol{X}]^{-1}\boldsymbol{X}^T\boldsymbol{W}^{(m)}\boldsymbol{z}^{(m)}\cr &=[\boldsymbol{X}^T\boldsymbol{W}^{(m)}\boldsymbol{X}]^{-1}[\boldsymbol{X}^T\boldsymbol{W}^{(m)}\boldsymbol{X}\boldsymbol{\beta}^{(m)}+\boldsymbol{X}^T\boldsymbol{W}^{(m)}\boldsymbol{z}^{(m)}]\cr &=[\boldsymbol{X}^T\boldsymbol{W}^{(m)}\boldsymbol{X}]^{-1}\boldsymbol{X}^T\boldsymbol{W}^{(m)}[\boldsymbol{X}\boldsymbol{\beta}^{(m)}+\boldsymbol{z}^{(m)}]\cr &=[\boldsymbol{X}^T\boldsymbol{W}^{(m)}\boldsymbol{X}]^{-1}\boldsymbol{X}^T\boldsymbol{W}^{(m)}[\boldsymbol{\eta}^{(m)}+\boldsymbol{z}^{(m)}], \end{align*}\] where \(\boldsymbol{\eta}^{(m)},\,\boldsymbol{W}^{(m)}\) and \(\boldsymbol{z}^{(m)}\) are all functions of \(\boldsymbol{\beta}^{(m)}\).

Note that this is a weighted least squares equation, that is \(\boldsymbol{\beta}^{(m+1)}\) minimises the weighted sum of squares \[ (\boldsymbol{\eta}+\boldsymbol{z}-\boldsymbol{X}\boldsymbol{\beta})^T\boldsymbol{W}(\boldsymbol{\eta}+\boldsymbol{z}-\boldsymbol{X}\boldsymbol{\beta})= \sum_{i=1}^n w_i\left(\eta_i+z_i-\boldsymbol{x}_i^T\boldsymbol{\beta}\right)^2 \] as a function of \(\boldsymbol{\beta}\) where \(w_1,\ldots ,w_n\) are the weights and \(\boldsymbol{\eta}+\boldsymbol{z}\) is called the adjusted dependent variable. Therefore, the Fisher scoring algorithm proceeds as follows.

  1. Set \(m=0\) and choose an initial estimate \(\boldsymbol{\beta}^{(m)}\) for \(\hat{\boldsymbol{\beta}}\).
  2. Evaluate \(\boldsymbol{\eta}^{(m)},\,\boldsymbol{W}^{(m)}\) and \(\boldsymbol{z}^{(m)}\) at \(\boldsymbol{\beta}^{(m)}\).
  3. Calculate \[\boldsymbol{\beta}^{(m+1)} =[\boldsymbol{X}^T\boldsymbol{W}^{(m)}\boldsymbol{X}]^{-1}\boldsymbol{X}^T\boldsymbol{W}^{(m)}[\boldsymbol{\eta}^{(m)}+\boldsymbol{z}^{(m)}].\]
  4. If \(||\boldsymbol{\beta}^{(m+1)}-\boldsymbol{\beta}^{(m)} ||> \epsilon\) for some prespecified (small) tolerance \(\epsilon\), then set \(m\to m+1\) and go to 2.
  5. Use \(\boldsymbol{\beta}^{(m+1)}\) as the solution for \(\hat{\boldsymbol{\beta}}\).

As this algorithm involves iteratively minimising a weighted sum of squares, it is sometimes known as iteratively (re)weighted least squares.
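
A minimal sketch of this algorithm for a Poisson GLM with the canonical log link is given below (simulated data; for this family and link, \(w_i=\mu_i\) and \(z_i=(y_i-\mu_i)/\mu_i\)). It is intended to illustrate the update in step 3, not to replace a production GLM fitter.

```python
# A minimal sketch of Fisher scoring / IRLS for a Poisson GLM with the
# canonical log link (simulated data, not from the notes). For this family
# and link: mu_i = exp(eta_i), Var(Y_i) = mu_i, g'(mu_i) = 1/mu_i, so
# w_i = mu_i and z_i = (y_i - mu_i) / mu_i.
import numpy as np

def irls_poisson(X, y, tol=1e-8, max_iter=50):
    n, p = X.shape
    beta = np.zeros(p)                              # initial estimate beta^(0)
    for _ in range(max_iter):
        eta = X @ beta                              # linear predictor
        mu = np.exp(eta)                            # inverse link
        W = np.diag(mu)                             # weights w_i = mu_i
        z = (y - mu) / mu                           # z_i = (y_i - mu_i) g'(mu_i)
        # weighted least squares step on the adjusted dependent variable eta + z
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ (eta + z))
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([1.0, 0.5])))
print(irls_poisson(X, y))   # should be close to (1.0, 0.5)
```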

Notes

  1. Recall that the canonical link function is \(g(\mu)=(b')^{-1}(\mu)\) and with this link \(\eta_i=g(\mu_i)=\theta_i\). Then \[ {1\over{g'(\mu_i)}}={{\partial\mu_i}\over{\partial\eta_i}} ={{\partial\mu_i}\over{\partial\theta_i}}=b''(\theta_i),\quad i = 1, \ldots, n. \] Therefore \(\text{Var}(Y_i)g'(\mu_i)=a(\phi_i)\), which does not depend on \(\boldsymbol{\beta}\), and hence \[ {\partial\over{\partial\beta_j}}\left[{{x_{ik}}\over{\text{Var}(Y_i)g'(\mu_i)}}\right]=0 \] for all \(j=1,\ldots ,p\). It follows that \(\boldsymbol{H}(\boldsymbol{\beta})=-\mathcal{I}(\boldsymbol{\beta})\) and, for the canonical link, Newton-Raphson and Fisher scoring are equivalent.
  2. The linear model is a generalised linear model with identity link, \(\eta_i=g(\mu_i)=\mu_i\) and \(\text{Var}(Y_i)=\sigma^2\) for all \(i = 1, \ldots, n\). Therefore \(w_i=[\text{Var}(Y_i)g'(\mu_i)^2]^{-1}=\sigma^{-2}\) and \(z_i=(y_i-\mu_i)g'(\mu_i)=y_i-\eta_i\) for \(i = 1, \ldots, n\). Hence \(\boldsymbol{z}+\boldsymbol{\eta}=\boldsymbol{y}\) and \(\boldsymbol{W}=\sigma^{-2}\boldsymbol{I}\), neither of which depends on \(\boldsymbol{\beta}\). So the Fisher scoring algorithm converges in a single iteration to the usual least squares estimate.
  3. Estimation of an unknown scale parameter \(\sigma^2\) is discussed later. A common (to all \(i\)) \(\sigma^2\) has no effect on \(\hat{\boldsymbol{\beta}}\).

5.6 Confidence intervals

Recall from Section 2.6 that the maximum likelihood estimator \(\hat{\boldsymbol{\beta}}\) is asymptotically normally distributed with mean \(\boldsymbol{\beta}\) (it is asymptotically unbiased) and variance-covariance matrix \({\cal I}(\boldsymbol{\beta})^{-1}\). For ‘large enough \(n\)’, we treat this distribution as an approximation.

Therefore, standard errors (estimated standard deviations) are given by \[ s.e.(\hat{\beta}_i)=[{\cal I}(\hat{\boldsymbol{\beta}})^{-1}]_{ii}^{{1\over 2}} =[(\boldsymbol{X}^T\hat{\boldsymbol{W}}\boldsymbol{X})^{-1}]_{ii}^{{1\over 2}}, \qquad i=1,\ldots ,p, \] where the diagonal matrix \(\hat{\boldsymbol{W}}=\text{diag}(\hat{\boldsymbol{w}})\) is evaluated at \(\hat{\boldsymbol{\beta}}\), that is \(\hat{w}_i=(\hat{\text{Var}}(Y_i)g'(\hat{\mu}_i)^2)^{-1}\), where \(\hat{\mu}_i\) and \(\hat{\text{Var}}(Y_i)\) are evaluated at \(\hat{\boldsymbol{\beta}}\) for \(i = 1, \ldots, n\). Furthermore, if \(\text{Var}(Y_i)\) depends on an unknown scale parameter, then this too must be estimated in order to compute the standard error.

The asymptotic distribution of the maximum likelihood estimator can be used to provide approximate large sample confidence intervals. For given \(\alpha\) we can find \(z_{1-\frac{\alpha}{2}}\) such that \[ P\left(-z_{1-\frac{\alpha}{2}}\le {{\hat{\beta}_i-\beta_i}\over{[\mathcal{I}(\boldsymbol{\beta})^{-1}]_{ii}^{1\over 2}}}\le z_{1-\frac{\alpha}{2}}\right) =1-\alpha. \] Therefore \[ P\left(\hat{\beta}_i-z_{1-\frac{\alpha}{2}}[\mathcal{I}(\boldsymbol{\beta})^{-1}]_{ii}^{1\over 2}\le\beta_i \le\hat{\beta}_i+z_{1-\frac{\alpha}{2}}[\mathcal{I}(\boldsymbol{\beta})^{-1}]_{ii}^{1\over 2} \right) =1-\alpha. \] The endpoints of this interval cannot be evaluated because they also depend on the unknown parameter vector \(\boldsymbol{\beta}\). However, if we replace \({\cal I}(\boldsymbol{\beta})\) by its MLE \({\cal I}(\hat{\boldsymbol{\beta}})\) we obtain the approximate large sample 100\((1-\alpha)\)% confidence interval \[ [\hat{\beta}_i-s.e.(\hat{\beta}_i)z_{1-\frac{\alpha}{2}}\,,\, \hat{\beta}_i+s.e.(\hat{\beta}_i)z_{1-\frac{\alpha}{2}}]. \] For \(\alpha=0.10,0.05,0.01\), \(z_{1-\frac{\alpha}{2}}=1.64,1.96,2.58\), respectively.
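
Continuing the Poisson/log-link sketch above, the following (illustrative) function computes the estimated Fisher information \(\boldsymbol{X}^T\hat{\boldsymbol{W}}\boldsymbol{X}\), the standard errors and approximate 95% Wald intervals.

```python
# Illustrative continuation of the IRLS sketch above (not from the notes):
# large-sample standard errors and Wald intervals for a Poisson GLM with the
# canonical log link, where w_i = mu_i.
import numpy as np

def wald_intervals(X, beta_hat, z=1.96):
    mu_hat = np.exp(X @ beta_hat)               # fitted means under the log link
    info = X.T @ (mu_hat[:, None] * X)          # X^T W_hat X with W_hat = diag(mu_hat)
    se = np.sqrt(np.diag(np.linalg.inv(info)))  # standard errors
    lower = beta_hat - z * se
    upper = beta_hat + z * se
    return se, np.column_stack([lower, upper])

# e.g., with X and the IRLS estimate from the previous sketch:
# se, ci = wald_intervals(X, irls_poisson(X, y))
```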

5.7 Comparing generalised linear models

5.7.1 The likelihood ratio test

If we have a set of competing generalised linear models which might explain the dependence of the response on the explanatory variables, we will want to determine which of the models is most appropriate. Recall that we have three main requirements of a statistical model; plausibility, parsimony and goodness of fit, of which parsimony and goodness of fit are statistical issues.

As with linear models, we proceed by comparing models pairwise using a likelihood ratio test. This kind of comparison is restricted to situations where one of the models, \(H_0\), is nested in the other, \(H_1\). Then the asymptotic distribution of the log likelihood ratio statistic under \(H_0\) is a chi-squared distribution with known degrees of freedom.

For generalised linear models, ‘nested’ means that \(H_0\) and \(H_1\)

  1. are based on the same exponential family distribution,
  2. have the same link function, but
  3. differ in that the explanatory variables present in \(H_0\) are a subset of those present in \(H_1\).

We will assume that model \(H_1\) contains \(p\) linear parameters and model \(H_0\) a subset of \(q<p\) of these. Without loss of generality, we can think of \(H_1\) as the model \[ \eta_i=\sum_{j=1}^p x_{ij} \beta_j \qquad i = 1, \ldots, n \] and \(H_0\) as the same model with \(\beta_{q+1}=\beta_{q+2}=\cdots=\beta_p=0\).

Then model \(H_0\) is a special case of model \(H_1\), where certain coefficients are set equal to zero, and therefore \(\Theta^{(0)}\), the set of values of the canonical parameter \(\boldsymbol{\theta}\) allowed by \(H_0\), is a subset of \(\Theta^{(1)}\), the set of values allowed by \(H_1\).

Now, the log likelihood ratio statistic for a test of \(H_0\) against \(H_1\) is

\[ \begin{aligned} L_{01}&\equiv 2\log \left({{\max_{\boldsymbol{\theta}\in \Theta^{(1)}} L(\boldsymbol{\theta})}\over {\max_{\boldsymbol{\theta}\in \Theta^{(0)}}L(\boldsymbol{\theta})}}\right)\cr &=2\log L(\hat{\boldsymbol{\theta}}^{(1)})-2\log L(\hat{\boldsymbol{\theta}}^{(0)}), \end{aligned} \tag{5.12}\]

where \(\hat{\boldsymbol{\theta}}^{(1)}\) and \(\hat{\boldsymbol{\theta}}^{(0)}\) follow from \(b'(\hat{\theta}_i)=\hat{\mu}_i\) and \(g(\hat{\mu}_i)=\hat{\eta}_i\), \(i = 1, \ldots, n\), where \(\hat{\boldsymbol{\eta}}\) for each model is the linear predictor evaluated at the corresponding maximum likelihood estimate for \(\boldsymbol{\beta}\). Here, we assume that \(a(\phi_i),\;i = 1, \ldots, n\) are known; unknown \(a(\phi)\) is discussed in Section 5.9.

Recall that we reject \(H_0\) in favour of \(H_1\) when \(L_{01}\) is ‘too large’ (the observed data are much more probable under \(H_1\) than \(H_0\)). To determine a threshold value \(k\) for \(L_{01}\), beyond which we reject \(H_0\), we set the size of the test \(\alpha\) and use the result of Section 2.8.2 that, because \(H_0\) is nested in \(H_1\), \(L_{01}\) has an asymptotic chi-squared distribution with \(p-q\) degrees of freedom. For example, if \(\alpha=0.05\), we reject \(H_0\) in favour of \(H_1\) when \(L_{01}\) is greater than the 95% point of the \(\chi^2_{p-q}\) distribution.

Note that setting up our model selection procedure in this way is consistent with our desire for parsimony. The simpler model is \(H_0\), and we do not reject \(H_0\) in favour of the more complex model \(H_1\) unless the data provide convincing evidence for \(H_1\) over \(H_0\), that is unless \(H_1\) fits the data significantly better.
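
A small sketch of this test is given below (the log-likelihood values in the example call are placeholders): it computes \(L_{01}\) from the maximised log-likelihoods of the two nested models and compares it with the \(\chi^2_{p-q}\) distribution.

```python
# A sketch of the likelihood ratio test for nested GLMs with known a(phi):
# H0 (q parameters) is nested in H1 (p parameters), and we reject H0 when
# L01 exceeds the upper-alpha point of chi-squared with p - q degrees of
# freedom. The log-likelihood values in the example call are placeholders.
from scipy.stats import chi2

def likelihood_ratio_test(loglik_0, loglik_1, q, p, alpha=0.05):
    L01 = 2.0 * (loglik_1 - loglik_0)        # log likelihood ratio statistic
    df = p - q
    critical = chi2.ppf(1 - alpha, df)       # reject H0 if L01 > critical
    p_value = chi2.sf(L01, df)
    return L01, critical, p_value

print(likelihood_ratio_test(loglik_0=-142.3, loglik_1=-138.1, q=2, p=4))
```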

5.8 Scaled deviance and the saturated model

Consider a model where \(\boldsymbol{\beta}\) is \(n\)-dimensional, and therefore \(\boldsymbol{\eta}=\boldsymbol{X}\boldsymbol{\beta}\). Assuming that \(\boldsymbol{X}\) is invertible, this model places no constraints on the linear predictor \(\boldsymbol{\eta}=(\eta_1,\ldots ,\eta_n)\), which can take any value in \(\mathbb{R}^n\). Correspondingly, the means \(\boldsymbol{\mu}\) and the canonical parameters \(\boldsymbol{\theta}\) are unconstrained. The model is of dimension \(n\) and can be parameterised equivalently using \(\boldsymbol{\beta}\), \(\boldsymbol{\eta}\), \(\boldsymbol{\mu}\) or \(\boldsymbol{\theta}\). Such a model is called the saturated model.

As the canonical parameters \(\boldsymbol{\theta}\) are unconstrained, we can calculate their maximum likelihood estimates \(\hat{\boldsymbol{\theta}}\) directly (without first having to calculate \(\hat{\boldsymbol{\beta}}\)) from the log-likelihood, which from Equation 5.2 is

\[ \ell(\boldsymbol{\theta})=\sum_{i=1}^n{{y_i\theta_i-b(\theta_i)} \over{a(\phi_i)}}+\sum_{i=1}^nc(y_i,\phi_i). \tag{5.13}\]

We obtain \(\hat{\boldsymbol{\theta}}\) by first differentiating with respect to \(\theta_1,\ldots ,\theta_n\) to give

\[ {\partial\over{\partial\theta_k}}\ell(\boldsymbol{\theta})={{y_k-b'(\theta_k)} \over{a(\phi_k)}}\qquad k=1,\ldots ,n. \]

Therefore \(b'(\hat{\theta}_k)=y_k,\;k=1,\ldots ,n\), and it follows immediately that \(\hat{\mu}_k=y_k,\;k=1,\ldots ,n\). Hence the saturated model fits the data perfectly, as the fitted values \(\hat{\mu}_k\) and observed values \(y_k\) are the same for every observation \(k=1,\ldots ,n\).

The saturated model is rarely of any scientific interest in its own right. It is highly parameterised, having as many parameters as there are observations. This goes against our desire for parsimony in a model. However, every other model is necessarily nested in the saturated model, and a test comparing a model \(H_0\) against the saturated model \(H_S\) can be interpreted as a goodness of fit test. If the saturated model, which fits the observed data perfectly, does not provide a significantly better fit than model \(H_0\), we can conclude that \(H_0\) is an acceptable fit to the data.

The log likelihood ratio statistic for a test of \(H_0\) against \(H_S\) is, from Equation 5.12

\[ L_{0s}=2\log L(\hat{\boldsymbol{\theta}}^{(s)})-2\log L(\hat{\boldsymbol{\theta}}^{(0)}), \]

where \(\hat{\boldsymbol{\theta}}^{(s)}\) follows from \(b'(\hat{\boldsymbol{\theta}})=\hat{\boldsymbol{\mu}}=\boldsymbol{y}\) and \(\hat{\boldsymbol{\theta}}^{(0)}\) is a function of the corresponding maximum likelihood estimate for \(\boldsymbol{\beta}=(\beta_1,\ldots ,\beta_q)^T\). Under \(H_0\), \(L_{0s}\) has an asymptotic chi-squared distribution with \(n-q\) degrees of freedom. Therefore, if \(L_{0s}\) is ‘too large’ (for example, larger than the 95% point of the \(\chi^2_{n-q}\) distribution) then we reject \(H_0\) as a plausible model for the data, as it does not fit the data adequately.

The degrees of freedom of model \(H_0\) are defined to be the degrees of freedom for this test, \(n-q\), the number of observations minus the number of linear parameters of \(H_0\). We call \(L_{0s}\) the scaled deviance (R calls it the residual deviance) of model \(H_0\).

From Equation 5.12 and Equation 5.13 we can write the scaled deviance of model \(H_0\) as

\[ L_{0s}=2\sum_{i=1}^n{{y_i[\hat{\theta}^{(s)}_i-\hat{\theta}^{(0)}_i] -[b(\hat{\theta}^{(s)}_i)-b(\hat{\theta}^{(0)}_i)]} \over{a(\phi_i)}}, \tag{5.14}\]

which can be calculated using the observed data, provided that \(a(\phi_i),\; i = 1, \ldots, n\) is known.

Notes

  1. The log likelihood ratio statistic Equation 5.12 for testing \(H_0\) against a non-saturated alternative \(H_1\) can be written as

\[ \begin{aligned} L_{01}&=2\log L(\hat{\boldsymbol{\theta}}^{(1)})-2\log L(\hat{\boldsymbol{\theta}}^{(0)})\cr &=[2\log L(\hat{\boldsymbol{\theta}}^{(s)})-2\log L(\hat{\boldsymbol{\theta}}^{(0)})] -[2\log L(\hat{\boldsymbol{\theta}}^{(s)})-2\log L(\hat{\boldsymbol{\theta}}^{(1)})]\cr &=L_{0s}-L_{1s}. \end{aligned} \tag{5.15}\]

Therefore the log likelihood ratio statistic for comparing two nested models is the difference of their deviances. Furthermore, as \(p-q=(n-q)-(n-p)\), the degrees of freedom for the test is the difference in degrees of freedom of the two models.

  2. The asymptotic theory used to derive the distribution of the log likelihood ratio statistic under \(H_0\) does not really apply to the goodness of fit test (comparison with the saturated model). However, for binomial or Poisson data, we can proceed as long as the relevant binomial or Poisson distributions are reasonably approximated by normal distributions (i.e. for binomials with large denominators or Poissons with large means). For Bernoulli data, on the other hand, we cannot use the scaled deviance as a goodness of fit statistic in this way.
  3. An alternative goodness of fit statistic for a model \(H_0\) is Pearson’s \(X^2\), given by \[ X^2=\sum_{i=1}^n {{(y_i-\hat{\mu}_i^{(0)})^2}\over{\hat{\text{Var}}(Y_i)}}. \tag{5.16}\] \(X^2\) is small when the squared differences between observed and fitted values (scaled by variance) are small. Hence, large values of \(X^2\) correspond to poorly fitting models. In fact, \(X^2\) and \(L_{0s}\) are asymptotically equivalent and, under \(H_0\), \(X^2\), like \(L_{0s}\), has an asymptotic chi-squared distribution with \(n-q\) degrees of freedom. However, the asymptotics associated with \(X^2\) are often more reliable for small samples, so if there is a discrepancy between \(X^2\) and \(L_{0s}\), it is usually safer to base a test of goodness of fit on \(X^2\).
  4. Although the deviance for a model is expressed in Equation 5.14 in terms of the maximum likelihood estimates of the canonical parameters, it is more usual to express it in terms of the maximum likelihood estimates \(\hat{\mu}_i,\; i = 1, \ldots, n\) of the mean parameters. For the saturated model, these are just the observed values \(y_i,\;i = 1, \ldots, n\), and for the model of interest, \(H_0\), we call them the fitted values. Hence, for a particular generalised linear model, the scaled deviance function describes how discrepancies between the observed and fitted values are penalised.

Example 5.5 (Poisson) Suppose \(Y_i\sim \text{Poisson}(\lambda_i),\;i = 1, \ldots, n\). Recall from Section 5.2 that \(\theta=\log\lambda\), \(b(\theta)=\exp\theta\), \(\mu=b'(\theta)=\exp\theta\) and \(\text{Var}(Y)=a(\phi)V(\mu)=1\cdot\mu\). Therefore, by Equation 5.14 and Equation 5.16 \[\begin{align*} L_{0s}&=2\sum_{i=1}^n y_i[\log\hat{\mu}^{(s)}_i-\log\hat{\mu}^{(0)}_i] -[\hat{\mu}^{(s)}_i-\hat{\mu}^{(0)}_i]\cr &=2\sum_{i=1}^n y_i\log \left({{y_i}\over{\hat{\mu}^{(0)}_i}}\right) -y_i+\hat{\mu}^{(0)}_i \end{align*}\] and \[ X^2=\sum_{i=1}^n {{(y_i-\hat{\mu}_i^{(0)})^2}\over{\hat{\mu}_i^{(0)}}}. \]
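
The sketch below (with placeholder fitted values) evaluates these two Poisson statistics and the corresponding goodness-of-fit p-values against \(\chi^2_{n-q}\); the convention \(0\log 0=0\) is used for observations with \(y_i=0\).

```python
# A sketch of the Poisson scaled deviance and Pearson X^2 of Example 5.5,
# with goodness-of-fit p-values from chi-squared on n - q degrees of freedom.
# The data, fitted values and q are placeholders.
import numpy as np
from scipy.stats import chi2

def poisson_deviance(y, mu):
    term = np.zeros_like(mu, dtype=float)
    pos = y > 0
    term[pos] = y[pos] * np.log(y[pos] / mu[pos])   # 0*log(0) taken as 0
    return 2.0 * np.sum(term - y + mu)

def pearson_x2(y, mu):
    return np.sum((y - mu) ** 2 / mu)

y = np.array([2, 0, 5, 3, 7, 1])
mu_hat = np.array([1.8, 0.9, 4.2, 3.5, 6.1, 1.5])   # fitted values under H0
q = 2                                               # number of parameters in H0
for stat in (poisson_deviance(y, mu_hat), pearson_x2(y, mu_hat)):
    print(stat, chi2.sf(stat, df=len(y) - q))
```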

Example 5.6 (Binomial) Suppose \(n_iY_i\sim\) Binomial\((n_i,p_i),\;i = 1, \ldots, n\). Recall from Section 5.2 that \(\theta=\log{p\over{1-p}}\), \(b(\theta)=\log(1+\exp\theta)\), \(\mu=b'(\theta)={{\exp\theta}\over{1+\exp\theta}}\) and \(\text{Var}(Y)=a(\phi)V(\mu)={1\over n}\cdot\mu(1-\mu)\). Therefore, by Equation 5.14 and Equation 5.16 \[\begin{align*} L_{0s}&=2\sum_{i=1}^n n_iy_i\left[\log{\hat{\mu}^{(s)}_i\over{1-\hat{\mu}^{(s)}_i}} -\log{\hat{\mu}^{(0)}_i\over{1-\hat{\mu}^{(0)}_i}}\right] + 2\sum_{i=1}^n n_i \left[\log(1-\hat{\mu}^{(s)}_i)-\log(1-\hat{\mu}^{(0)}_i) \right]\cr &=2\sum_{i=1}^n \left[ n_iy_i\log \left({{y_i}\over{\hat{\mu}^{(0)}_i}}\right) +n_i(1-y_i) \log \left({{1-y_i}\over{1-\hat{\mu}^{(0)}_i}}\right) \right] \end{align*}\] and \[ X^2=\sum_{i=1}^n {{n_i(y_i-\hat{\mu}_i^{(0)})^2}\over{\hat{\mu}_i^{(0)} (1-\hat{\mu}^{(0)}_i)}}. \] Bernoulli data are binomial with \(n_i=1,\;i = 1, \ldots, n\).

5.9 Models with unknown \(a(\phi)\)

The theory of Section 5.7 has assumed that \(a(\phi)\) is known. This is the case for both the Poisson distribution (\(a(\phi)=1\)) and the binomial distribution (\(a(\phi)=1/n\)). Neither the scaled deviance Equation 5.14 nor the Pearson \(X^2\) statistic Equation 5.16 can be evaluated unless \(a(\phi)\) is known. Therefore, when \(a(\phi)\) is not known, we cannot use the scaled deviance as a measure of goodness of fit, or to compare models using Equation 5.15. For such models, there is no equivalent goodness of fit test, but we can develop a test for comparing nested models.

Here we assume that \(a(\phi_i)=\sigma^2/m_i,\;i = 1, \ldots, n\) where \(\sigma^2\) is a common unknown scale parameter and \(m_1,\ldots ,m_n\) are known weights. (A linear model takes this form, as \(\text{Var}(Y_i)=\sigma^2,\;i = 1, \ldots, n\), so \(m_i=1,\;i = 1, \ldots, n\).) Under this assumption

\[ L_{0s}={2\over\sigma^2}\sum_{i=1}^n\left\{m_iy_i[\hat{\theta}^{(s)}_i-\hat{\theta}^{(0)}_i] -m_i[b(\hat{\theta}^{(s)}_i)-b(\hat{\theta}^{(0)}_i)]\right\} ={1\over\sigma^2}D_{0s}, \tag{5.17}\]

where \(D_{0s}\) is defined to be twice the sum above, which can be calculated using the observed data. We call \(D_{0s}\) the deviance of the model.

In order to test nested models \(H_0\) and \(H_1\) as set up in Section 5.7.1, we calculate the test statistic

\[ \begin{aligned} F&={{L_{01}/(p-q)}\over{L_{1s}/(n-p)}}={{(L_{0s}-L_{1s})/(p-q)}\over{L_{1s}/(n-p)}}\cr &={{\left({1\over\sigma^2}D_{0s}-{1\over\sigma^2}D_{1s}\right)/(p-q)} \over{{1\over\sigma^2}D_{1s}/(n-p)}} ={{(D_{0s}-D_{1s})/(p-q)}\over{D_{1s}/(n-p)}}. \end{aligned} \tag{5.18}\]

This statistic does not depend on the unknown scale parameter \(\sigma^2\), so can be calculated using the observed data. Asymptotically, if \(H_0\) is true, we know that \(L_{01} \sim \chi^2_{p-q}\) and \(L_{1s} \sim \chi^2_{n-p}\). Furthermore, \(L_{01}\) and \(L_{1s}\) are independent (not proved here) so \(F\) has an asymptotic \(F_{p-q, n-p}\) distribution. Hence, we compare nested generalised linear models by calculating \(F\) and rejecting \(H_0\) in favour of \(H_1\) if \(F\) is too large (for example, greater than the 95% point of the relevant F distribution).
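
A sketch of this F test (with placeholder deviances) is given below; because it is based on the unscaled deviances \(D_{0s}\) and \(D_{1s}\), the unknown \(\sigma^2\) cancels.

```python
# A sketch of the F test of Equation 5.18 for nested models when the scale
# parameter sigma^2 is unknown: it uses the unscaled deviances, so sigma^2
# cancels. The deviance values in the example call are placeholders.
from scipy.stats import f as f_dist

def glm_f_test(D0, D1, q, p, n, alpha=0.05):
    F = ((D0 - D1) / (p - q)) / (D1 / (n - p))
    critical = f_dist.ppf(1 - alpha, p - q, n - p)   # reject H0 if F > critical
    p_value = f_dist.sf(F, p - q, n - p)
    return F, critical, p_value

print(glm_f_test(D0=48.6, D1=31.2, q=2, p=4, n=30))
```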

The dependence of the maximum likelihood equations \(\boldsymbol{u}(\hat{\boldsymbol{\beta}})={\boldsymbol 0}\) on \(\sigma^2\) (where \(\boldsymbol{u}\) is given by Equation 5.7) can be eliminated by multiplying through by \(\sigma^2\). However, inference based on the maximum likelihood estimates, as described in Section 5.6, does require knowledge of \(\sigma^2\). This is because asymptotically \(\text{Var}(\hat{\boldsymbol{\beta}})\) is the inverse of the Fisher information matrix \({\cal I}(\boldsymbol{\beta})=\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X}\), and this depends on \(w_i={1\over{\text{Var}(Y_i)g'(\mu_i)^2}},\) where \(\text{Var}(Y_i)=a(\phi_i)b''(\theta_i)=\sigma^2 b''(\theta_i)/m_i\) here.

Therefore, to calculate standard errors and confidence intervals, we need to supply an estimate \(\hat \sigma^2\) of \(\sigma^2\). Generally, we do not use the maximum likelihood estimate. Instead, we notice that, from Equation 5.17, \(L_{0s}=D_{0s}/\sigma^2\), and we know that asymptotically, if model \(H_0\) is an adequate fit, \(L_{0s}\) has a \(\chi^2_{n-q}\) distribution. Hence

\[ E(L_{0s})=E\left({1\over{\sigma^2}}D_{0s}\right)=n-q\quad\Rightarrow\quad E\left({1\over{n-q}}D_{0s}\right)=\sigma^2. \]

Therefore the deviance of a model divided by its degrees of freedom is an asymptotically unbiased estimator of the scale parameter \(\sigma^2\). Hence \(\hat\sigma^2=D_{0s}/(n-q)\).

An alternative estimator of \(\sigma^2\) is based on the Pearson \(X^2\) statistic. As \(\text{Var}(Y)=a(\phi)V(\mu)=\sigma^2 V(\mu)/m\) here, then from Equation 5.16

\[ X^2={1\over\sigma^2} \sum_{i=1}^n {{m_i(y_i-\hat{\mu}_i^{(0)})^2}\over{{V}(\hat{\mu}_i^{(0)})}}. \tag{5.19}\]

Again, if \(H_0\) is an adequate fit, \(X^2\) has an asymptotic chi-squared distribution with \(n-q\) degrees of freedom, so

\[ \hat \sigma^2={1\over{n-q}} \sum_{i=1}^n {{m_i(y_i-\hat{\mu}_i^{(0)})^2}\over{{V}(\hat{\mu}_i^{(0)})}} \]

is an alternative asymptotically unbiased estimator of \(\sigma^2\). This estimator tends to be more reliable in small samples.
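
The two scale estimates can be computed as in the sketch below (placeholder inputs; \(V\) denotes the variance function of the chosen family). For a normal GLM, both reduce to the residual sum of squares divided by \(n-q\), as Example 5.7 below confirms.

```python
# A sketch of the two scale estimates (placeholder inputs): the deviance-based
# estimate D_0s / (n - q) and the Pearson-based estimate X^2_unscaled / (n - q),
# where V is the variance function and m the known weights.
import numpy as np

def sigma2_from_deviance(D0, n, q):
    return D0 / (n - q)

def sigma2_from_pearson(y, mu_hat, m, V, q):
    X2_unscaled = np.sum(m * (y - mu_hat) ** 2 / V(mu_hat))
    return X2_unscaled / (len(y) - q)

# Normal GLM illustration: V(mu) = 1 and m_i = 1, so this is RSS / (n - q).
y = np.array([1.2, 0.7, 2.4, 1.9])
mu_hat = np.array([1.0, 1.0, 2.0, 2.0])
print(sigma2_from_pearson(y, mu_hat, m=1.0, V=lambda mu: np.ones_like(mu), q=2))
print(sigma2_from_deviance(D0=np.sum((y - mu_hat) ** 2), n=len(y), q=2))
```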

Example 5.7 (Normal) Suppose \(Y_i\sim N(\mu_i,\sigma^2),\;i = 1, \ldots, n\). Recall from Section 5.2 that \(\theta=\mu\), \(b(\theta)=\theta^2/2\), \(\mu=b'(\theta)=\theta\) and \(\text{Var}(Y)=a(\phi)V(\mu)={\sigma^2}\cdot 1\), so \(m_i=1,\;i = 1, \ldots, n\). Therefore, by Equation 5.17,

\[ D_{0s}=2\sum_{i=1}^n \left\{y_i[\hat{\mu}^{(s)}_i-\hat{\mu}^{(0)}_i] -\tfrac{1}{2}\left[(\hat{\mu}^{(s)}_i)^2-(\hat{\mu}^{(0)}_i)^2\right]\right\} =\sum_{i=1}^n [y_i-\hat{\mu}^{(0)}_i]^2, \tag{5.20}\]

which is just the residual sum of squares for model \(H_0\). Therefore, we estimate \(\sigma^2\) for a normal GLM by the residual sum of squares of the model divided by its degrees of freedom. From Equation 5.19, the estimate for \(\sigma^2\) based on \(X^2\) is identical.

5.10 Residuals

Recall that for linear models, we define the residuals to be the differences between the observed and fitted values \(y_i-\hat{\mu}^{(0)}_i,\;i = 1, \ldots, n\). From Equation 5.20 we notice that both the scaled deviance and Pearson \(X^2\) statistic for a normal GLM are the sum of the squared residuals divided by \(\sigma^2\). We can generalise this to define residuals for other generalised linear models in a natural way.

For any GLM we define the Pearson residuals to be \[ r^P_i={{y_i-\hat{\mu}_i^{(0)}}\over{\hat{\text{Var}}(Y_i)^{1\over 2}}}\qquad i = 1, \ldots, n. \] Then, from Equation 5.16, \(X^2\) is the sum of the squared Pearson residuals.

For any GLM we define the deviance residuals to be \[r^D_i=\text{sign}(y_i-\hat{\mu}_i^{(0)}) \left[ 2 {{y_i[\hat{\theta}^{(s)}_i-\hat{\theta}^{(0)}_i] -[b(\hat{\theta}^{(s)}_i)-b(\hat{\theta}^{(0)}_i)]} \over{a(\phi_i)}}\right]^{1\over 2}, \quad i = 1, \ldots, n,\] where \(\text{sign}(x)=1\) if \(x>0\) and \(-1\) if \(x<0\). Then, from Equation 5.14, the scaled deviance, \(L_{0s}\), is the sum of the squared deviance residuals.
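
For a Poisson GLM, for example, these residuals take the form sketched below (placeholder fitted values): the Pearson residual is \((y_i-\hat{\mu}_i)/\hat{\mu}_i^{1/2}\) and the squared deviance residual is \(2[y_i\log(y_i/\hat{\mu}_i)-y_i+\hat{\mu}_i]\), as in Example 5.5.

```python
# A sketch of Pearson and deviance residuals for a Poisson GLM (placeholder
# fitted values). Their sums of squares reproduce Pearson's X^2 and the
# scaled deviance respectively.
import numpy as np

def poisson_pearson_residuals(y, mu):
    return (y - mu) / np.sqrt(mu)

def poisson_deviance_residuals(y, mu):
    term = np.zeros_like(mu, dtype=float)
    pos = y > 0
    term[pos] = y[pos] * np.log(y[pos] / mu[pos])   # 0*log(0) taken as 0
    d2 = 2.0 * (term - y + mu)                      # squared deviance residuals
    return np.sign(y - mu) * np.sqrt(d2)

y = np.array([2, 0, 5, 3])
mu_hat = np.array([1.8, 0.9, 4.2, 3.5])
print(np.sum(poisson_pearson_residuals(y, mu_hat) ** 2))   # Pearson X^2
print(np.sum(poisson_deviance_residuals(y, mu_hat) ** 2))  # scaled deviance L_0s
```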

When \(a(\phi)=\sigma^2/m\) and \(\sigma^2\) is unknown, as in Section 5.9, the residuals are based on Equation 5.17 and Equation 5.19: the expressions above are multiplied through by \(\sigma\) (so that their squares are scaled by \(\sigma^2\)) to eliminate dependence on the unknown scale parameter. Therefore, for a normal GLM the Pearson and deviance residuals are both equal to the usual residuals, \(y_i-\hat{\mu}^{(0)}_i,\;i = 1, \ldots, n\).

Residual plots are most commonly of use in normal linear models, where they provide an essential check of the model assumptions. This kind of check is less important for a model without an unknown scale parameter as the scaled deviance provides a useful overall assessment of fit which takes into account most aspects of the model.

However, when data have been collected in serial order, a plot of the deviance or Pearson residuals against the order may again be used as a check for potential serial correlation.

Otherwise, residual plots are most useful when a model fails to fit (scaled deviance is too high). Then, examining the residuals may give an indication of the reason(s) for lack of fit. For example, there may be a small number of outlying observations.

A plot of deviance or Pearson residuals against the linear predictor should produce something that looks like a random scatter. If not, this may be due to an incorrect link function, the wrong scale for an explanatory variable, or perhaps a missing polynomial term in an explanatory variable.