\( \newcommand{\bm}[1]{\boldsymbol{\mathbf{#1}}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\corr}{corr} \newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\nindep}{\perp\!\!\!\perp\!\!\!\!\!\!/\;\;} \)

2.4 Comparing generalised linear models

It should be noted that this section describes just one method for comparing models. General principles and other methods will be discussed in detail in the APTS module itself.

As with linear models, we can proceed by comparing nested models \(H_0\) and \(H_1\) pairwise using a generalised likelihood ratio test, where ‘nested’ means that \(H_0\) and \(H_1\) are based on the same exponential family distribution, have the same link function, but \(\Theta^{(0)}\), the set of values of the canonical parameter \(\theta\) allowed by \(H_0\), is a subset of \(\Theta^{(1)}\), the set of values allowed by \(H_1\).

Without loss of generality, we can think of \(H_1\) as the model \[ \eta_i=\sum_{j=0}^p x_{ij} \beta_j \qquad i = 1, \ldots, n \] and \(H_0\) as the same model with \(\beta_{q+1}=\beta_{q+2}=\cdots=\beta_{p}=0.\)

Now, the log likelihood ratio statistic for a test of \(H_0\) against \(H_1\) is \[\begin{align} L_{01}&\equiv 2\log \left({{\max_{\theta\in \Theta^{(1)}}f_{Y}(y;\theta)}\over {\max_{\theta\in \Theta^{(0)}}f_{Y}(y;\theta)}}\right) \notag \\ &=2\log f_{Y}(y;\hat{\theta}^{(1)})-2\log f_{Y}(y;\hat{\theta}^{(0)}) \tag{2.9} \end{align}\] where \(\hat{\theta}^{(1)}\) and \(\hat{\theta}^{(0)}\) follow from \(b'(\hat{\theta}_i)=\hat{\mu}_i\), \(g(\hat{\mu}_i)=\hat{\eta}_i\), \(i = 1, \ldots, n\), where \(\hat{\eta}\) for each model is the linear predictor evaluated at the corresponding maximum likelihood estimate for \(\beta\). Here, we assume that \(\phi_i,\; i = 1, \ldots, n\) are known; unknown \(\phi\) is discussed in Section 2.5.

We reject \(H_0\) in favour of \(H_1\) when \(L_{01}>k\) where \(k\) is determined by \(\alpha\), the size of the test. Under \(H_0\), \(L_{01}\) has an asymptotic chi-squared distribution with \(p-q\) degrees of freedom.
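As an illustrative sketch (the data and group structure below are invented for the example), the test can be carried out directly in the Bernoulli case, where the maximum likelihood fitted probabilities under each hypothesis are simply sample proportions:

```python
import numpy as np
from scipy.stats import binom, chi2

# Hypothetical Bernoulli responses in two groups (phi_i = 1); invented data
y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1])
group = np.array([0] * 6 + [1] * 6)

ell = lambda p: binom.logpmf(y, 1, p).sum()  # Bernoulli log likelihood

# H0: common success probability; H1: a separate probability per group.
# The closed-form m.l.e.s are the overall and group-wise proportions.
p0 = np.full(12, y.mean())
p1 = np.where(group == 0, y[group == 0].mean(), y[group == 1].mean())

L01 = 2 * (ell(p1) - ell(p0))  # statistic (2.9)
k = chi2.ppf(0.95, df=1)       # critical value for size alpha = 0.05, p - q = 1
print(L01, k, L01 > k)         # here L01 < k, so H0 is not rejected
```

Here \(H_1\) has one extra parameter, so the statistic is referred to the \(\chi^2_1\) distribution.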

The saturated model is defined to be the model where the canonical parameters \(\theta\) (or equivalently \(\mu\) or \(\eta\)) are unconstrained, and the parameter space is \(n\)-dimensional. For the saturated model, we can calculate the m.l.e.s \(\hat{\theta}\) directly from the likelihood (2.1) by differentiating with respect to \(\theta_1,\ldots ,\theta_n\) to give \[ {\partial\over{\partial\theta_k}}\log f_{Y}(y;\theta)={{y_k-b'(\theta_k)} \over{\phi_k}}\qquad k=1,\ldots ,n. \] Setting these derivatives to zero gives \(b'(\hat{\theta}_k)=y_k,\;k=1,\ldots ,n\), and it follows immediately that \(\hat{\mu}_k=y_k,\;k=1,\ldots ,n\). Hence the saturated model fits the data perfectly, as the fitted values \(\hat{\mu}_k\) and observed values \(y_k\) are the same for every observation \(k=1,\ldots ,n\).
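As a concrete illustration, take the Poisson case, for which \(b(\theta)=e^{\theta}\) and \(\phi_k=1\). Then \[ b'(\hat{\theta}_k)=e^{\hat{\theta}_k}=y_k \quad\Longrightarrow\quad \hat{\theta}_k=\log y_k,\qquad \hat{\mu}_k=y_k, \] so each fitted mean under the saturated model equals the corresponding observation (for \(y_k>0\)).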

The saturated model is rarely of any scientific interest in its own right. It is highly parameterised, having as many parameters as there are observations. However, every other model is necessarily nested in the saturated model, and a test comparing a model \(H_0\) against the saturated model can be interpreted as a goodness of fit test. If the saturated model, which fits the observed data perfectly, does not provide a significantly better fit than model \(H_0\), we can conclude that \(H_0\) is an acceptable fit to the data.

The log likelihood ratio statistic for a test of \(H_0\) against the saturated alternative is, from (2.9), \[ L_{0}=2\log f_{Y}(y;\hat{\theta}^{(s)})-2\log f_{Y}(y;\hat{\theta}^{(0)}) \] where \(\hat{\theta}^{(s)}\) follows from \(b'(\hat{\theta})=\hat{\mu}=y\). However, calibrating \(L_0\) is not straightforward. In some circumstances (typically those where the response distribution might be adequately approximated by a normal) \(L_{0}\) has an asymptotic chi-squared distribution with \(n-q-1\) degrees of freedom, under \(H_0\). Therefore, if \(L_{0}\) is ‘too large’ then we reject \(H_0\) as a plausible model for the data, as it does not fit the data adequately. However, in other situations, for example, for Bernoulli data, this approximation breaks down.

The degrees of freedom of model \(H_0\) are defined to be the degrees of freedom for this test, \(n-q-1\), the number of observations minus the number of linear parameters of \(H_0\). We call \(L_{0}\) the scaled deviance of model \(H_0\).

From (2.3) and (2.9) we can write the scaled deviance of model \(H_0\) as \[\begin{equation} L_{0}=2\sum_{i=1}^n{{y_i[\hat{\theta}^{(s)}_i-\hat{\theta}^{(0)}_i] -[b(\hat{\theta}^{(s)}_i)-b(\hat{\theta}^{(0)}_i)]} \over{\phi_i}}, \tag{2.10} \end{equation}\] which can be calculated from the observed data, provided that \(\phi_i,\; i = 1, \ldots, n,\) are known.
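For instance, in the Poisson case \(\theta=\log\mu\), \(b(\theta)=e^{\theta}\) and \(\phi_i=1\), so (2.10) reduces to the familiar Poisson deviance \(2\sum_i [y_i\log(y_i/\hat{\mu}_i)-(y_i-\hat{\mu}_i)]\). A minimal numerical sketch (the data and fitted values below are invented):

```python
import numpy as np

# Hypothetical Poisson observations and fitted means from some model H0
y = np.array([2., 4., 3., 7., 5., 9.])
mu_hat = np.array([2.5, 3.5, 3.5, 6.0, 6.0, 8.5])

# Evaluate (2.10) directly: theta = log(mu), b(theta) = exp(theta), phi_i = 1
theta_s, theta_0 = np.log(y), np.log(mu_hat)  # saturated and H0 estimates
b = np.exp
L0 = 2 * np.sum(y * (theta_s - theta_0) - (b(theta_s) - b(theta_0)))
print(L0)
```

A small value of \(L_0\), as here, indicates fitted values close to the observations.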

Notes

  1. The log likelihood ratio statistic (2.9) for testing \(H_0\) against a nonsaturated alternative \(H_1\) can be written as \[\begin{align} L_{01}&=2\log f_{Y}(y;\hat{\theta}^{(1)})-2\log f_{Y}(y;\hat{\theta}^{(0)}) \notag \\ &=[2\log f_{Y}(y;\hat{\theta}^{(s)})-2\log f_{Y}(y;\hat{\theta}^{(0)})] -[2\log f_{Y}(y;\hat{\theta}^{(s)})-2\log f_{Y}(y;\hat{\theta}^{(1)})] \notag \\ &=L_{0}-L_{1}. \tag{2.11} \end{align}\] Therefore the log likelihood ratio statistic for comparing two nested models is the difference between their scaled deviances. Furthermore, as \(p-q=(n-q-1)-(n-p-1)\), the degrees of freedom for the test is the difference in degrees of freedom of the two models.
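The identity (2.11) is easy to check numerically; the sketch below uses invented Poisson counts with closed-form fits (the overall mean under \(H_0\), group means under \(H_1\), and \(\hat{\mu}=y\) for the saturated model):

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical Poisson counts in two groups (phi_i = 1); invented data
y = np.array([3., 5., 4., 6., 9., 11., 8., 12.])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])

ell = lambda mu: poisson.logpmf(y, mu).sum()  # Poisson log likelihood

mu_s = y                                                  # saturated: mu_hat = y
mu_0 = np.full_like(y, y.mean())                          # H0: common mean
mu_1 = np.where(group == 0, y[:4].mean(), y[4:].mean())   # H1: group means

L0 = 2 * (ell(mu_s) - ell(mu_0))   # scaled deviance of H0
L1 = 2 * (ell(mu_s) - ell(mu_1))   # scaled deviance of H1
L01 = 2 * (ell(mu_1) - ell(mu_0))  # direct likelihood ratio statistic (2.9)
print(L0 - L1, L01)                # identity (2.11): the two agree exactly
```

The degrees of freedom for the comparison follow the same subtraction: \((n-q-1)-(n-p-1)=p-q\).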

  2. An alternative goodness of fit statistic for a model \(H_0\) is Pearson’s \(X^2\) given by \[\begin{equation} X^2=\sum_{i=1}^n {{(y_i-\hat{\mu}_i^{(0)})^2}\over{\widehat{\var}(Y_i)}}. \tag{2.12} \end{equation}\] \(X^2\) is small when the squared differences between observed and fitted values (scaled by variance) are small. Hence, large values of \(X^2\) correspond to poorly fitting models. In fact, \(X^2\) and \(L_{0}\) are asymptotically equivalent. However, the asymptotic \(\chi^2_{n-q-1}\) approximation associated with \(X^2\) is often more reliable.
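As a minimal sketch of (2.12), assuming Poisson responses so that \(\widehat{\var}(Y_i)=\hat{\mu}_i^{(0)}\) (the data and fitted values are invented, as is the assumed parameter count):

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical Poisson data and fitted means from a model H0 with,
# say, q + 1 = 2 linear parameters
y = np.array([2., 4., 3., 7., 5., 9.])
mu_hat = np.array([2.5, 3.5, 3.5, 6.0, 6.0, 8.5])

# Pearson's X^2 (2.12): for the Poisson, var(Y_i) is estimated by mu_hat_i
X2 = np.sum((y - mu_hat) ** 2 / mu_hat)

# Goodness of fit: refer X^2 to chi^2 on n - q - 1 = 4 degrees of freedom
p_value = chi2.sf(X2, df=len(y) - 2)
print(X2, p_value)
```

A large p-value, as here, gives no evidence against \(H_0\) as an adequate fit.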