\( \newcommand{\bm}[1]{\boldsymbol{\mathbf{#1}}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\corr}{corr} \newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\nindep}{\perp\!\!\!\perp\!\!\!\!\!\!/\;\;} \)

2.5 Models with an unknown dispersion parameter

Thus far, we have assumed that the \(\phi_i\) are known. This is the case for both the Poisson and Bernoulli distributions, where \(\phi=1\). Neither the scaled deviance (2.10) nor the Pearson \(X^2\) statistic (2.12) can be evaluated unless \(a(\phi)\) is known. Therefore, when the \(\phi_i\) are not known, we cannot use the scaled deviance as a measure of goodness of fit, or to compare models using (2.11).

Progress is possible if we assume that \(\phi_i=\sigma^2/m_i,\; i = 1, \ldots, n\), where \(\sigma^2\) is a common unknown dispersion parameter and \(m_1,\ldots ,m_n\) are known weights. (A normal linear model takes this form if we assume that \(\var(Y_i)=\sigma^2,\; i = 1, \ldots, n\), in which case \(m_i=1,\; i = 1, \ldots, n\).) Under this assumption \[\begin{align} L_{0}&={2\over\sigma^2}\sum_{i=1}^n m_i\left\{y_i[\hat{\theta}^{(s)}_i-\hat{\theta}^{(0)}_i] -[b(\hat{\theta}^{(s)}_i)-b(\hat{\theta}^{(0)}_i)]\right\} \notag \\ &\equiv{1\over\sigma^2}D_{0} \tag{2.13} \end{align}\] where \(D_{0}\) can be calculated from the observed data without knowledge of \(\sigma^2\). We call \(D_{0}\) the deviance of the model.
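For example, for the normal linear model mentioned above (so \(m_i=1\), \(\theta_i=\mu_i\) and \(b(\theta)=\theta^2/2\), with \(\hat{\theta}^{(s)}_i=y_i\) under the saturated model), (2.13) gives \[ D_0=2\sum_{i=1}^n\left\{y_i\left(y_i-\hat{\mu}^{(0)}_i\right)-\tfrac{1}{2}\left(y_i^2-\left[\hat{\mu}^{(0)}_i\right]^2\right)\right\} =\sum_{i=1}^n\left(y_i-\hat{\mu}^{(0)}_i\right)^2, \] the residual sum of squares, so \(L_0=D_0/\sigma^2\) is the familiar scaled residual sum of squares.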

In order to compare nested models \(H_0\) and \(H_1\), one might calculate the test statistic \[\begin{equation} F={{L_{01}/(p-q)}\over{L_{1}/(n-p-1)}}={{(L_{0}-L_{1})/(p-q)}\over{L_{1}/(n-p-1)}} ={{(D_{0}-D_{1})/(p-q)}\over{D_{1}/(n-p-1)}}. \tag{2.14} \end{equation}\] This statistic does not depend on the unknown dispersion parameter \(\sigma^2\), so it can be calculated from the observed data. Asymptotically, under \(H_0\), \(L_{01}\) has a \(\chi^2_{p-q}\) distribution, and \(L_{01}\) and \(L_{1}\) are independent (not proved here). If, in addition, \(L_1\) has an approximate \(\chi^2_{n-p-1}\) distribution (as we might expect when \(H_1\) is an adequate fit), then \(F\) has an approximate \(F\) distribution with \(p-q\) degrees of freedom in the numerator and \(n-p-1\) degrees of freedom in the denominator. Hence, we compare nested generalised linear models by calculating \(F\) and rejecting \(H_0\) in favour of \(H_1\) if \(F\) is too large.
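As a concrete illustration, the following Python sketch computes the statistic (2.14) and the corresponding upper-tail p-value from an \(F_{p-q,\,n-p-1}\) reference distribution. The function name and the deviances in the example are hypothetical, not part of these notes.

```python
# Sketch: F-test of (2.14) for nested GLMs when the dispersion is unknown.
# D0 and D1 are the deviances of the null model (q covariates) and the
# larger model (p covariates); n is the number of observations.
from scipy import stats


def nested_glm_f_test(D0, D1, n, p, q):
    """Return the F statistic of (2.14) and its approximate p-value."""
    df_num = p - q          # numerator degrees of freedom
    df_den = n - p - 1      # denominator degrees of freedom
    F = ((D0 - D1) / df_num) / (D1 / df_den)
    p_value = stats.f.sf(F, df_num, df_den)  # P(F_{df_num, df_den} > F)
    return F, p_value


# Hypothetical example: reject H0 in favour of H1 when F is too large.
F, p_value = nested_glm_f_test(D0=120.4, D1=85.2, n=50, p=5, q=2)
```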

The dependence of the maximum likelihood equations \(u(\hat{\beta})=0\) on \(\sigma^2\) (where \(u\) is given by (2.4)) can be eliminated by multiplying through by \(\sigma^2\), so \(\hat{\beta}\) itself does not depend on \(\sigma^2\). However, inference based on the maximum likelihood estimates, as described in Section 2.3, does require knowledge of \(\sigma^2\). This is because, asymptotically, \(\var(\hat{\beta})\) is the inverse of the Fisher information matrix \({\cal I}(\beta)=X^TWX\), which depends on \(w_i={1\over{\var(Y_i)g'(\mu_i)^2}}\), where \(\var(Y_i)=\phi_ib''(\theta_i)=\sigma^2 b''(\theta_i)/m_i\) here.
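For illustration only, the Python sketch below assembles \(X^TWX\) with \(w_i=1/\{\var(Y_i)g'(\mu_i)^2\}\) and \(\var(Y_i)=\sigma^2V(\mu_i)/m_i\), and inverts it to approximate \(\var(\hat{\beta})\). All argument names are hypothetical, and an estimate of \(\sigma^2\) is assumed to be supplied (as discussed in the next paragraph).

```python
# Sketch: approximate var(beta_hat) = (X^T W X)^{-1}, with the unknown
# dispersion sigma^2 replaced by a supplied estimate sigma2_hat.
import numpy as np


def beta_covariance(X, mu_hat, m, sigma2_hat, V, g_prime):
    """Inverse Fisher information with w_i = 1 / (var(Y_i) * g'(mu_i)^2),
    where var(Y_i) = sigma2_hat * V(mu_i) / m_i."""
    var_y = sigma2_hat * V(mu_hat) / m            # estimated var(Y_i)
    w = 1.0 / (var_y * g_prime(mu_hat) ** 2)      # working weights w_i
    information = X.T @ (w[:, None] * X)          # X^T W X
    return np.linalg.inv(information)


# Standard errors are the square roots of the diagonal entries:
# se = np.sqrt(np.diag(beta_covariance(X, mu_hat, m, sigma2_hat, V, g_prime)))
```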

Therefore, to calculate standard errors and confidence intervals, we need to supply an estimate \(\hat{\sigma}^2\) of \(\sigma^2\). Rather than use the maximum likelihood estimate, it is more common to base an estimator of \(\sigma^2\) on the Pearson \(X^2\) statistic. As \(\var(Y_i)=\phi_iV(\mu_i)=\sigma^2 V(\mu_i)/m_i\) here (where the variance function \(V(\mu)\) is \(b''(\theta)\), written in terms of \(\mu\)), (2.12) becomes \[\begin{equation} X^2={1\over\sigma^2} \sum_{i=1}^n {{m_i(y_i-\hat{\mu}_i^{(0)})^2}\over{{V}(\hat{\mu}_i^{(0)})}}. \tag{2.15} \end{equation}\] Exercise 8: Making the assumption that, if \(H_0\) is an adequate fit, \(X^2\) has a chi-squared distribution with \(n-q-1\) degrees of freedom, show that \[ \hat{\sigma}^2\equiv{1\over{n-q-1}} \sum_{i=1}^n {{m_i(y_i-\hat{\mu}_i^{(0)})^2}\over{{V}(\hat{\mu}_i^{(0)})}} \] is an approximately unbiased estimator of \(\sigma^2\). Suggest an alternative estimator based on the deviance \(D_0\).
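Separately from the exercise, here is a minimal Python sketch of the Pearson-based estimator \(\hat{\sigma}^2\) above. The names (`pearson_dispersion`, `V`, `n_params`) are illustrative assumptions; \(V\) is the family's variance function, for example \(V(\mu)=\mu^2\) for the gamma family.

```python
# Sketch: Pearson-based moment estimate of the dispersion sigma^2, i.e. the
# estimator sigma2_hat defined above, with n_params = q + 1 fitted parameters.
import numpy as np


def pearson_dispersion(y, mu_hat, m, V, n_params):
    """Sum of m_i (y_i - mu_hat_i)^2 / V(mu_hat_i), divided by the
    residual degrees of freedom n - n_params."""
    X2_unscaled = np.sum(m * (y - mu_hat) ** 2 / V(mu_hat))
    return X2_unscaled / (len(y) - n_params)


# Hypothetical gamma-family example, where V(mu) = mu^2:
# sigma2_hat = pearson_dispersion(y, mu_hat, m, lambda mu: mu ** 2, q + 1)
```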