\( \newcommand{\bm}[1]{\boldsymbol{\mathbf{#1}}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\corr}{corr} \newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\nindep}{\perp\!\!\!\perp\!\!\!\!\!\!/\;\;} \)

2.2 Maximum likelihood estimation

As usual, we maximise the log likelihood function which, from (2.1), can be written \[\begin{equation} \log f_{Y}(y;\beta,\phi)= \sum_{i=1}^n{{y_i\theta_i-b(\theta_i)}\over{\phi_i}} +\sum_{i=1}^nc(y_i,\phi_i) \tag{2.3} \end{equation}\] and depends on \(\beta\) through \[\begin{align*} &\mu_i=b'(\theta_i)\qquad i = 1, \ldots, n\cr &g(\mu_i)=\eta_i\qquad i = 1, \ldots, n\cr &\eta_i=x_i^T\beta=\sum_{j=0}^p x_{ij}\beta_j\qquad i = 1, \ldots, n. \end{align*}\] To find \(\hat\beta\), we solve the equations \(u(\hat\beta)=0\) where \(u\) is the score vector whose components are given by \[\begin{align} u_k(\beta)&\equiv{\partial\over{\partial\beta_k}}\log f_{Y}(y;\beta) \notag \\ &=\sum_{i=1}^n{\partial\over{\partial\beta_k}} \left[{{y_i\theta_i-b(\theta_i)}\over{\phi_i}}\right] \notag \\ &=\sum_{i=1}^n{\partial\over{\partial\theta_i}}\left[{{y_i\theta_i-b(\theta_i)} \over{\phi_i}}\right]{{\partial\theta_i}\over{\partial\mu_i}} {{\partial\mu_i}\over{\partial\eta_i}}{{\partial\eta_i}\over{\partial\beta_k}}\qquad k = 0, \ldots, p \notag \\ &=\sum_{i=1}^n{{y_i-b'(\theta_i)}\over{\phi_i}} {{x_{ik}}\over{b''(\theta_i)g'(\mu_i)}} \notag \\ &=\sum_{i=1}^n{{y_i-\mu_i}\over{\var(Y_i)}} {{x_{ik}}\over{g'(\mu_i)}}\qquad k = 0, \ldots, p \tag{2.4} \end{align}\] which depends on \(\beta\) through \(\mu_i\equiv E(Y_i)\) and \(\var(Y_i),\) \(i = 1, \ldots, n\).
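
As a small added illustration (Poisson responses with the canonical log link, so that \(\phi_i=1\), \(\var(Y_i)=\mu_i\) and \(g'(\mu_i)=1/\mu_i\)), (2.4) reduces to \[ u_k(\beta)=\sum_{i=1}^n (y_i-\mu_i)x_{ik}, \qquad \mu_i=\exp(x_i^T\beta),\qquad k = 0, \ldots, p, \] a set of equations that is non-linear in \(\beta\) and has no closed-form solution in general.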

In practice, these equations are usually non-linear and have no analytic solution. Therefore, we rely on numerical methods to solve them.

First, we note that the Hessian and Fisher information matrices can be derived directly from (2.4), as \[\begin{align*} [H(\beta)]_{jk}&={{\partial^2}\over{\partial\beta_j\partial\beta_k}}\log f_{Y}(y;\beta)\cr &={\partial\over{\partial\beta_j}}\sum_{i=1}^n{{y_i-\mu_i}\over{\var(Y_i)}} {{x_{ik}}\over{g'(\mu_i)}}\cr &=\sum_{i=1}^n{{-{{\partial\mu_i}\over{\partial\beta_j}}}\over{\var(Y_i)}} {{x_{ik}}\over{g'(\mu_i)}}+\sum_{i=1}^n(y_i-\mu_i){\partial\over{\partial\beta_j}} \left[{{x_{ik}}\over{\var(Y_i) g'(\mu_i)}}\right]\cr \end{align*}\] and \[ [{\cal I}(\beta)]_{jk}=E[-H(\beta)]_{jk} =\sum_{i=1}^n{{{{\partial\mu_i}\over{\partial\beta_j}}}\over{\var(Y_i)}} {{x_{ik}}\over{g'(\mu_i)}} =\sum_{i=1}^n{{x_{ij}x_{ik}}\over{\var(Y_i)g'(\mu_i)^2}}. \]
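
Continuing the Poisson/log-link illustration, \(\var(Y_i)g'(\mu_i)=\mu_i\times(1/\mu_i)=1\) does not depend on \(\beta\), so the second sum in \([H(\beta)]_{jk}\) vanishes and \[ [{\cal I}(\beta)]_{jk}=-[H(\beta)]_{jk}=\sum_{i=1}^n\mu_i x_{ij}x_{ik}, \qquad \mu_i=\exp(x_i^T\beta). \] (This is a special case of the canonical link result in Note 1 below.)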

Exercise 6: Show that we can write \[\begin{equation} {\cal I}(\beta)=X^TWX \tag{2.5} \end{equation}\] where \(X\) is the design matrix and \[ W={\rm diag}(w)= \begin{pmatrix} w_1&0&\cdots&0\cr 0&w_2&&\vdots\cr \vdots&&\ddots&0\cr 0&\cdots&0&w_n \end{pmatrix} \] where \[ w_i={1\over{\var(Y_i)g'(\mu_i)^2}}\qquad i = 1, \ldots, n. \] The Fisher information matrix \({\cal I}(\beta)\) depends on \(\beta\) through \(\mu_i\) and \(\var(Y_i),\; i = 1, \ldots, n\).

The scores in (2.4) may now be written as \[\begin{align*} u_k(\beta)&=\sum_{i=1}^n(y_i-\mu_i)x_{ik}w_ig'(\mu_i)\cr &=\sum_{i=1}^n x_{ik}w_iz_i\qquad k = 0, \ldots, p \end{align*}\] where \[ z_i=(y_i-\mu_i)g'(\mu_i)\qquad i = 1, \ldots, n. \] Therefore \[\begin{equation} u(\beta)=X^TWz. \tag{2.6} \end{equation}\]
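
The quantities \(W\), \(z\), \(u(\beta)=X^TWz\) and \({\cal I}(\beta)=X^TWX\) are straightforward to evaluate numerically. The following is a minimal numpy sketch, assuming the user supplies functions for the variance, the inverse link and \(g'\); the function and argument names here are illustrative and not taken from any particular library.

```python
import numpy as np

def score_and_information(beta, X, y, var_fun, g_inv, g_prime):
    """Evaluate u(beta) = X'Wz and I(beta) = X'WX for a GLM (a sketch).

    var_fun(mu) : variance of Y_i as a function of its mean mu_i
    g_inv(eta)  : inverse link, mu_i = g^{-1}(eta_i)
    g_prime(mu) : derivative of the link function, g'(mu_i)
    """
    eta = X @ beta                               # linear predictor eta_i = x_i^T beta
    mu = g_inv(eta)                              # fitted means mu_i
    w = 1.0 / (var_fun(mu) * g_prime(mu) ** 2)   # weights w_i = 1 / {var(Y_i) g'(mu_i)^2}
    z = (y - mu) * g_prime(mu)                   # z_i = (y_i - mu_i) g'(mu_i)
    u = X.T @ (w * z)                            # score vector u(beta) = X^T W z
    info = (X * w[:, None]).T @ X                # Fisher information X^T W X
    return u, info
```

For the Poisson/log-link example one would pass, for instance, `var_fun=lambda mu: mu`, `g_inv=np.exp` and `g_prime=lambda mu: 1/mu`.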

One possible method to solve the \(p+1\) simultaneous equations \(u(\hat\beta)=0\) that give \(\hat\beta\) is the (multivariate) Newton-Raphson method. If \(\beta^{t}\) is the current estimate of \(\hat\beta\) then the next estimate is \[\begin{equation} \beta^{t+1}=\beta^{t}-H(\beta^{t})^{-1}u(\beta^{t}). \tag{2.7} \end{equation}\] In practice, an alternative to Newton-Raphson replaces \(H(\beta)\) in (2.7) with \(E[H(\beta)]\equiv-{\cal I}(\beta)\). Therefore, if \(\beta^{t}\) is the current estimate of \(\hat\beta\) then the next estimate is \[\begin{equation} \beta^{t+1}=\beta^{t}+{\cal I}(\beta^{t})^{-1}u(\beta^{t}). \tag{2.8} \end{equation}\] The resulting iterative algorithm is called Fisher scoring. Notice that if we substitute (2.5) and (2.6) into (2.8) we get \[\begin{align*} \beta^{t+1}&=\beta^{t}+[X^TW^tX]^{-1}X^TW^tz^t\cr &=[X^TW^tX]^{-1}[X^TW^tX\beta^{t}+X^TW^tz^t]\cr &=[X^TW^tX]^{-1}X^TW^t[X\beta^{t}+z^t]\cr &=[X^TW^tX]^{-1}X^TW^t[\eta^{t}+z^t] \end{align*}\] where \(\eta^{t},\,W^t\) and \(z^t\) are all functions of \(\beta^t\).

This is a weighted least squares equation, as \(\beta^{t+1}\) minimises the weighted sum of squares \[ (\eta+z-X\beta)^TW(\eta+z-X\beta)= \sum_{i=1}^n w_i\left(\eta_i+z_i-x_i^T\beta\right)^2 \] as a function of \(\beta\) where \(w_1,\ldots ,w_n\) are the weights and \(\eta+z\) is called the adjusted dependent variable.
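
To make the connection explicit, differentiating this weighted sum of squares with respect to \(\beta\) and setting the derivative to zero gives the weighted normal equations \[ X^TW(\eta+z-X\beta)=0 \qquad\Longleftrightarrow\qquad \beta=[X^TWX]^{-1}X^TW[\eta+z], \] which is exactly the Fisher scoring update, with \(\eta\), \(W\) and \(z\) held fixed at their values under \(\beta^t\).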

Therefore, the Fisher scoring algorithm proceeds as follows.

  1. Choose an initial estimate \(\beta^0\) for \(\hat\beta\) and set \(t=0\).
  2. Evaluate \(\eta^t,\,W^t\) and \(z^t\) at \(\beta^t\).
  3. Calculate \(\beta^{t+1} =[X^TW^tX]^{-1}X^TW^t[\eta^{t}+z^t]\).
  4. If \(||\beta^{t+1}-\beta^t||>\) some prespecified (small) tolerance then set \(t\to t+1\) and go to 2.
  5. Use \(\beta^{t+1}\) as the solution for \(\hat\beta\).

As this algorithm involves iteratively minimising a weighted sum of squares, it is sometimes known as iteratively (re)weighted least squares.
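
As a concrete (and deliberately minimal) sketch, the following numpy code implements the five steps above for the Poisson/log-link example used earlier, where \(w_i=\mu_i\) and \(z_i=(y_i-\mu_i)/\mu_i\); the starting value, tolerance and iteration cap are illustrative choices rather than prescriptions.

```python
import numpy as np

def irls_poisson_log(X, y, tol=1e-8, max_iter=25):
    """Fisher scoring / IRLS for a Poisson GLM with log link (a sketch).

    For this model var(Y_i) = mu_i and g'(mu_i) = 1/mu_i, so
    w_i = mu_i and z_i = (y_i - mu_i) / mu_i.
    """
    beta = np.zeros(X.shape[1])          # step 1: initial estimate beta^0
    for _ in range(max_iter):
        eta = X @ beta                   # step 2: eta^t = X beta^t
        mu = np.exp(eta)                 #         mu_i = exp(eta_i)
        w = mu                           #         weights w_i
        z = (y - mu) / mu                #         adjusted residuals z_i
        XtW = X.T * w                    # X^T W (W diagonal)
        beta_new = np.linalg.solve(XtW @ X, XtW @ (eta + z))  # step 3
        if np.linalg.norm(beta_new - beta) < tol:             # step 4
            return beta_new              # step 5: converged
        beta = beta_new
    return beta                          # return last iterate if not converged
```

For a design matrix `X` whose first column is a vector of ones, `beta_hat = irls_poisson_log(X, y)` would then return the maximum likelihood estimate.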

Notes

  1. Recall that the canonical link function is \(g(\mu)=b'^{-1}(\mu)\) and with this link \(\eta_i=g(\mu_i)=\theta_i\). Then \[ {1\over{g'(\mu_i)}} ={{\partial\mu_i}\over{\partial\theta_i}}=b''(\theta_i)\qquad i = 1, \ldots, n. \] Therefore \(\var(Y_i)g'(\mu_i)=\phi_i\) which does not depend on \(\beta\), and hence \({\partial\over{\partial\beta_j}}\left[{{x_{ik}}\over{\var(Y_i)g'(\mu_i)}}\right]=0\) for all \(j=0,\ldots ,p\). It follows that \(H(\beta)=-{\cal I}(\beta)\) and, for the canonical link, Newton-Raphson and Fisher scoring are equivalent.
  2. (Exercise 7) The linear model is a generalised linear model with identity link, \(\eta_i=g(\mu_i)=\mu_i\) and \(\var(Y_i)=\sigma^2\) for all \(i = 1, \ldots, n\). For this model, show that \(w_i=\sigma^{-2}\) and \(z_i=y_i-\eta_i,\, i = 1, \ldots, n\). Hence, show that the Fisher scoring algorithm converges in a single iteration, from any starting point, to the usual least squares estimate (a numerical check of this result is sketched after these notes).
  3. Estimation of an unknown dispersion parameter \(\sigma^2\) is discussed later. A common \(\sigma^2\) has no effect on \(\hat\beta\).
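
As a quick numerical check of Note 2 (Exercise 7), the following sketch simulates data from a linear model and verifies that a single Fisher scoring step from an arbitrary starting point already equals the least squares estimate; the simulated data and seed are arbitrary.

```python
import numpy as np

# Numerical check of Exercise 7: identity link, constant variance.
# Here w_i = 1/sigma^2 (which cancels) and z_i = y_i - eta_i, so one update gives
# beta^{t+1} = (X'X)^{-1} X'(eta^t + y - eta^t) = (X'X)^{-1} X'y.
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ rng.normal(size=p + 1) + rng.normal(size=n)

beta_t = rng.normal(size=p + 1)                          # arbitrary starting point
eta = X @ beta_t
z = y - eta                                              # z_i = y_i - eta_i
beta_next = np.linalg.solve(X.T @ X, X.T @ (eta + z))    # one Fisher scoring step
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]           # least squares estimate
print(np.allclose(beta_next, beta_ls))                   # True: one-step convergence
```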