2.5 Generalised linear mixed models
2.5.1 Model setup
Generalised linear mixed models (GLMMs) generalise LMMs to non-normal data in the obvious way: \[\begin{equation} Y_i \sim F(\cdot\mid\mu_i,\sigma^2),\quad g(\mu)\equiv \left(\begin{array}{c}g(\mu_1)\\\vdots \\ g(\mu_n)\end{array}\right)=X\beta+Zb,\;\; b\sim N(0,\Sigma_b) \tag{2.10} \end{equation}\] where \(F(\cdot\mid\mu_i,\sigma^2)\) is an exponential family distribution with \(E(Y)=\mu\) and \(\operatorname{var}(Y)=\sigma^2V(\mu)/m\) for known \(m\). Commonly (e.g. binomial, Poisson) \(\sigma^2=1\), and we shall assume this from here on.
The distribution of the random effects \(b\) need not be normal, but normality is the usual and most convenient choice. It is possible (though beyond the scope of this module) to relax this assumption.
Example 2.6
A plausible GLMM for binary data in \(k\) clusters with \(n_1,\ldots , n_k\) observations per cluster, and a single explanatory variable \(x\) (e.g. the toxoplasmosis data at individual level) is \[\begin{equation} Y_{ij}\sim \text{Bernoulli}(\mu_{ij}),\quad \log\frac{\mu_{ij}}{1-\mu_{ij}} =\beta_0+b_{0i} + \beta_1 x_{ij} ,\quad b_{0i}\sim N(0,\sigma^2_b) \tag{2.11} \end{equation}\] Note there is no random slope here. This fits into the general GLMM framework (2.10) with \[X=\begin{pmatrix} 1 & x_{11}\\ \vdots&\vdots\\ 1& x_{k n_k}\end{pmatrix},\quad Z=\begin{pmatrix} Z_1&0&0\\ 0&\ddots&0\\ 0&0&Z_k \end{pmatrix},\quad Z_i=\begin{pmatrix} 1\\ \vdots\\ 1\end{pmatrix},\] \[\beta=(\beta_0, \beta_1)^T ,\quad b=(b_{01},\ldots,b_{0k})^T ,\quad \Sigma_b=\sigma^2_b I_k.\] The equivalent binomial representation for the city-level data has clusters of size 1 (a single binomial observation per city).
2.5.2 GLMM likelihood
The marginal distribution for the observed \(Y\) in a GLMM does not usually have a convenient closed-form representation. \[\begin{align} f(y\mid \beta,\Sigma_b)&= \int f(y\mid \beta,b,\Sigma_b)f(b\mid \beta,\Sigma_b)db\notag\\ &= \int f(y\mid \beta,b)f(b\mid \Sigma_b)db\notag\\ &=\int \prod_{i=1}^n f\left(y_i\mid g^{-1}([X\beta+Zb]_i)\right)f(b\mid \Sigma_b)db. \tag{2.12} \end{align}\]
For nested random effects structures, some simplification is possible. For example, for (2.11) \[f(y\mid \beta,\sigma^2_b) = \prod_{i=1}^k \int \prod_j f \left(y_{ij}\mid g^{-1}(x_{ij}^T \beta+b_i)\right)\phi(b_i\mid 0,\sigma^2_b)db_i,\] a product of one-dimensional integrals. The factorisation arises because \(Z\) is block diagonal, so each cluster's observations depend on a single component of \(b\), and the components of \(b\) are independent.
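To make this concrete, the one-dimensional cluster integral can be evaluated by direct numerical integration. The following Python sketch (illustrative only; the function and argument names are not from the notes) computes one cluster's likelihood contribution for model (2.11):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def cluster_lik_exact(y, x, beta, sigma_b):
    """One cluster's likelihood contribution for model (2.11),
    by direct numerical integration over the random intercept b_i.
    y, x: arrays of the cluster's responses and covariate values;
    beta = (beta_0, beta_1)."""
    def integrand(b):
        eta = beta[0] + b + beta[1] * x          # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))          # inverse logit link
        # Bernoulli likelihood of the cluster times the N(0, sigma_b^2) density
        return np.prod(mu**y * (1.0 - mu)**(1 - y)) * norm.pdf(b, 0.0, sigma_b)
    val, _ = quad(integrand, -np.inf, np.inf)
    return val
```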
Fitting a GLMM by likelihood methods requires some method for approximating the integrals involved.
The most reliable approach, when the integrals are of low dimension, is Gaussian quadrature (see APTS: Statistical Computing). For example, for a one-dimensional cluster-level random intercept \(b_i\) we might use \[\int \prod_j f \left(y_{ij}\mid g^{-1}(x_{ij}^T \beta+b_i)\right)\phi(b_i\mid 0,\sigma^2_b)db_i \approx \sum_{q=1}^Q w_q \prod_j f \left(y_{ij}\mid g^{-1}(x_{ij}^T \beta+b_{iq})\right)\] for suitably chosen weights \((w_q, q=1,\ldots ,Q)\) and quadrature points \((b_{iq}, q=1,\ldots ,Q)\).
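As a sketch of this rule for the random-intercept logistic model (2.11): substituting \(b_i=\sqrt{2}\,\sigma_b t\) puts the integral into Gauss–Hermite form, so the nodes and weights can be taken from numpy. The names below are illustrative:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def cluster_loglik(y, x, beta, sigma_b, Q=20):
    """Gauss-Hermite approximation (Q nodes) to the log of one cluster's
    likelihood contribution in model (2.11)."""
    t, w = hermgauss(Q)                  # nodes/weights for weight function exp(-t^2)
    b = np.sqrt(2.0) * sigma_b * t       # change of variables b = sqrt(2) * sigma_b * t
    eta = beta[0] + b[:, None] + beta[1] * x[None, :]    # Q x n_i linear predictors
    mu = 1.0 / (1.0 + np.exp(-eta))      # inverse logit link
    lik = np.prod(mu**y * (1.0 - mu)**(1 - y), axis=1)   # cluster likelihood at each node
    return np.log(w @ lik / np.sqrt(np.pi))
```

Summing cluster_loglik over the \(k\) clusters gives the approximate log-likelihood, which can then be passed to a general-purpose optimiser; the result can be checked against cluster_lik_exact above.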
Effective quadrature approaches use information about the mode and dispersion of the integrand (can be done adaptively). For multi-dimensional \(b_i\), quadrature rules can be applied recursively, but performance (in fixed time) diminishes rapidly with dimension.
An alternative approach is to use a Laplace approximation to the likelihood. Writing \[h(b) = \prod_{i=1}^n f\left(y_i\mid g^{-1}([X\beta+Zb]_i)\right)f(b\mid \Sigma_b)\] for the integrand of the likelihood, a (first-order) Laplace approximation approximates \(h(.)\) as an unnormalised multivariate normal density function \[\tilde h(b) = c \, \phi_k(b; \hat b, V),\] where
- \(\hat b\) is found by maximising \(\log h(.)\) over \(b\).
- the variance matrix \(V\) is chosen so that the curvature of \(\log h(.)\) and \(\log \tilde h(.)\) agree at \(\hat b\).
- \(c\) is chosen so that \(\tilde h(\hat b) = h(\hat b).\)
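A one-dimensional sketch of this for a single cluster of model (2.11), under the same illustrative conventions as before (finite differences stand in for an analytic second derivative):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cluster_loglik_laplace(y, x, beta, sigma_b):
    """First-order Laplace approximation to the log of one cluster's
    likelihood contribution in model (2.11)."""
    def neg_log_h(b):
        # -log of the integrand: Bernoulli log-likelihood plus N(0, sigma_b^2) log-density
        eta = beta[0] + b + beta[1] * x
        llik = np.sum(y * eta - np.log1p(np.exp(eta)))
        lprior = -0.5 * np.log(2.0 * np.pi * sigma_b**2) - b**2 / (2.0 * sigma_b**2)
        return -(llik + lprior)

    res = minimize_scalar(neg_log_h)     # b_hat maximises log h(.)
    b_hat, log_h_hat = res.x, -res.fun
    # match the curvature of log h(.) at b_hat; V = 1 / curvature
    eps = 1e-5
    curv = (neg_log_h(b_hat + eps) - 2.0 * neg_log_h(b_hat)
            + neg_log_h(b_hat - eps)) / eps**2
    V = 1.0 / curv
    # Laplace: the integral of h(b) db is approximately h(b_hat) * sqrt(2 * pi * V)
    return log_h_hat + 0.5 * np.log(2.0 * np.pi * V)
```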
The first-order Laplace approximation is equivalent to adaptive Gaussian quadrature with a single quadrature point. Quadrature provides accurate approximations to the likelihood, but for some model structures, particularly those with crossed rather than nested random effects, the likelihood integral is high-dimensional and quadrature is not feasible. In such cases a Laplace approximation is often sufficiently accurate for most purposes, but this is not guaranteed.
Another alternative is penalised quasi-likelihood (PQL), which is very fast but often inaccurate. PQL can fail badly in some cases, particularly with binary observations, and its use is not recommended. Likelihood inference for GLMMs remains an area of active research and vigorous debate.
Example 2.7
For the individual-level model \[\begin{equation} Y_{ij}\sim \text{Bernoulli}(\mu_{ij}),\quad \log\frac{\mu_{ij}}{1-\mu_{ij}}=\beta_0+b_{0i} + \beta_1 x_{ij} ,\quad b_{0i}\sim N(0,\sigma^2_b), \tag{2.13} \end{equation}\] the estimates and standard errors obtained by ML (quadrature), Laplace and PQL are shown in Table 2.1. For the extended model \[\begin{equation} \log\frac{\mu_{ij}}{1-\mu_{ij}}=\beta_0+b_{0i} + \beta_1 x_{ij} + \beta_2 x_{ij}^2+ \beta_3 x_{ij}^3, \tag{2.14} \end{equation}\] the estimates and standard errors are shown in Table 2.2. For this example, there is good agreement between the different computational methods.

Table 2.1: Estimates (standard errors) for model (2.13).

| Parameter | ML | Laplace | PQL |
|---|---|---|---|
| \(\beta_0\) | -0.1384 (1.452) | -0.1343 (1.440) | -0.115 (1.445) |
| \(\beta_1 (\times 10^{6})\) | 7.215 (752) | 5.930 (745.7) | 0.57 (749.2) |
| \(\sigma_b\) | 0.5209 | 0.5132 | 0.4946 |
| AIC | 65.75 | 65.96 | — |
Table 2.2: Estimates (standard errors) for model (2.14).

| Parameter | ML | Laplace | PQL |
|---|---|---|---|
| \(\beta_0\) | -335.5 (137.3) | -335.1 (136.3) | -330.8 (143.4) |
| \(\beta_1\) | 0.5238 (0.2128) | 0.5231 (0.2112) | 0.5166 (0.222) |
| \(\beta_2 (\times 10^{4})\) | -2.710 (1.094) | -2.706 (1.086) | -3 (1.1) |
| \(\beta_3 (\times 10^{8})\) | 4.643 (1.866) | 4.636 (1.852) | 0 (0) |
| \(\sigma_b\) | 0.4232 | 0.4171 | 0.4315 |
| AIC | 63.84 | 63.97 | — |
2.5.3 Bayesian inference for GLMMs
Bayesian inference in GLMMs, as in LMMs, is generally based on the Gibbs sampler. For the GLMM \[\begin{equation*} Y_i \sim F(\cdot\mid \mu_i),\quad g(\mu)=X\beta+Zb,\quad b\sim N(0,\Sigma_b) \end{equation*}\] with corresponding prior densities \(f(\beta)\) and \(f(\Sigma_b)\), we obtain the conditional posterior distributions \[\begin{align*} f(\beta\mid y,\text{rest})&\propto f(\beta)\prod_i f\left(y_i\mid g^{-1}([X\beta+Zb]_i)\right)\\ f(b\mid y,\text{rest})&\propto \phi(b;0,\Sigma_b)\prod_i f\left(y_i\mid g^{-1}([X\beta+Zb]_i)\right)\\ f(\Sigma_b\mid y,\text{rest})&\propto \phi(b;0,\Sigma_b)f(\Sigma_b). \end{align*}\] For a conditionally conjugate choice of \(f(\Sigma_b)\), \(f(\Sigma_b\mid y,\text{rest})\) is straightforward to sample from. The conditionals for \(\beta\) and \(b\) are not generally available for direct sampling, but there are a number of ways of modifying the basic approach to account for this, for example by embedding Metropolis–Hastings updates within the Gibbs sampler.
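As an illustration, the sketch below implements one Metropolis-within-Gibbs sweep for the random-intercept model (2.11). The flat prior on \(\beta\), the inverse-gamma\((a_0, c_0)\) prior on \(\sigma^2_b\), and all names are assumptions made here for illustration, not choices made in the notes:

```python
import numpy as np

rng = np.random.default_rng(1)

def loglik(beta, b, y, X, cluster):
    """Bernoulli log-likelihood for model (2.11); `cluster` maps each
    observation to its cluster index, so b[cluster] is its random intercept."""
    eta = X @ beta + b[cluster]
    return np.sum(y * eta - np.log1p(np.exp(eta)))

def gibbs_sweep(beta, b, sigma2_b, y, X, cluster, a0=1.0, c0=1.0, step=0.1):
    """One sweep: random-walk Metropolis updates for beta (flat prior)
    and each b_i, then a conjugate draw for sigma_b^2."""
    # update beta by random-walk Metropolis
    prop = beta + step * rng.standard_normal(beta.shape)
    if np.log(rng.uniform()) < loglik(prop, b, y, X, cluster) - loglik(beta, b, y, X, cluster):
        beta = prop
    # update each random intercept (the full log-likelihood is recomputed
    # for clarity; only cluster i's terms actually change)
    for i in range(len(b)):
        prop_b = b.copy()
        prop_b[i] += step * rng.standard_normal()
        log_acc = (loglik(beta, prop_b, y, X, cluster) - loglik(beta, b, y, X, cluster)
                   + (b[i]**2 - prop_b[i]**2) / (2.0 * sigma2_b))  # N(0, sigma2_b) prior
        if np.log(rng.uniform()) < log_acc:
            b = prop_b
    # conjugate update: sigma_b^2 | b ~ InvGamma(a0 + k/2, c0 + sum(b^2)/2)
    sigma2_b = 1.0 / rng.gamma(a0 + len(b) / 2.0, 1.0 / (c0 + np.sum(b**2) / 2.0))
    return beta, b, sigma2_b
```

Iterating gibbs_sweep and storing the draws gives samples from the joint posterior; the step size controls the Metropolis acceptance rates and would be tuned in practice.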