4 Linear Mixed Models
In this chapter, we introduce linear mixed models (LMMs), i.e., linear models with random effects. They provide a method for analysing complex datasets containing features such as multilevel/hierarchical structure, longitudinal measurements, or correlation/dependence, to which the linear models in Chapter 3 cannot be applied. We will study both the general concepts/interpretation and some basic theory of LMMs.
4.1 Introduction to Linear Mixed Models
4.1.1 Motivations of LMMs
In the statistical models mentioned earlier, we have generally assumed that all observations are independent of each other. In particular, in the linear regression model, we assume that the error terms independently follow \(N(0, \sigma^2)\), which leads to independent response random variables \(\{Y_1, Y_2,\dots, Y_n\}\).
However, in many practical data problems, such as clustered data or longitudinal data, the above assumption is not valid. We need to consider more sophisticated linear models, in which observations are allowed to be correlated. Linear mixed models, sometimes also referred to as linear mixed effect models, are one of the most popular model classes that can incorporate correlations between responses. They can be viewed as an extension of linear regression to clustered or longitudinal (panel) data, introducing random effects to account for within-group correlations.
We introduce some motivating examples to help you understand LMMs. More mathematical details will be presented in subsequent sections.
Example 4.1 In clustered data, a response is measured for each data point, and each data point belongs to a group (cluster).
For example, consider the maths test scores for all Year 1 students in a primary school, where students are grouped by classroom. Each classroom then forms a cluster, and it is more sensible to assume the scores within each cluster are not independent but somehow correlated, since students in the same classroom take exactly the same courses.
Example 4.2 In longitudinal data, each response is measured at several time points, and the number of time points is fixed.
For example, consider the number of sales of different products in each month of 2021. Here we have 12 time points, and at each time point we assume the sales of different products are correlated.
In linear regression models, all regression coefficients \(\boldsymbol{\beta}=(\beta_1, \dots, \beta_p)^T\) are fixed, i.e., they are the same for all observations \(\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_n\). Therefore we call it a fixed effects model.
In contrast, if the regression coefficients are random variables, we call it a random effects model. A linear mixed model, by definition, is a linear model with both fixed effects and random effects. Usually it refers to a regression model in which the data can be grouped according to one or several observed factors. Such groups can be clusters or data collected at the same time point, as shown in the examples above. Mixing fixed and random effects allows us to make inference or prediction for a specific group, which is essential in some applications.
Random effects can be thought of as missing information on individual subjects that, were it available, would be included in the statistical model. To reflect our not knowing what values to use for the random effects, we model them as a random sample from a distribution. In this way, the random effects induce correlation amongst repeated measurements on the same subject, or amongst measurements within a cluster.
4.1.2 Basics of Linear Mixed Models
In this section, we present linear mixed models in a general form. Let \[\boldsymbol{y}_i=(y_{i1}, y_{i2},\dots, y_{in_i}), \quad i=1,2,\dots, m,\] be the \(n_i\) observed responses within group \(i\), where \(m\) is the number of groups, so that the total number of observations is \(n= \sum_{i=1}^m n_i\). As usual, \(\boldsymbol{y}_i=(y_{i1}, y_{i2},\dots, y_{in_i})\) are assumed to be observations of the random variables \[\boldsymbol{Y}_i=(Y_{i1}, Y_{i2},\dots, Y_{in_i}).\]
A general LMM is then as follows:
\[\begin{aligned} Y_{ij}&=\boldsymbol{x}_{ij}^T\boldsymbol{\beta}+\boldsymbol{u}_{ij}^T\boldsymbol{\gamma}_i+\epsilon_{ij},\qquad j=1,\dots, n_i;\,\, i=1,\dots,m, \end{aligned} \tag{4.1}\]
where, as in the linear model, \(\boldsymbol{x}_{ij}\) is the \(p\times 1\) vector of explanatory variables corresponding to the \(j\)-th component of the \(i\)-th group, associated with the fixed effects, and we write \(\boldsymbol{\beta}=(\beta_1,\dots,\beta_p)^T\) as a \(p\times 1\) parameter vector of fixed effects. We assume the random errors \(\epsilon_{ij}\sim N(0, \sigma^2)\) are all independent.
Differently from the linear model, here we also have \(\boldsymbol{u}_{ij}\), the \(q\times 1\) vector of explanatory variables corresponding to the \(j\)-th component of the \(i\)-th group, associated with the random effects. Usually \(\boldsymbol{u}_{ij}\) is either a set of new covariates or a subset of \(\boldsymbol{x}_{ij}\). We write \(\boldsymbol{\gamma}_i = (\gamma_{i1},\dots,\gamma_{iq})^T\) as the \(q\times 1\) vector of random effects in the \(i\)-th group.
As introduced earlier, unlike \(\boldsymbol{\beta}\), which contains the parameters of interest, \(\boldsymbol{\gamma}_i\) is a vector of random coefficients. For the rest of this section, we assume that \[\boldsymbol{\gamma}_i\sim N(\boldsymbol{0}, \boldsymbol{\mathcal{D}}),\] where \(\boldsymbol{\mathcal{D}}\) is a \(q \times q\) covariance matrix of the random effects. The distributional assumption on the coefficients \(\boldsymbol{\gamma}_i\) resembles the specification of a prior in Bayesian analysis. Some of the computations in LMMs will also correspond to Bayesian methods, as we will see later.
Usually we assume the variance components in an LMM are indexed by (depend on) some parameters \(\boldsymbol{\theta}\), which will be the prime target of statistical inference about the random effects. As a result we can write \(\mathcal{D}=\mathcal{D}_{\boldsymbol{\theta}}\); \(\sigma^2\) is also an element of \(\boldsymbol{\theta}\).
Let \(\boldsymbol{\epsilon}_i=(\epsilon_{i1},\dots, \epsilon_{in_i})^T\) represent the random errors within the \(i\)-th group, which are independent of \(\boldsymbol{\gamma}_i\). In LMMs we also assume independence between groups, meaning that \(\boldsymbol{\gamma}_1, \boldsymbol{\gamma}_2,\dots, \boldsymbol{\gamma}_m\), \(\boldsymbol{\epsilon}_1, \boldsymbol{\epsilon}_2,\dots, \boldsymbol{\epsilon}_m\) are all independent.
The LMM in Equation 4.1 incorporates two sources of randomness: the within-group randomness and the between-group randomness. Thus, it can be interpreted as a two-stage hierarchical model:
- stage 1: specifies the within-group randomness, which is given by fixing \(i\) and letting \(j=1,\dots, n_i\);
- stage 2: specifies the between-group randomness, which is given by letting \(i=1,\dots,m\).
Example 4.3 We consider the following simple LMM to illustrate the correlation introduced by the random effect in the model: \[Y_{ij}= \beta_0 + \gamma_i + \epsilon_{ij}, \quad i=1,\dots,m; \,\,j=1,\dots, n_i,\] where the random effect and the error satisfy: \[\gamma_i\sim N(0, \sigma_\gamma^2), \quad \epsilon_{ij}\sim N(0, \sigma_{\epsilon}^2).\]
In this case, the variance parameters are \(\boldsymbol{\theta}=(\sigma_\gamma^2, \sigma_\epsilon^2)\). This model only includes an intercept, with no covariate in the fixed effect. It can be shown that the correlation between the responses within the \(i\)-th group \(\{Y_{i1}, Y_{i2}, \dots, Y_{in_i}\}\) is: \[\begin{aligned} r^i_{jk}=\mbox{corr}(Y_{ij},Y_{ik})=&\dfrac{\mbox{Cov}(\gamma_i + \epsilon_{ij}, \gamma_i + \epsilon_{ik})}{\sqrt{\mbox{Var}(\gamma_i + \epsilon_{ij})\mbox{Var}(\gamma_i + \epsilon_{ik})}}\\ =&\dfrac{\sigma_\gamma^2}{\sigma_{\gamma}^2+\sigma_{\epsilon}^2}, \qquad j\neq k. \end{aligned}\] Thus, the random effect \(\gamma_i\) introduces correlation between within-group responses. If there is no random effect (i.e., \(\sigma_\gamma=0\)), there is no such correlation. If the within-group error variance is much smaller than the between-group variance, i.e., \(\sigma_{\epsilon}^2 \ll \sigma_{\gamma}^2\), the correlation \(r^i_{jk}\) becomes very high (close to 1).
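As a quick numerical check of this formula (a minimal R sketch; the group sizes and variance values below are arbitrary choices, not part of the example), we can simulate from the random-intercept model and compare the empirical within-group correlation with \(\sigma_\gamma^2/(\sigma_\gamma^2+\sigma_\epsilon^2)\).

```r
# Simulate from Y_ij = beta_0 + gamma_i + eps_ij and check the
# within-group correlation sigma_gamma^2 / (sigma_gamma^2 + sigma_eps^2).
set.seed(1)
m <- 2000          # number of groups (large, so the correlation is estimated well)
beta0 <- 1
sigma_gamma <- 2   # between-group standard deviation
sigma_eps <- 1     # within-group (error) standard deviation

gamma <- rnorm(m, 0, sigma_gamma)               # random effects, one per group
y1 <- beta0 + gamma + rnorm(m, 0, sigma_eps)    # first observation in each group
y2 <- beta0 + gamma + rnorm(m, 0, sigma_eps)    # second observation in each group

cor(y1, y2)                                     # empirical within-group correlation
sigma_gamma^2 / (sigma_gamma^2 + sigma_eps^2)   # theoretical value: 0.8
```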
To present the \(i\)-th group of the LMM in matrix form, let \(\boldsymbol{Y}_i=({Y}_{i1},{Y}_{i2},\dots, {Y}_{in_i})^T\), let \(\boldsymbol{X}_i=(\boldsymbol{x}_{i1},\boldsymbol{x}_{i2},\dots, \boldsymbol{x}_{in_i})^T\) be an \(n_i\times p\) design matrix, and let \(\boldsymbol{U}_i=(\boldsymbol{u}_{i1},\boldsymbol{u}_{i2},\dots, \boldsymbol{u}_{in_i})^T\) be an \(n_i\times q\) design matrix. Therefore, we can write \[\begin{aligned} \boldsymbol{Y}_i&=\boldsymbol{X}_i\boldsymbol{\beta}+\boldsymbol{U}_i\boldsymbol{\gamma}_i+\boldsymbol{\epsilon}_i\\ &=\boldsymbol{X}_i\boldsymbol{\beta} + \boldsymbol{\epsilon}^\ast_i, \qquad \qquad i=1,\dots, m, \end{aligned}\] where \(\boldsymbol{\epsilon}^\ast_i=\boldsymbol{U}_i\boldsymbol{\gamma}_i+\boldsymbol{\epsilon}_i\) can be regarded as the random “errors” in the \(i\)-th group. Note that \(\boldsymbol{\epsilon}_i\sim N(\boldsymbol{0}, \sigma^2\boldsymbol{I}_{n_i})\). Letting \(\boldsymbol{V}_i=\boldsymbol{U}_i \boldsymbol{\mathcal{D}}\boldsymbol{U}_i^T+\sigma^2\boldsymbol{I}_{n_i}\), it is straightforward to see that \(\boldsymbol{\epsilon}^\ast_i\sim N(\boldsymbol{0}, \boldsymbol{V}_i)\), which leads to the marginal model \[\boldsymbol{Y}_i\sim N(\boldsymbol{X}_i\boldsymbol{\beta}, \,\boldsymbol{V}_i).\]
If we want to further present all the \(m\) groups in one big matrix formula, we can simply write \[\boldsymbol{Y}=\begin{pmatrix}\boldsymbol{Y_1}\\ \vdots\\ \boldsymbol{Y_m} \end{pmatrix} \in \mathbb{R}^n, \quad \boldsymbol{X}=\begin{pmatrix}\boldsymbol{X_1}\\ \vdots\\ \boldsymbol{X_m} \end{pmatrix} \in \mathbb{R}^{n\times p},\] and \[\boldsymbol{\gamma}=\begin{pmatrix}\boldsymbol{\gamma_1}\\ \vdots\\ \boldsymbol{\gamma_m} \end{pmatrix} \in \mathbb{R}^{mq}, \quad \boldsymbol{\epsilon}=\begin{pmatrix}\boldsymbol{\epsilon_1}\\ \vdots\\ \boldsymbol{\epsilon_m} \end{pmatrix} \in \mathbb{R}^{n\times 1}.\] Moreover, let \[\boldsymbol{U}=\begin{pmatrix}\boldsymbol{U_1},& \boldsymbol{0}_{n_1\times q}, &\cdots, &\boldsymbol{0}_{n_1\times q}\\ \boldsymbol{0}_{n_2\times q},& \boldsymbol{U_2},\\ \vdots && \ddots\\ \boldsymbol{0}_{n_m\times q}, &&&\boldsymbol{U_m} \end{pmatrix} \in \mathbb{R}^{n\times mq}.\]
Therefore, we can write the LMMs for all groups as \[ \boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{U} \boldsymbol{\gamma} +\boldsymbol{\epsilon}, \tag{4.2}\] where \(\boldsymbol{\epsilon}\sim N(0, \sigma^2 \boldsymbol{I}_n)\) and \(\boldsymbol{\gamma}\sim N(\boldsymbol{0}, \mathcal{G})\), with \[\mathcal{G}=\begin{pmatrix}\boldsymbol{\mathcal{D}} & &\\ & \boldsymbol{\mathcal{D}} &\\ && \ddots\\ &&&\boldsymbol{\mathcal{D}} \end{pmatrix} \in \mathbb{R}^{mq\times mq}. \]
Similarly, if we express \(\boldsymbol{\epsilon}^\ast=\boldsymbol{U}\boldsymbol{\gamma}+\boldsymbol{\epsilon}\), we can write Equation 4.2 as \[\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta} +\boldsymbol{\epsilon}^\ast,\] where \(\boldsymbol{\epsilon}^\ast\sim N(0, \boldsymbol{V})\) with \(\boldsymbol{V}=\boldsymbol{U} \mathcal{G}\boldsymbol{U}^T+\sigma^2\boldsymbol{I}_{n}\). We use \(\boldsymbol{y}\) to denote observations of random vector \(\boldsymbol{Y}\).
As introduced earlier, the covariance matrices \(\mathcal{G}\) and \(\boldsymbol{V}\) depend on the parameters \(\boldsymbol{\theta}\). We can further denote them as \(\mathcal{G}_{\boldsymbol{\theta}}\) and \(\boldsymbol{V}_{\boldsymbol{\theta}}\). For simplicity we omit the subscript for now.
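To make the block structure above concrete, the following sketch assembles \(\boldsymbol{U}\), \(\mathcal{G}\) and \(\boldsymbol{V}\) from the group-level pieces for a small, arbitrary toy setting (all dimensions and values are invented for illustration); only base R matrix operations are used.

```r
# Assemble the stacked matrices of Equation 4.2 for a toy setting with
# m = 3 groups, n_i = 4 observations each, and q = 2 random effects per group.
set.seed(2)
m <- 3; n_i <- 4; q <- 2
sigma2 <- 0.5
D <- matrix(c(1, 0.3, 0.3, 0.5), q, q)                    # covariance of each gamma_i

X_list <- lapply(1:m, function(i) cbind(1, rnorm(n_i)))   # group design matrices (p = 2)
U_list <- X_list                                          # random effects use the same covariates here

X <- do.call(rbind, X_list)                               # n x p, with n = sum(n_i)
n <- nrow(X)

# Block-diagonal U (n x mq) and G (mq x mq)
U <- matrix(0, n, m * q)
for (i in 1:m) {
  rows <- ((i - 1) * n_i + 1):(i * n_i)
  cols <- ((i - 1) * q + 1):(i * q)
  U[rows, cols] <- U_list[[i]]
}
G <- kronecker(diag(m), D)                                # m copies of D on the diagonal

V <- U %*% G %*% t(U) + sigma2 * diag(n)                  # marginal covariance of Y
```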
4.2 LMMs parameter estimation I
Statistical inference for an LMM is typically based on maximum likelihood, and depends on whether the variance component parameters \(\boldsymbol{\theta}\) are known or not. In this section, let us first assume \(\boldsymbol{\theta}\) is known, which means \(\sigma^2\), \(\boldsymbol{\mathcal{D}}\), \(\mathcal{G}\) and \(\boldsymbol{V}\) are all known.
4.2.1 Estimation of \({\boldsymbol \beta}\)
We rewrite the linear mixed model in Equation 4.2 as \[ \boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}^\ast, \quad \boldsymbol{\epsilon}^\ast\sim N(0, \boldsymbol{V}). \tag{4.3}\]
It looks very similar to the linear model in Chapter 3. However, here we cannot directly apply the MLE of \(\boldsymbol{\beta}\) introduced in Section 3.3, as \(\boldsymbol{V}\neq \sigma^2 \boldsymbol{I}_n\).
To solve this problem, note that \[\boldsymbol{V}^{-1/2}\boldsymbol{\epsilon}^\ast\sim N(0, \boldsymbol{V}^{-1/2}\boldsymbol{V}\boldsymbol{V}^{-1/2})=N(0, \boldsymbol{I}_n).\]
Therefore, multiplying both sides of Equation 4.3 by \(\boldsymbol{V}^{-1/2}\), we have \[ \boldsymbol{V}^{-1/2}\boldsymbol{Y}=\boldsymbol{V}^{-1/2}\boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{V}^{-1/2}\boldsymbol{\epsilon}^\ast. \tag{4.4}\]
Letting \(\boldsymbol{Y}'=\boldsymbol{V}^{-1/2}\boldsymbol{Y}\), \(\boldsymbol{X}'=\boldsymbol{V}^{-1/2}\boldsymbol{X}\) and \(\boldsymbol{\epsilon}'=\boldsymbol{V}^{-1/2}\boldsymbol{\epsilon}^\ast\), we can rewrite Equation 4.4 as \[ \boldsymbol{Y}' = \boldsymbol{X}'\boldsymbol{\beta}+ \boldsymbol{\epsilon}', \quad \boldsymbol{\epsilon}'\sim N(0, \boldsymbol{I}_n), \tag{4.5}\]
which is an ordinary multiple linear regression with i.i.d. errors. Similarly to Section 3.3, the MLE of \(\boldsymbol{\beta}\) is \[ \begin{aligned} \hat{\boldsymbol{\beta}}=&(\boldsymbol{X}'^T\boldsymbol{X}')^{-1}\boldsymbol{X}'^T\boldsymbol{y}'\nonumber\\ =& (\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{y}. \end{aligned} \tag{4.6}\]
This is the BLUE (best linear unbiased estimator) of \(\boldsymbol{\beta}\), as in the linear model, given knowledge of \(\boldsymbol{\theta}\).
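A minimal sketch of Equation 4.6 in R, assuming \(\boldsymbol{V}\) is known (the data and covariance matrix below are simulated purely for illustration); note that `solve(V, X)` is used instead of forming \(\boldsymbol{V}^{-1}\) explicitly.

```r
# GLS / MLE of beta for known V (Equation 4.6).
set.seed(3)
n <- 20
X <- cbind(1, rnorm(n))                          # n x p design matrix (p = 2)
beta_true <- c(2, -1)

A <- matrix(rnorm(n * n), n, n)
V <- A %*% t(A) + diag(n)                        # an arbitrary positive definite V

y <- as.vector(X %*% beta_true + t(chol(V)) %*% rnorm(n))   # Y ~ N(X beta, V)

Vinv_X <- solve(V, X)                            # V^{-1} X
Vinv_y <- solve(V, y)                            # V^{-1} y
beta_hat <- solve(t(X) %*% Vinv_X, t(X) %*% Vinv_y)
beta_hat                                         # should be close to (2, -1)
```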
4.2.2 “Estimation” of \(\gamma\)
The random coefficients \(\boldsymbol{\gamma}\) are random variables, not parameters of interest. Nevertheless, sometimes we need an “estimator” \(\boldsymbol{\hat \gamma}\) as an intermediate quantity in our statistical inference. To this end, note that \[\mathbf{Cov}(\boldsymbol{Y}, \boldsymbol{\gamma})=\mathbf{Cov}(\boldsymbol{X\beta}+\boldsymbol{U\gamma}+\boldsymbol{\epsilon}, \boldsymbol{\gamma})=\mathbf{Cov}(\boldsymbol{U\gamma}, \boldsymbol{\gamma})=\boldsymbol{U}\mathcal{G}.\] As a result, \[(\boldsymbol{Y},\boldsymbol{\gamma}) \sim N\left(\begin{pmatrix} \boldsymbol{X\beta}\\ \boldsymbol{0} \end{pmatrix}, \begin{pmatrix} \boldsymbol{V}, &\boldsymbol{U}\mathcal{G}\\ \mathcal{G}\boldsymbol{U}^T, &\mathcal{G} \end{pmatrix}\right).\]
Before proceeding, we first present the following lemma.
Lemma 4.1 If we partition the random vector \(\boldsymbol{X}\sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) into two random vectors \(\boldsymbol{X}_1\) and \(\boldsymbol{X}_2\), and partition its mean vector and covariance matrix in a corresponding manner, i.e., \[\boldsymbol{\mu}=\begin{pmatrix} \boldsymbol{\mu}_1\\ \boldsymbol{\mu}_2 \end{pmatrix}, \quad \boldsymbol{\Sigma}=\begin{pmatrix} \boldsymbol{\Sigma}_{11}, &\boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{21}, &\boldsymbol{\Sigma}_{22} \end{pmatrix}\] such that \(\boldsymbol{X}_1\sim N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})\), \(\boldsymbol{X}_2\sim N(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22})\) and \(\mathbf{Cov}(\boldsymbol{X}_i, \boldsymbol{X}_j)=\boldsymbol{\Sigma}_{ij}\) for \(i,j\in \{1,2\}\).
The conditional distribution of \(\boldsymbol{X}_2\) given known values for \(\boldsymbol{X}_1=\boldsymbol{x}_1\) is multivariate normal \(N(\boldsymbol{\mu}_{\boldsymbol{X}_2|\boldsymbol{x}_1}, \boldsymbol{\Sigma}_{\boldsymbol{X}_2|\boldsymbol{x}_1})\), where \[\begin{aligned} \boldsymbol{\mu}_{\boldsymbol{X}_2|\boldsymbol{x}_1}&=\boldsymbol{\mu}_2+\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}(\boldsymbol{x}_1-\boldsymbol{\mu}_1),\\ \boldsymbol{\Sigma}_{\boldsymbol{X}_2|\boldsymbol{x}_1}&=\boldsymbol{\Sigma}_{22}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}. \end{aligned}\]
Proof. Consider \(\boldsymbol{Z}=\boldsymbol{X}_2-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{X}_1\), note that
\[\begin{aligned} \mbox{Cov}(\boldsymbol{Z}, \boldsymbol{X}_1)=&\mbox{Cov}(\boldsymbol{X}_2, \boldsymbol{X}_1)-\mbox{Cov}\left(\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{X}_1, \boldsymbol{X}_1\right)\\ =&\boldsymbol{\Sigma}_{21}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{11}=\boldsymbol{0}. \end{aligned}\]
Since, for jointly normal random vectors, zero covariance implies independence, \(\boldsymbol{Z}\) and \(\boldsymbol{X}_1\) are independent. Therefore \[\begin{aligned} E(\boldsymbol{X}_2\,|\,\boldsymbol{x}_1)=&E\left(\boldsymbol{Z}+\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{x}_1\,|\,\boldsymbol{x}_1\right)\\ =&E(\boldsymbol{Z}\,|\,\boldsymbol{x}_1)+E\left(\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{x}_1\,|\,\boldsymbol{x}_1\right)\\ =&E(\boldsymbol{Z})+\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{x}_1\\ =&\boldsymbol{\mu}_2+\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}(\boldsymbol{x}_1-\boldsymbol{\mu}_1). \end{aligned}\] Similarly, we have \[\begin{aligned} \mbox{Var}(\boldsymbol{X}_2\,|\,\boldsymbol{x}_1)=&\mbox{Var}\left(\boldsymbol{Z}+\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{x}_1\,|\,\boldsymbol{x}_1\right)\\ =&\mbox{Var}(\boldsymbol{Z}\,|\,\boldsymbol{x}_1)+\mbox{Var}\left(\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{x}_1\,|\,\boldsymbol{x}_1\right)+2\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\mbox{Cov}\left(\boldsymbol{Z}, \boldsymbol{x}_1\,|\,\boldsymbol{x}_1\right)\\ =&\mbox{Var}(\boldsymbol{Z}\,|\,\boldsymbol{x}_1)\\ =&\mbox{Var}(\boldsymbol{Z})\\ =&\boldsymbol{\Sigma}_{22}+\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}-2\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\\ =&\boldsymbol{\Sigma}_{22}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}, \end{aligned}\] which completes the proof.
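As a quick numerical illustration of Lemma 4.1 (a small sketch with an arbitrary bivariate example, not taken from the text), the conditional mean and variance can be computed directly from the partitioned formulas.

```r
# Conditional distribution of X2 given X1 = x1 for a bivariate normal,
# computed from the partitioned formulas in Lemma 4.1.
mu <- c(1, 2)                                    # (mu_1, mu_2)
Sigma <- matrix(c(2, 0.8,
                  0.8, 1), 2, 2)                 # (Sigma_11, Sigma_12; Sigma_21, Sigma_22)
x1 <- 1.5

mu_cond  <- mu[2] + Sigma[2, 1] / Sigma[1, 1] * (x1 - mu[1])       # conditional mean
var_cond <- Sigma[2, 2] - Sigma[2, 1] * Sigma[1, 2] / Sigma[1, 1]  # conditional variance
c(mu_cond, var_cond)
```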
Applying Lemma 4.1 implies that, conditional on \(\boldsymbol{Y}=\boldsymbol{y}\), \[\boldsymbol{\gamma}|\boldsymbol{Y}=\boldsymbol{y}\sim N\left(\boldsymbol{\mu}_{\boldsymbol{\gamma}|\boldsymbol{y}}, \Sigma_{\boldsymbol{\gamma}|\boldsymbol{y}}\right).\] Hence, given the value of \(\boldsymbol{\beta}\), we have \[E(\boldsymbol{\gamma}|\boldsymbol{Y}=\boldsymbol{y})=\boldsymbol{\mu}_{\boldsymbol{\gamma}|\boldsymbol{y}}=\mathcal{G}\boldsymbol{U}^T\boldsymbol{V}^{-1}(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}).\]
This suggests the following estimator: \[ \hat{\boldsymbol{\gamma}}=\mathcal{G}\boldsymbol{U}^T\boldsymbol{V}^{-1}(\boldsymbol{y}-\boldsymbol{X}{\boldsymbol{\beta}}). \tag{4.7}\] When \(\boldsymbol{\beta}\) is not given, we can replace \(\boldsymbol{\beta}\) with its estimator \(\boldsymbol{\hat \beta}\) derived in Equation 4.6. It is straightforward to see that \(E(\hat{\boldsymbol{\gamma}}\,|\,\boldsymbol{Y}=\boldsymbol{y})=E(\boldsymbol{\gamma}\,|\,\boldsymbol{Y}=\boldsymbol{y})\). \(\hat{\boldsymbol{\gamma}}\) is sometimes referred to as the maximum a posteriori (MAP) estimate, or the predicted random effects.
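Putting Equation 4.6 and Equation 4.7 together, the following sketch simulates from a random-intercept-and-slope model and evaluates both estimators with the variance components assumed known (all dimensions and values are arbitrary illustrations).

```r
# Estimate beta (Equation 4.6) and "estimate" gamma (Equation 4.7) with known
# variance components, for a random-intercept-and-slope model.
set.seed(4)
m <- 10; n_i <- 5; q <- 2; sigma2 <- 0.25
D <- diag(q)                                     # covariance of each gamma_i
beta_true <- c(1, 2)

X_list <- lapply(1:m, function(i) cbind(1, rnorm(n_i)))
X <- do.call(rbind, X_list)
n <- nrow(X)

# Block-diagonal U with U_i = X_i, and G = I_{2m}
U <- matrix(0, n, m * q)
for (i in 1:m) {
  U[((i - 1) * n_i + 1):(i * n_i), ((i - 1) * q + 1):(i * q)] <- X_list[[i]]
}
G <- kronecker(diag(m), D)

gamma_true <- rnorm(m * q)                       # gamma ~ N(0, I_{2m})
y <- as.vector(X %*% beta_true + U %*% gamma_true + rnorm(n, 0, sqrt(sigma2)))

V <- U %*% G %*% t(U) + sigma2 * diag(n)
beta_hat  <- solve(t(X) %*% solve(V, X), t(X) %*% solve(V, y))    # Equation 4.6
gamma_hat <- G %*% t(U) %*% solve(V, y - X %*% beta_hat)          # Equation 4.7
head(cbind(gamma_true, gamma_hat))               # compare true and predicted random effects
```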
In the exercise class we will prove that the above estimators \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{\gamma}}\) are also the joint maximisers of the log-likelihood of \((\boldsymbol{Y}^T,\boldsymbol{\gamma}^T)^T\) with respect to \(\boldsymbol{\beta}\) and \(\boldsymbol{\gamma}\), meaning that they are MLEs (given knowledge of \(\boldsymbol{\theta}\)).
Example 4.4 Consider the following linear mixed model: \[Y_{ij}=\beta_1+\beta_2x_{ij}+\gamma_{1i}+\gamma_{2i}x_{ij}+\epsilon_{ij}, \quad i=1,\dots, m; j=1,\dots, n_i,\] where \(\epsilon_{ij}\sim N(0,\sigma^2)\) are random errors and \((\gamma_{1i}, \gamma_{2i})^T\sim N(\boldsymbol{0}, \boldsymbol{I}_2)\) are two random effects.
In this example, we have \(\mathcal{G}=\boldsymbol{I}_{2m}\), and \[\boldsymbol{U}_i=\boldsymbol{X}_i=\begin{pmatrix} 1 &x_{i1}\\ \vdots &\vdots\\ 1 &x_{in_i} \end{pmatrix}, \quad \boldsymbol{U}=\begin{pmatrix}\boldsymbol{X}_1,& \boldsymbol{0}_{n_1\times 2}, &\cdots, &\boldsymbol{0}_{n_1\times 2}\\ \boldsymbol{0}_{n_2\times 2},& \boldsymbol{X}_2,\\ \vdots && \ddots\\ \boldsymbol{0}_{n_m\times 2}, &&&\boldsymbol{X}_m \end{pmatrix}.\]
Therefore, we have \[\boldsymbol{V}=\boldsymbol{U} \mathcal{G}\boldsymbol{U}^T+\sigma^2\boldsymbol{I}_{n}=\begin{pmatrix}\boldsymbol{X_1}\boldsymbol{X_1}^T+\sigma^2\boldsymbol{I}_{n_1},& \boldsymbol{0}_{n_1\times n_2}, &\cdots, &\boldsymbol{0}_{n_1\times n_m}\\ \boldsymbol{0}_{n_2\times n_1},& \boldsymbol{X_2}\boldsymbol{X_2}^T+\sigma^2\boldsymbol{I}_{n_2},\\ \vdots && \ddots\\ \boldsymbol{0}_{n_m\times n_1}, &&&\boldsymbol{X_m}\boldsymbol{X_m}^T+\sigma^2\boldsymbol{I}_{n_m} \end{pmatrix}\]
Combining this with Equation 4.6 and Equation 4.7 yields the estimators \[\hat{\boldsymbol{\beta}}=(\hat{\beta}_1, \hat{\beta}_2)^T=\left[\sum_{i=1}^m \boldsymbol{X}_i^T(\boldsymbol{X}_i\boldsymbol{X}_i^T+\sigma^2 \boldsymbol{I}_{n_i})^{-1}\boldsymbol{X}_i\right]^{-1}\sum_{i=1}^m \boldsymbol{X}_i^T(\boldsymbol{X}_i\boldsymbol{X}_i^T+\sigma^2 \boldsymbol{I}_{n_i})^{-1}\boldsymbol{y}_i\] and \[\hat{\boldsymbol{\gamma}}_i=(\hat{\gamma}_{i1},\hat{\gamma}_{i2})^T=\boldsymbol{X}_i^T(\boldsymbol{X}_i\boldsymbol{X}_i^T+\sigma^2\boldsymbol{I}_{n_i})^{-1}(\boldsymbol{y}_i-\boldsymbol{X}_i\hat{\boldsymbol{\beta}}).\]
4.3 LMMs parameter estimation II
When \(\boldsymbol{\theta}\) is not known, we have to estimate it together with \(\boldsymbol{\beta}\) using maximum likelihood and related estimation methods. In this section we re-introduce the subscript \(\boldsymbol{\theta}\) on the covariance matrices.
4.3.1 The log-likelihood function
The likelihood for parameters \(\boldsymbol{\beta}\) and \(\boldsymbol{\theta}\) of the linear mixed model is in principle based on the joint probability density function of \(\boldsymbol{Y}=(\boldsymbol{Y}^T_1,\dots, \boldsymbol{Y}_m^T)^T\), i.e., \(f_{\boldsymbol{Y}}(\boldsymbol{y}; \boldsymbol{\beta}, \boldsymbol{\theta}).\) Since \(\boldsymbol{Y}\sim N(\boldsymbol{X\beta}, \boldsymbol{V}_{\boldsymbol{\theta}})\), its joint probability density function is \[({2\pi})^{-n/2} |\boldsymbol{V}_{\boldsymbol{\theta}}|^{-1/2} \exp\left[-\dfrac{(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta} )^T \boldsymbol{V}_{\boldsymbol{\theta}}^{-1}(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta} )}{2}\right].\] This likelihood expression looks simple; however, it requires the inverse (and determinant) of the \(n \times n\) matrix \(\boldsymbol{V}_{\boldsymbol{\theta}}\). The computational complexity (cost) is \(O(n^{3})\), which is very expensive. In practice we usually use an alternative approach.
From the standard property of conditional densities, we have \[f(\boldsymbol{y},\boldsymbol{\gamma}; \boldsymbol{\beta}, \boldsymbol{\theta})=f(\boldsymbol{y}|\boldsymbol{\gamma}; \boldsymbol{\beta}, \boldsymbol{\theta}) f(\boldsymbol{\gamma}; \boldsymbol{\beta}, \boldsymbol{\theta}).\]
Noting that \(\boldsymbol{Y}|\boldsymbol{\gamma}\sim N(\boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{U} \boldsymbol{\gamma}, \sigma^2 \boldsymbol{I}_n)\), we have \[ f(\boldsymbol{y}|\boldsymbol{\gamma}; \boldsymbol{\beta}, \boldsymbol{\theta})=\left({2\pi\sigma^2}\right)^{-n/2} \exp\left[-\dfrac{(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}- \boldsymbol{U} \boldsymbol{\gamma} )^T (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}- \boldsymbol{U} \boldsymbol{\gamma} )}{2\sigma^2}\right]. \]
Moreover, as \(\boldsymbol{\gamma}\sim N(0, \mathcal{G}_{\boldsymbol{\theta}})\), we have \[ f(\boldsymbol{\gamma}; \boldsymbol{\beta}, \boldsymbol{\theta})=(2\pi)^{-mq/2}|\mathcal{G}_{\boldsymbol{\theta}}|^{-1/2}\exp\left(-\dfrac{\boldsymbol{\gamma}^T \mathcal{G}_{\boldsymbol{\theta}}^{-1}\boldsymbol{\gamma}}{2}\right). \]
As a result consider evaluating the likelihood \(f(\boldsymbol{y}; \boldsymbol{\beta}, \boldsymbol{\theta})\) by integrating out \(\boldsymbol{\gamma}\) in \(f(\boldsymbol{y}, \boldsymbol{\gamma}; \boldsymbol{\beta}, \boldsymbol{\theta})\): \[\begin{aligned} L( \boldsymbol{\beta}, \boldsymbol{\theta})=f_{\boldsymbol{Y}}(\boldsymbol{y}; \boldsymbol{\beta}, \boldsymbol{\theta})=\int f(\boldsymbol{y}, \boldsymbol{\gamma}; \boldsymbol{\beta}, \boldsymbol{\theta}) d\boldsymbol{\gamma} = \int \exp \left[\log f(\boldsymbol{y}, \boldsymbol{\gamma}; \boldsymbol{\beta}, \boldsymbol{\theta})\right] d\boldsymbol{\gamma}. \end{aligned}\]
Here we take an additional log and exponential in order to apply the following trick: consider the Taylor expansion of \(\log f(\boldsymbol{y}, \boldsymbol{\gamma}; \boldsymbol{\beta}, \boldsymbol{\theta})\) around \(\boldsymbol{\hat \gamma}\), where \(\boldsymbol{\hat \gamma}\) is the estimator obtained in Equation 4.7, which is also the maximiser of \(f(\boldsymbol{y},\boldsymbol{\gamma}; \boldsymbol{\beta}, \boldsymbol{\theta})\) as proved in the exercise class; the first-order term therefore vanishes. Hence we have \[\begin{aligned} \log f(\boldsymbol{y}, \boldsymbol{\gamma}; \boldsymbol{\beta}, \boldsymbol{\theta})= & \log f(\boldsymbol{y}, \boldsymbol{\hat \gamma}; \boldsymbol{\beta}, \boldsymbol{\theta}) + \dfrac{1}{2}(\boldsymbol{\gamma}- \boldsymbol{\hat \gamma})^T \dfrac{\partial^2 \log f(\boldsymbol{y}, \boldsymbol{\gamma}; \boldsymbol{\beta}, \boldsymbol{\theta}) }{\partial \boldsymbol{\gamma} \partial \boldsymbol{\gamma}^T} \big|_{\boldsymbol{\hat\gamma}} (\boldsymbol{\gamma}- \boldsymbol{\hat \gamma})\\ = & \log f(\boldsymbol{y}, \boldsymbol{\hat \gamma}; \boldsymbol{\beta}, \boldsymbol{\theta}) - \dfrac{1}{2}(\boldsymbol{\gamma}- \boldsymbol{\hat \gamma})^T \left(\dfrac{U^TU}{\sigma^2}+ \mathcal{G}_{\boldsymbol{\theta}}^{-1}\right) (\boldsymbol{\gamma}- \boldsymbol{\hat \gamma}). \end{aligned}\]
There are no further remainder terms because the higher order derivatives of \(\log f(\boldsymbol{y}, \boldsymbol{\gamma}; \boldsymbol{\beta}, \boldsymbol{\theta})\) with respect to \(\boldsymbol{\gamma}\) are exactly zero since it is polynomial of order 2. Hence, \[ \begin{aligned} L( \boldsymbol{\beta}, \boldsymbol{\theta})=&\int \exp\left[\log f(\boldsymbol{y}, \boldsymbol{\hat \gamma}; \boldsymbol{\beta}, \boldsymbol{\theta}) - \dfrac{1}{2}(\boldsymbol{\gamma}- \boldsymbol{\hat \gamma})^T \left(\dfrac{U^TU}{\sigma^2}+ \mathcal{G}_{\boldsymbol{\theta}}^{-1}\right) (\boldsymbol{\gamma}- \boldsymbol{\hat \gamma})\right]d\boldsymbol{\gamma}\\ =& f(\boldsymbol{y}, \boldsymbol{\hat \gamma}; \boldsymbol{\beta}, \boldsymbol{\theta}) \int \exp\left[- \dfrac{1}{2}(\boldsymbol{\gamma}- \boldsymbol{\hat \gamma})^T \left(\dfrac{U^TU}{\sigma^2}+ \mathcal{G}_{\boldsymbol{\theta}}^{-1}\right) (\boldsymbol{\gamma}- \boldsymbol{\hat \gamma})\right]d\boldsymbol{\gamma}\\ =& f(\boldsymbol{y}|\hat{\boldsymbol{\gamma}}; \boldsymbol{\beta}, \boldsymbol{\theta}) f(\hat{\boldsymbol{\gamma}}; \boldsymbol{\beta}, \boldsymbol{\theta}) \int \exp\left[- \dfrac{1}{2}(\boldsymbol{\gamma}- \boldsymbol{\hat \gamma})^T \left(\dfrac{U^TU}{\sigma^2}+ \mathcal{G}_{\boldsymbol{\theta}}^{-1}\right) (\boldsymbol{\gamma}- \boldsymbol{\hat \gamma})\right]d\boldsymbol{\gamma}. \end{aligned} \]
Consider \[(2\pi)^{-mq/2}\bigg|\dfrac{U^TU}{\sigma^2}+ \mathcal{G}_{\boldsymbol{\theta}}^{-1}\bigg|^{1/2}\exp\left[- \dfrac{1}{2}(\boldsymbol{\gamma}- \boldsymbol{\hat \gamma})^T \left(\dfrac{U^TU}{\sigma^2}+ \mathcal{G}_{\boldsymbol{\theta}}^{-1}\right) (\boldsymbol{\gamma}- \boldsymbol{\hat \gamma})\right],\]
which is the probability density function of \(N\big(\boldsymbol{\hat \gamma}, \left({U^TU}/{\sigma^2}+\mathcal{G}_{\boldsymbol{\theta}}^{-1}\right)^{-1}\big)\), so it must integrate to \(1\).
This implies \[\begin{aligned} \int \exp\left[- \dfrac{1}{2}(\boldsymbol{\gamma}- \boldsymbol{\hat \gamma})^T \left(\dfrac{U^TU}{\sigma^2}+ \mathcal{G}_{\boldsymbol{\theta}}^{-1}\right) (\boldsymbol{\gamma}- \boldsymbol{\hat \gamma})\right]d\boldsymbol{\gamma}\nonumber\\ =\dfrac{(2\pi)^{mq/2}}{\bigg|\dfrac{U^TU}{\sigma^2}+ \mathcal{G}_{\boldsymbol{\theta}}^{-1}\bigg|^{1/2}}. \end{aligned} \tag{4.8}\]
Combining the formulas of \(f(\boldsymbol{y}|\boldsymbol{\gamma}; \boldsymbol{\beta}, \boldsymbol{\theta})\) and \(f(\boldsymbol{\gamma}; \boldsymbol{\beta}, \boldsymbol{\theta})\) with Equation 4.8, we finally have \[\begin{aligned} L( \boldsymbol{\beta}, \boldsymbol{\theta})=& ({2\pi\sigma^2})^{-n/2} |\mathcal{G}_{\boldsymbol{\theta}}|^{-1/2} \bigg|\dfrac{U^TU}{\sigma^2}+ \mathcal{G}_{\boldsymbol{\theta}}^{-1}\bigg|^{-1/2}\cdot \exp\left(-\dfrac{\boldsymbol{\hat\gamma}^T \mathcal{G}_{\boldsymbol{\theta}}^{-1}\boldsymbol{\hat\gamma}}{2}\right) \\ &\cdot \exp\left[-\dfrac{(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}- \boldsymbol{U} \hat{\boldsymbol{\gamma}} )^T (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}- \boldsymbol{U} \hat{\boldsymbol{\gamma}} )}{2\sigma^2}\right]. \end{aligned}\]
This likelihood function only involves inverses and determinants of \(mq \times mq\) matrices, namely \(\mathcal{G}_{\boldsymbol{\theta}}\) and \({U^TU}/{\sigma^2}+ \mathcal{G}_{\boldsymbol{\theta}}^{-1}\); both are block-diagonal with \(m\) blocks of size \(q\times q\), so the computational cost is much smaller.
Therefore, taking the log, we obtain the log-likelihood \[ \begin{aligned} \ell(\boldsymbol{\beta}, \boldsymbol{\theta})=&-\dfrac{(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}- \boldsymbol{U} \boldsymbol{\hat\gamma} )^T (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}- \boldsymbol{U} \boldsymbol{\hat \gamma} )}{2\sigma^2}-\dfrac{\boldsymbol{\hat\gamma}^T\mathcal{G}_{\boldsymbol{\theta}}^{-1}\boldsymbol{\hat\gamma}}{2} \nonumber\\ &-\dfrac{\log|\mathcal{G}_{\boldsymbol{\theta}}|}{2}-\dfrac{1}{2}\log \bigg|\dfrac{U^TU}{\sigma^2}+ \mathcal{G}_{\boldsymbol{\theta}}^{-1}\bigg|-\dfrac{n}{2}\log({2\pi\sigma^2}). \end{aligned} \tag{4.9}\]
4.3.2 Profile likelihood method
Unlike the MLE in the linear model, we do not have a closed-form solution for \(\boldsymbol{\hat \beta}\) when \(\boldsymbol{\theta}\) is unknown. To solve the optimisation problem of Equation 4.9 with respect to \(\boldsymbol{\beta}\) and \(\boldsymbol{\theta}\), we can update their values iteratively and seek the MLE by numerical methods.
For this specific problem, the usual approach is first to maximise the log-likelihood \(\ell(\boldsymbol{\beta}, \boldsymbol{\theta})\) with respect to \(\boldsymbol{\beta}\) for a given value of \(\boldsymbol{\theta}\). The optimised solution \(\boldsymbol{\hat\beta}(\boldsymbol{\theta})\) is obtained as a function of \(\boldsymbol{\theta}\). We can then substitute this solution into the log-likelihood function, giving \[\ell_p(\boldsymbol{\theta})=\ell(\boldsymbol{\hat\beta}(\boldsymbol{\theta}), \boldsymbol{\theta}),\] which is a function of \(\boldsymbol{\theta}\) only, and we then just need to optimise it with respect to \(\boldsymbol{\theta}\). Once the maximiser \(\boldsymbol{\hat \theta}_p\) is found, we estimate \(\boldsymbol{\beta}\) by \(\boldsymbol{\hat\beta}_p=\boldsymbol{\hat\beta}(\boldsymbol{\hat \theta}_p)\).
This simple method is called the profile likelihood method, as we have profiled out \(\boldsymbol{\beta}\) from the log-likelihood function.
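As a sketch of this idea for the random-intercept model of Example 4.3 (with \(\boldsymbol{\theta}=(\sigma_\gamma^2, \sigma_\epsilon^2)\); the simulated data and starting values are arbitrary, and for brevity the direct marginal form of the log-likelihood is evaluated rather than the cheaper expression in Equation 4.9), we can profile out \(\boldsymbol{\beta}\) via Equation 4.6 and maximise \(\ell_p(\boldsymbol{\theta})\) numerically with `optim`.

```r
# Profile likelihood for the random-intercept model of Example 4.3:
# for each theta = (sigma_gamma^2, sigma_eps^2), beta is profiled out by GLS,
# and the resulting l_p(theta) is maximised numerically.
set.seed(5)
m <- 50; n_i <- 5
group <- rep(1:m, each = n_i)
y <- 1 + rnorm(m, 0, 2)[group] + rnorm(m * n_i, 0, 1)   # beta_0 = 1, sigma_gamma = 2, sigma_eps = 1
X <- matrix(1, length(y), 1)                             # intercept-only fixed effect
Z <- outer(group, 1:m, "==") * 1                         # n x m group-indicator matrix (the U matrix)

neg_profile_loglik <- function(log_theta) {
  sg2 <- exp(log_theta[1]); se2 <- exp(log_theta[2])     # log scale keeps the variances positive
  V <- sg2 * Z %*% t(Z) + se2 * diag(length(y))          # marginal covariance of Y
  beta_hat <- solve(t(X) %*% solve(V, X), t(X) %*% solve(V, y))  # profiled-out beta (Equation 4.6)
  r <- y - X %*% beta_hat
  ld <- as.numeric(determinant(V, logarithm = TRUE)$modulus)     # log|V|
  quad <- as.numeric(t(r) %*% solve(V, r))
  0.5 * (ld + quad + length(y) * log(2 * pi))            # negative log-likelihood
}

fit <- optim(c(0, 0), neg_profile_loglik)
exp(fit$par)                                             # estimates of (sigma_gamma^2, sigma_eps^2)
```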
4.3.3 Restricted maximum likelihood method
Recall the MLE of \(\sigma^2\) in the linear model, which is \[\hat \sigma^2 = \frac{1}{n} \sum_{i=1}^n \left(y_i-\boldsymbol{x}_i^T\hat{\boldsymbol{\beta}}\right)^2.\] We know this is a biased estimator, as \(E(\hat \sigma^2)=\dfrac{n-p}{n}\sigma^2\), and in practice we often use the unbiased alternative \[\tilde \sigma^2={1\over {n-p}}\sum_{i=1}^n \left(y_i-\boldsymbol{x}_i^T\hat{\boldsymbol{\beta}}\right)^2.\]
This is a typical problem with maximum likelihood estimation of variance components, and it gets worse in LMMs as the number of fixed effects increases. Restricted maximum likelihood (REML) has therefore been proposed to alleviate the problem. The idea is to include only the variance components in the REML criterion, while \(\boldsymbol{\beta}\), which parameterises the fixed effect terms in the LMM, is estimated in a second step.
To this end, we can think of treating \(\boldsymbol{\beta}\) as a vector of random coefficients as well. This is similar to the Bayesian framework, where we can give a prior distribution to the parameters. Here we simply apply the improper flat prior \(f(\boldsymbol{\beta})=1\) for all \(\boldsymbol{\beta}\in\mathbb{R}^{p}\). As a result, \[f(\boldsymbol{y};\boldsymbol{\theta})=\int f(\boldsymbol{y}, \boldsymbol{\beta}; \boldsymbol{\theta}) d \boldsymbol{\beta}= \int f(\boldsymbol{y}|\boldsymbol{\beta}; \boldsymbol{\theta})f(\boldsymbol{\beta}) d \boldsymbol{\beta}= \int f(\boldsymbol{y}; \boldsymbol{\beta}, \boldsymbol{\theta}) d \boldsymbol{\beta}.\] Using exactly the same techniques as in Section 4.3.1, we can obtain the restricted log-likelihood function \(\ell_r(\boldsymbol{\theta})\); details are omitted here. The resulting maximiser is denoted \(\boldsymbol{\hat \theta}_r\), and we can then calculate \(\boldsymbol{\hat \beta}_r\) based on the value of \(\boldsymbol{\hat \theta}_r\).
REML accounts for the degrees of freedom lost by estimating the fixed effects, and results in less biased estimates of the random-effect variances. The estimates of \(\boldsymbol{\theta}\) are invariant to the value of \(\boldsymbol{\beta}\) and less sensitive to outliers in the data compared with the MLE.
In practice, we can use numerical methods such as Newton-Raphson or an expectation-maximisation (EM) algorithm to maximise the profile likelihood or the restricted likelihood, and hence obtain \(\hat{\boldsymbol{\theta}}\), \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{\gamma}}\). There are R packages ready to use for LMMs, such as `lme4`, which we will illustrate in the computer lab.
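For instance, a random-intercept model could be fitted with `lme4` along the following lines (the data frame and variable names here are hypothetical, simulated only to make the sketch self-contained); setting `REML = FALSE` would switch from REML to ordinary maximum likelihood.

```r
library(lme4)

# Hypothetical simulated data: 20 classes, 15 pupils each, random intercept per class
set.seed(6)
df <- data.frame(class = factor(rep(1:20, each = 15)), x = rnorm(300))
df$score <- 50 + 3 * df$x + rnorm(20, 0, 4)[as.integer(df$class)] + rnorm(300, 0, 5)

fit <- lmer(score ~ x + (1 | class), data = df, REML = TRUE)   # random intercept per class

fixef(fit)                     # estimated fixed effects beta-hat
ranef(fit)                     # predicted (MAP) random effects gamma-hat_i
VarCorr(fit)                   # estimated variance components theta
confint(fit, method = "Wald")  # approximate Wald confidence intervals for the fixed effects
```

The reported random-effect standard deviations correspond to the elements of \(\boldsymbol{\theta}\), and `ranef(fit)` returns the predicted \(\hat{\boldsymbol{\gamma}}_i\) discussed in Section 4.2.2.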
4.4 Statistical Inference of LMMs
4.4.1 Confidence intervals
First, let us consider the case where \(\boldsymbol{\theta}\) is known, hence \(\boldsymbol{\mathcal{D}}\), \(\sigma^2\) and the covariance matrix \(\boldsymbol{V}\) are fixed and known. Since \(\boldsymbol{Y}\sim N(\boldsymbol{X\beta}, \boldsymbol{V})\), the estimator \(\hat{\boldsymbol{\beta}}=(\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{Y}\) is normally distributed with mean \(\boldsymbol{\beta}\) and covariance given by \[ \begin{aligned} \mbox{Cov}(\hat{\boldsymbol{\beta}})=&(\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{V}^{-1} \mbox{Cov}(\boldsymbol{Y}) \boldsymbol{V}^{-1} \boldsymbol{X} (\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}\\ =&(\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}. \end{aligned} \]
As a result, the \(j\)-th diagonal element of the above matrix is \(\sigma_j^2=(\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}_{jj}\), which equals \(\mbox{Var}(\hat{\beta}_j)\) and leads to \[ \hat{\beta}_j\pm z_{1-\frac{\alpha}{2}}\sqrt{(\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}_{jj}} \tag{4.10}\] as the \(100(1 - \alpha)\%\) confidence interval for \(\beta_j\).
If \(\boldsymbol{\theta}\) is not known, we can use one of its estimates, e.g. \(\boldsymbol{\hat \theta}_p\) or \(\boldsymbol{\hat \theta}_r\), instead, which leads to an approximation to the covariance matrix, \(\boldsymbol{V}(\boldsymbol{\hat \theta})\), and to the fixed effect coefficients, \(\boldsymbol{\hat \beta}(\boldsymbol{\hat \theta})=(\boldsymbol{X}^T\boldsymbol{V(\boldsymbol{\hat \theta})}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{V(\boldsymbol{\hat \theta})}^{-1}\boldsymbol{Y}\). Therefore, Equation 4.10 becomes \[ \hat{\beta}_j(\boldsymbol{\hat \theta})\pm z_{1-\frac{\alpha}{2}}\sqrt{(\boldsymbol{X}^T\boldsymbol{V(\boldsymbol{\hat \theta})}^{-1}\boldsymbol{X})^{-1}_{jj}}, \tag{4.11}\] which gives an approximate \(100(1-\alpha)\%\) confidence interval for \(\beta_j\).
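A minimal sketch of the interval in Equation 4.10 computed by hand in R, assuming \(\boldsymbol{V}\) is known (the simulated data and the 95% level are arbitrary choices).

```r
# 95% confidence intervals for beta_j (Equation 4.10) with known V.
set.seed(7)
n <- 40; alpha <- 0.05
X <- cbind(1, rnorm(n))
A <- matrix(rnorm(n * n), n, n); V <- A %*% t(A) + diag(n)    # an arbitrary known V
y <- as.vector(X %*% c(1, 2) + t(chol(V)) %*% rnorm(n))

XtVinvX <- t(X) %*% solve(V, X)
beta_hat <- solve(XtVinvX, t(X) %*% solve(V, y))
se <- sqrt(diag(solve(XtVinvX)))                              # sqrt of (X^T V^-1 X)^{-1}_{jj}
z <- qnorm(1 - alpha / 2)
cbind(lower = beta_hat - z * se, upper = beta_hat + z * se)   # one interval per coefficient
```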
We expect that \((\boldsymbol{X}^T\boldsymbol{V(\boldsymbol{\hat \theta})}^{-1}\boldsymbol{X})^{-1}_{jj}\) will underestimate \(\mbox{Var}(\hat{\beta}_j)\), since the variation in \(\boldsymbol{\hat \theta}\) is not taken into account. A more sophisticated way to handle this is to use Bayesian analysis, such as MCMC, to approximate these confidence intervals.
Confidence intervals for \(\boldsymbol{\theta}\) rely on the large-sample theory that \[\boldsymbol{\hat \theta}_p \sim N(\boldsymbol{\theta}, \hat I_p ^{-1})\quad \mbox{and} \quad \boldsymbol{\hat \theta}_r \sim N(\boldsymbol{\theta}, \hat I_r ^{-1}), \quad \mbox{approximately, as } n\rightarrow \infty,\] where \(\hat I_p\) and \(\hat I_r\) are the observed information matrices of the log profile likelihood and the log restricted likelihood, respectively, i.e., \[\hat I_p =-\dfrac{\partial^2\ell_p(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\partial \boldsymbol{\theta}^T}\bigg|_{\boldsymbol{\hat \theta}_p} \quad\mbox{and} \quad \hat I_r =-\dfrac{\partial^2\ell_r(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\partial \boldsymbol{\theta}^T}\bigg|_{\boldsymbol{\hat \theta}_r}.\] We can build approximate confidence intervals for \(\boldsymbol{\theta}\) based on this result.
4.4.2 Hypothesis testing
For hypothesis testing, suppose we want to perform a test on the fixed effects \(\boldsymbol{\beta}\), such as \[H_0: \boldsymbol{C}\boldsymbol{\beta}= \boldsymbol{c} \quad\mbox{against }\quad H_1: \boldsymbol{C}\boldsymbol{\beta}\neq \boldsymbol{c},\] where \(\boldsymbol{C}\) is an \(r\times p\) constant matrix with rank \(r\), and \(\boldsymbol{c}\) is an \(r\)-dimensional vector.
Since \(\hat{\boldsymbol{\beta}}\sim N\left(\boldsymbol{\beta}, (\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}\right),\) under the null hypothesis we have that \[\boldsymbol{C}\hat{\boldsymbol{\beta}}-\boldsymbol{c}\sim N\left(\boldsymbol{0},\boldsymbol{C}(\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}\boldsymbol{C}^T\right).\] Hence \[\left(\boldsymbol{C}(\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}\boldsymbol{C}^T\right)^{-1/2}\left(\boldsymbol{C}\hat{\boldsymbol{\beta}}-\boldsymbol{c}\right)\sim N(\boldsymbol{0}, \boldsymbol{I}_r).\] This implies \[W=\left(\boldsymbol{C}\hat{\boldsymbol{\beta}}-\boldsymbol{c}\right)^T \left[\boldsymbol{C}(\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}\boldsymbol{C}^T\right]^{-1}\left(\boldsymbol{C}\hat{\boldsymbol{\beta}}-\boldsymbol{c}\right) \sim \chi^2_{r}.\] If \(H_1\) is true, \(\left(\boldsymbol{C}(\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}\boldsymbol{C}^T\right)^{-1/2}\left(\boldsymbol{C}\hat{\boldsymbol{\beta}}-\boldsymbol{c}\right)\sim N\left(\boldsymbol{\delta}, \boldsymbol{I}_r\right)\) with \(\boldsymbol{\delta}=\left(\boldsymbol{C}(\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}\boldsymbol{C}^T\right)^{-1/2}(\boldsymbol{C}\boldsymbol{\beta}-\boldsymbol{c})\), so the distribution of \(W\) shifts to the right by the non-centrality parameter \(\boldsymbol{\delta}^T\boldsymbol{\delta}=(\boldsymbol{C}\boldsymbol{\beta}-\boldsymbol{c})^T\left[\boldsymbol{C}(\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}\boldsymbol{C}^T\right]^{-1}(\boldsymbol{C}\boldsymbol{\beta}-\boldsymbol{c})\). 1
Therefore, we can employ \(W\) as the test statistic for \(H_0\) against \(H_1\). This is the so-called Wald test. We reject \(H_0\) if \(W>\chi^2_{r,1-\alpha}\), where \(\chi^2_{r,1-\alpha}\) is the \(1-\alpha\) quantile of the \(\chi^2_{r}\) distribution.
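A sketch of the Wald test in R for a known \(\boldsymbol{V}\) (the particular \(\boldsymbol{C}\), \(\boldsymbol{c}\) and simulated data below are invented for illustration).

```r
# Wald test of H0: C beta = c with known V.
set.seed(8)
n <- 60; alpha <- 0.05
X <- cbind(1, rnorm(n), rnorm(n))                            # p = 3 fixed effects
A <- matrix(rnorm(n * n), n, n); V <- A %*% t(A) + diag(n)
y <- as.vector(X %*% c(1, 0, 0) + t(chol(V)) %*% rnorm(n))   # beta_2 = beta_3 = 0 holds

C <- rbind(c(0, 1, 0),
           c(0, 0, 1))                                       # tests beta_2 = beta_3 = 0
cvec <- c(0, 0)
r <- nrow(C)

XtVinvX <- t(X) %*% solve(V, X)
beta_hat <- solve(XtVinvX, t(X) %*% solve(V, y))
dvec <- C %*% beta_hat - cvec
W <- as.numeric(t(dvec) %*% solve(C %*% solve(XtVinvX) %*% t(C), dvec))
W > qchisq(1 - alpha, df = r)                                # reject H0 if TRUE
pchisq(W, df = r, lower.tail = FALSE)                        # p-value
```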
Again, if \(\boldsymbol{\theta}\) is unknown, we can use its estimate \(\boldsymbol{\hat \theta}_p\) and replace \(\boldsymbol{V}\) and \(\boldsymbol{\hat \beta}\) in the above expressions with \(\boldsymbol{V}(\boldsymbol{\hat \theta}_p)\) and \(\boldsymbol{\hat \beta}(\boldsymbol{\hat \theta}_p)\), respectively. Note that the REML method cannot be used to compare models with different fixed-effect structures, because \(\ell_r(\boldsymbol{\theta})\) is not comparable between models with different fixed effects.
Example 4.5 Suppose we want to compare two linear mixed models that differ in their fixed effects \(\boldsymbol{\beta}\), i.e., one of the models, \(H_0\), is nested in the other, \(H_1\). Again, we assume model \(H_1\) contains \(p\) linear parameters \(\beta_1, \beta_2,\dots, \beta_p\) and model \(H_0\) contains a subset of \(q < p\) of these, i.e., \(\beta_1, \dots, \beta_q\).
Therefore our hypothesis is: \[H_0: \beta_{q+1}=\beta_{q+2}=\dots=\beta_p=0 \,\mbox{ against }\, H_1: \beta_{q+1},\beta_{q+2},\dots,\beta_p \mbox{ are not all } 0\]
We can simply apply the above Wald test by setting \[\boldsymbol{C}=\begin{pmatrix} 0, &\dots, &0, &1, &0, &\dots,&0\\ 0, &\dots, &0, &0, &1, &\dots,&0\\ \vdots &\vdots &\vdots &\vdots & &\ddots &\\ 0, &\dots, &0, &0, &0, &\dots,&1\\ \end{pmatrix},\] and \(\boldsymbol{c}=\boldsymbol{0}\), with \(r=p-q\).
In this scenario, \(W\) follows a non-central chi-square distribution with \(r\) degrees of freedom and non-centrality parameter \((\boldsymbol{C}\boldsymbol{\beta}-\boldsymbol{c})^T\left[\boldsymbol{C}(\boldsymbol{X}^T\boldsymbol{V}^{-1}\boldsymbol{X})^{-1}\boldsymbol{C}^T\right]^{-1}(\boldsymbol{C}\boldsymbol{\beta}-\boldsymbol{c})\).↩︎