2  Parametric statistical inference

2.1 Introduction

Probability distributions, such as the binomial, Poisson and normal, enable us to calculate probabilities and other quantities of interest (e.g. expectations) for a probability model of a random process. Therefore, given the model, we can make statements about possible outcomes of the process.

Statistical inference is concerned with the inverse problem. Given outcomes of a random process (observed data), what conclusions (inferences) can we draw about the process itself?

We assume that the $n$ observations of the response $y=(y_1,\ldots,y_n)^T$ are observations of random variables $Y=(Y_1,\ldots,Y_n)^T$, which have joint p.d.f. $f_Y$ (joint p.f. for discrete variables). We use the observed data $y$ to make inferences about $f_Y$.

We usually make certain assumptions about $f_Y$. In particular, we often assume that $y_1,\ldots,y_n$ are observations of independent random variables. Hence
$$f_Y(y)=f_{Y_1}(y_1)f_{Y_2}(y_2)\cdots f_{Y_n}(y_n)=\prod_{i=1}^n f_{Y_i}(y_i).$$

In parametric statistical inference, we specify a joint distribution $f_Y$ for $Y$, which is known except for the values of parameters $\theta_1,\theta_2,\ldots,\theta_p$ (sometimes denoted collectively by $\theta$). Then we use the observed data $y$ to make inferences about $\theta_1,\theta_2,\ldots,\theta_p$. In this case, we usually write $f_Y$ as $f_Y(y;\theta)$, to make explicit the dependence on the unknown $\theta$.

2.2 The likelihood function

We often think of the joint density $f_Y(y;\theta)$ as a function of $y$ for fixed $\theta$, which describes the relative probabilities of different possible values of $y$, given a particular set of parameters $\theta$. However, in statistical inference, we have observed $y_1,\ldots,y_n$ (values of $Y_1,\ldots,Y_n$). Knowledge of the probability of alternative possible realisations of $Y$ is largely irrelevant. What we want to know about is $\theta$.

Our only link between the observed data $y_1,\ldots,y_n$ and $\theta$ is through the function $f_Y(y;\theta)$. Therefore, it seems sensible that parametric statistical inference should be based on this function. We can think of $f_Y(y;\theta)$ as a function of $\theta$ for fixed $y$, which describes the relative likelihoods of different possible (sets of) $\theta$, given observed data $y_1,\ldots,y_n$. We write $L(\theta;y)=f_Y(y;\theta)$ for this likelihood, which is a function of the unknown parameter $\theta$. For convenience, we often drop $y$ from the notation and write $L(\theta)$.

The likelihood function is of central importance in parametric statistical inference. It provides a means for comparing different possible values of $\theta$, based on the probabilities (or probability densities) that they assign to the observed data $y_1,\ldots,y_n$.

Notes

  1. Frequently it is more convenient to consider the log-likelihood function $\ell(\theta)=\log L(\theta)$.
  2. Nothing in the definition of the likelihood requires $y_1,\ldots,y_n$ to be observations of independent random variables, although we shall frequently make this assumption.
  3. Any factors which depend on $y_1,\ldots,y_n$ alone (and not on $\theta$) can be ignored when writing down the likelihood. Such factors give no information about the relative likelihoods of different possible values of $\theta$.

Example 2.1 (Bernoulli) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, independent identically distributed (i.i.d.) Bernoulli($p$) random variables. Here $\theta=(p)$ and the likelihood is
$$L(p)=\prod_{i=1}^n p^{y_i}(1-p)^{1-y_i}=p^{\sum_{i=1}^n y_i}(1-p)^{n-\sum_{i=1}^n y_i}.$$
The log-likelihood is
$$\ell(p)=\log L(p)=n\bar y\log p+n(1-\bar y)\log(1-p).$$

Example 2.2 (Normal) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. $N(\mu,\sigma^2)$ random variables. Here $\theta=(\mu,\sigma^2)$ and the likelihood is
$$L(\mu,\sigma^2)=\prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-\mu)^2}{2\sigma^2}\right)=(2\pi\sigma^2)^{-\frac n2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu)^2\right)\propto(\sigma^2)^{-\frac n2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu)^2\right).$$
The log-likelihood is
$$\ell(\mu,\sigma^2)=\log L(\mu,\sigma^2)=-\frac n2\log(2\pi)-\frac n2\log(\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu)^2.$$
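As an informal illustration (a sketch in Python using a small hypothetical data vector; not part of the derivations above), we can evaluate the Bernoulli log-likelihood of Example 2.1 at a few candidate values of $p$ and compare the support the data give to each:

```python
import numpy as np

# Hypothetical Bernoulli data (not from the notes): 10 trials, 7 successes.
y = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])
n, ybar = len(y), y.mean()

def bernoulli_loglik(p):
    # l(p) = n*ybar*log(p) + n*(1 - ybar)*log(1 - p), as in Example 2.1
    return n * ybar * np.log(p) + n * (1 - ybar) * np.log(1 - p)

# Compare the support the observed data give to a few candidate values of p.
for p in (0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f}: log-likelihood = {bernoulli_loglik(p):.3f}")
# The largest value among these occurs at p = 0.7 = ybar, anticipating Example 2.3.
```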

2.3 Maximum likelihood estimation

One of the primary tasks of parametric statistical inference is estimation of the unknown parameters $\theta_1,\ldots,\theta_p$. Consider the value of $\theta$ which maximises the likelihood function. This is the ‘most likely’ value of $\theta$, the one which makes the observed data ‘most probable’. When we are searching for an estimate of $\theta$, this would seem to be a good candidate.

We call the value of $\theta$ which maximises the likelihood $L(\theta)$ the maximum likelihood estimate (MLE) of $\theta$, denoted by $\hat\theta$. The estimate $\hat\theta$ depends on $y$, as different observed data samples lead to different likelihood functions. The corresponding function of $Y$ is called the maximum likelihood estimator and is also denoted by $\hat\theta$.

Note that, as $\theta=(\theta_1,\ldots,\theta_p)^T$, the MLE for any component of $\theta$ is given by the corresponding component of $\hat\theta=(\hat\theta_1,\ldots,\hat\theta_p)^T$. Similarly, the MLE for any function of the parameters $g(\theta)$ is given by $g(\hat\theta)$.

As $\log$ is a strictly increasing function, the value of $\theta$ which maximises $L(\theta)$ also maximises $\ell(\theta)=\log L(\theta)$. It is almost always easier to maximise $\ell(\theta)$. This is achieved in the usual way: finding a stationary point by differentiating $\ell(\theta)$ with respect to $\theta_1,\ldots,\theta_p$ and solving the resulting $p$ simultaneous equations. It should also be checked that the stationary point is a maximum.

Example 2.3 (Bernoulli) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. Bernoulli($p$) random variables. Here $\theta=(p)$ and the log-likelihood is
$$\ell(p)=n\bar y\log p+n(1-\bar y)\log(1-p).$$
Differentiating with respect to $p$,
$$\frac{\partial}{\partial p}\ell(p)=\frac{n\bar y}{p}-\frac{n(1-\bar y)}{1-p},$$
so the MLE $\hat p$ solves
$$\frac{n\bar y}{\hat p}-\frac{n(1-\bar y)}{1-\hat p}=0.$$
Solving this for $\hat p$ gives $\hat p=\bar y$. Note that
$$\frac{\partial^2}{\partial p^2}\ell(p)=-\frac{n\bar y}{p^2}-\frac{n(1-\bar y)}{(1-p)^2}<0$$
everywhere, so the stationary point is clearly a maximum.
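As a quick numerical check (a Python sketch with the same hypothetical data as before; the optimiser and bounds are illustrative choices only), maximising $\ell(p)$ numerically recovers $\hat p=\bar y$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical Bernoulli data: the MLE should equal the sample mean.
y = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])
n, ybar = len(y), y.mean()

def neg_loglik(p):
    # minus the Bernoulli log-likelihood of Example 2.3
    return -(n * ybar * np.log(p) + n * (1 - ybar) * np.log(1 - p))

# Maximise l(p) by minimising -l(p) over (0, 1).
res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, ybar)  # the numerical maximiser agrees with p-hat = ybar
```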

Example 2.4 (Normal) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. $N(\mu,\sigma^2)$ random variables. Here $\theta=(\mu,\sigma^2)$ and the log-likelihood is
$$\ell(\mu,\sigma^2)=-\frac n2\log(2\pi)-\frac n2\log(\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu)^2.$$
Differentiating with respect to $\mu$,
$$\frac{\partial}{\partial\mu}\ell(\mu,\sigma^2)=\frac{1}{\sigma^2}\sum_{i=1}^n(y_i-\mu)=\frac{n(\bar y-\mu)}{\sigma^2},$$
so $(\hat\mu,\hat\sigma^2)$ solve

$$\frac{n(\bar y-\hat\mu)}{\hat\sigma^2}=0. \tag{2.1}$$

Differentiating with respect to $\sigma^2$,
$$\frac{\partial}{\partial\sigma^2}\ell(\mu,\sigma^2)=-\frac{n}{2\sigma^2}+\frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(y_i-\mu)^2,$$
so

$$-\frac{n}{2\hat\sigma^2}+\frac{1}{2(\hat\sigma^2)^2}\sum_{i=1}^n(y_i-\hat\mu)^2=0. \tag{2.2}$$

Solving (2.1) and (2.2), we obtain
$$\hat\mu=\bar y\qquad\text{and}\qquad\hat\sigma^2=\frac1n\sum_{i=1}^n(y_i-\hat\mu)^2=\frac1n\sum_{i=1}^n(y_i-\bar y)^2.$$

Strictly, to show that this stationary point is a maximum, we need to show that the Hessian matrix (the matrix of second derivatives, with elements $[H(\theta)]_{ij}=\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ell(\theta)$) is negative definite at $\theta=\hat\theta$, that is, $a^TH(\hat\theta)a<0$ for every $a\ne0$. Here
$$H(\hat\mu,\hat\sigma^2)=\begin{pmatrix}-\dfrac{n}{\hat\sigma^2}&0\\[4pt]0&-\dfrac{n}{2(\hat\sigma^2)^2}\end{pmatrix},$$
which is clearly negative definite.
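The closed-form results for the normal example can be checked numerically. The sketch below (Python, with hypothetical simulated data; illustrative only) computes $\hat\mu$ and $\hat\sigma^2$ and confirms that the Hessian at the MLE is negative definite by inspecting its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical normal sample (mu = 2, sigma^2 = 4 chosen only for illustration).
y = rng.normal(loc=2.0, scale=2.0, size=50)
n = len(y)

mu_hat = y.mean()                        # mu-hat = ybar
sigma2_hat = np.mean((y - mu_hat)**2)    # sigma^2-hat = (1/n) * sum (y_i - ybar)^2

# Hessian of the log-likelihood evaluated at the MLE (Example 2.4).
H = np.array([
    [-n / sigma2_hat, 0.0],
    [0.0, -n / (2 * sigma2_hat**2)],
])
print(mu_hat, sigma2_hat)
print(np.linalg.eigvalsh(H))  # both eigenvalues negative: H is negative definite
```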

2.4 Score

Let $u_i(\theta)\equiv\frac{\partial}{\partial\theta_i}\ell(\theta)$, $i=1,\ldots,p$, and $u(\theta)\equiv[u_1(\theta),\ldots,u_p(\theta)]^T$. Then we call $u(\theta)$ the vector of scores or score vector. Where $p=1$ and $\theta=(\theta)$, the score is the scalar defined as $u(\theta)\equiv\frac{\partial}{\partial\theta}\ell(\theta)$. The maximum likelihood estimate $\hat\theta$ satisfies $u(\hat\theta)=0$, that is, $u_i(\hat\theta)=0$, $i=1,\ldots,p$. Note that $u(\theta)$ is a function of $\theta$ for fixed (observed) $y$. However, if we replace $y_1,\ldots,y_n$ in $u(\theta)$ by the corresponding random variables $Y_1,\ldots,Y_n$, then we obtain a vector of random variables $U(\theta)\equiv[U_1(\theta),\ldots,U_p(\theta)]^T$.

An important result in likelihood theory is that the expected score at the true (but unknown) value of $\theta$ is zero:

Theorem 2.1 We have $E[U(\theta)]=0$, i.e. $E[U_i(\theta)]=0$, $i=1,\ldots,p$, provided that

  1. The expectation exists.
  2. The sample space for $Y$ does not depend on $\theta$.

Proof. Our proof is for continuous $y$ – in the discrete case, replace the integrals by sums. For each $i=1,\ldots,p$,
$$\begin{aligned}
E[U_i(\theta)]&=\int u_i(\theta)\,f_Y(y;\theta)\,dy
=\int \frac{\partial}{\partial\theta_i}\ell(\theta)\,f_Y(y;\theta)\,dy
=\int \frac{\partial}{\partial\theta_i}\log f_Y(y;\theta)\,f_Y(y;\theta)\,dy\\
&=\int \frac{\frac{\partial}{\partial\theta_i}f_Y(y;\theta)}{f_Y(y;\theta)}\,f_Y(y;\theta)\,dy
=\int \frac{\partial}{\partial\theta_i}f_Y(y;\theta)\,dy
=\frac{\partial}{\partial\theta_i}\int f_Y(y;\theta)\,dy
=\frac{\partial}{\partial\theta_i}1=0,
\end{aligned}$$
as required.

Example 2.5 (Bernoulli) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. Bernoulli($p$) random variables. Here $\theta=(p)$ and
$$u(p)=\frac{n\bar y}{p}-\frac{n(1-\bar y)}{1-p}.$$
Since $E[U(p)]=0$, we must have $E[\bar Y]=p$ (which we already know is correct).

Example 2.6 (Normal) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. $N(\mu,\sigma^2)$ random variables. Here $\theta=(\mu,\sigma^2)$ and
$$u_1(\mu,\sigma^2)=\frac{n(\bar y-\mu)}{\sigma^2},\qquad
u_2(\mu,\sigma^2)=-\frac{n}{2\sigma^2}+\frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(y_i-\mu)^2.$$
Since $E[U(\mu,\sigma^2)]=0$, we must have $E[\bar Y]=\mu$ and $E\left[\frac1n\sum_{i=1}^n(Y_i-\mu)^2\right]=\sigma^2$.
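Theorem 2.1 can be illustrated by simulation. The sketch below (Python, with hypothetical parameter values; illustrative only) draws repeated normal samples, evaluates the score of Example 2.6 at the true $(\mu,\sigma^2)$, and checks that its average is close to zero:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 1.0, 2.0, 30, 20000  # hypothetical true values

u1_vals, u2_vals = [], []
for _ in range(reps):
    y = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=n)
    ybar = y.mean()
    u1 = n * (ybar - mu) / sigma2                              # score for mu
    u2 = -n / (2 * sigma2) + np.sum((y - mu)**2) / (2 * sigma2**2)  # score for sigma^2
    u1_vals.append(u1)
    u2_vals.append(u2)

# Both averages should be close to zero, consistent with Theorem 2.1.
print(np.mean(u1_vals), np.mean(u2_vals))
```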

2.5 Information

Suppose that $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, whose joint p.d.f. $f_Y(y;\theta)$ (and hence likelihood $L(\theta)$) is completely specified except for the values of $p$ unknown parameters $\theta=(\theta_1,\ldots,\theta_p)^T$. Previously, we defined the Hessian matrix $H(\theta)$ to be the matrix with components
$$[H(\theta)]_{ij}\equiv\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ell(\theta),\qquad i=1,\ldots,p;\ j=1,\ldots,p.$$
We call the matrix $-H(\theta)$ the observed information matrix. Where $p=1$ and $\theta=(\theta)$, the observed information is a scalar defined as $-H(\theta)\equiv-\frac{\partial^2}{\partial\theta^2}\ell(\theta)$.

As with the score, if we replace $y_1,\ldots,y_n$ in $H(\theta)$ by the corresponding random variables $Y_1,\ldots,Y_n$, we obtain a matrix of random variables. Then we define the expected information matrix or Fisher information matrix by
$$[I(\theta)]_{ij}=E\left(-[H(\theta)]_{ij}\right),\qquad i=1,\ldots,p;\ j=1,\ldots,p.$$

An important result in likelihood theory is that the variance-covariance matrix of the score vector is equal to the expected information matrix:

Theorem 2.2 We have $\operatorname{Var}[U(\theta)]=I(\theta)$, i.e. $\left[\operatorname{Var}[U(\theta)]\right]_{ij}=[I(\theta)]_{ij}$, $i=1,\ldots,p$, $j=1,\ldots,p$, provided that

  1. The variance exists.
  2. The sample space for $Y$ does not depend on $\theta$.

Proof. Our proof is for continuous $y$ – in the discrete case, replace the integrals by sums.

For each $i=1,\ldots,p$ and $j=1,\ldots,p$, since $E[U(\theta)]=0$ by Theorem 2.1,
$$\begin{aligned}
\left[\operatorname{Var}[U(\theta)]\right]_{ij}&=E[U_i(\theta)U_j(\theta)]
=\int \frac{\partial}{\partial\theta_i}\ell(\theta)\,\frac{\partial}{\partial\theta_j}\ell(\theta)\,f_Y(y;\theta)\,dy\\
&=\int \frac{\partial}{\partial\theta_i}\log f_Y(y;\theta)\,\frac{\partial}{\partial\theta_j}\log f_Y(y;\theta)\,f_Y(y;\theta)\,dy\\
&=\int \frac{\frac{\partial}{\partial\theta_i}f_Y(y;\theta)}{f_Y(y;\theta)}\,\frac{\frac{\partial}{\partial\theta_j}f_Y(y;\theta)}{f_Y(y;\theta)}\,f_Y(y;\theta)\,dy
=\int \frac{1}{f_Y(y;\theta)}\,\frac{\partial}{\partial\theta_i}f_Y(y;\theta)\,\frac{\partial}{\partial\theta_j}f_Y(y;\theta)\,dy.
\end{aligned}$$

Now
$$\begin{aligned}
[I(\theta)]_{ij}&=E\left[-\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ell(\theta)\right]
=-\int \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log f_Y(y;\theta)\,f_Y(y;\theta)\,dy\\
&=-\int \frac{\partial}{\partial\theta_i}\left[\frac{\frac{\partial}{\partial\theta_j}f_Y(y;\theta)}{f_Y(y;\theta)}\right]f_Y(y;\theta)\,dy\\
&=-\int \left[\frac{\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}f_Y(y;\theta)}{f_Y(y;\theta)}-\frac{\frac{\partial}{\partial\theta_i}f_Y(y;\theta)\,\frac{\partial}{\partial\theta_j}f_Y(y;\theta)}{f_Y(y;\theta)^2}\right]f_Y(y;\theta)\,dy\\
&=-\int \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}f_Y(y;\theta)\,dy+\int \frac{1}{f_Y(y;\theta)}\,\frac{\partial}{\partial\theta_i}f_Y(y;\theta)\,\frac{\partial}{\partial\theta_j}f_Y(y;\theta)\,dy
=\left[\operatorname{Var}[U(\theta)]\right]_{ij},
\end{aligned}$$
as required, since the first integral vanishes: $\int \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}f_Y(y;\theta)\,dy=\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\int f_Y(y;\theta)\,dy=\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}1=0$.

Example 2.7 (Bernoulli) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. Bernoulli($p$) random variables. Here $\theta=(p)$ and
$$u(p)=\frac{n\bar y}{p}-\frac{n(1-\bar y)}{1-p},\qquad
H(p)=-\frac{n\bar y}{p^2}-\frac{n(1-\bar y)}{(1-p)^2},\qquad
I(p)=\frac np+\frac{n}{1-p}=\frac{n}{p(1-p)}.$$

Example 2.8 (Normal) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. $N(\mu,\sigma^2)$ random variables. Here $\theta=(\mu,\sigma^2)$ and
$$u_1(\mu,\sigma^2)=\frac{n(\bar y-\mu)}{\sigma^2},\qquad
u_2(\mu,\sigma^2)=-\frac{n}{2\sigma^2}+\frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(y_i-\mu)^2.$$
Therefore
$$H(\mu,\sigma^2)=\begin{pmatrix}-\dfrac{n}{\sigma^2}&-\dfrac{n(\bar y-\mu)}{(\sigma^2)^2}\\[6pt]-\dfrac{n(\bar y-\mu)}{(\sigma^2)^2}&\dfrac{n}{2(\sigma^2)^2}-\dfrac{1}{(\sigma^2)^3}\displaystyle\sum_{i=1}^n(y_i-\mu)^2\end{pmatrix},\qquad
I(\mu,\sigma^2)=\begin{pmatrix}\dfrac{n}{\sigma^2}&0\\[6pt]0&\dfrac{n}{2(\sigma^2)^2}\end{pmatrix}.$$
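For the Bernoulli example, evaluating the observed information $-H(p)$ at $\hat p=\bar y$ gives $n/\bar y+n/(1-\bar y)=n/(\bar y(1-\bar y))$, which equals $I(\hat p)$. The sketch below (Python, hypothetical simulated data; illustrative only) checks this numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
p_true, n = 0.4, 200                    # hypothetical true value and sample size
y = rng.binomial(1, p_true, size=n)
ybar = y.mean()

p_hat = ybar
obs_info = n * ybar / p_hat**2 + n * (1 - ybar) / (1 - p_hat)**2   # -H(p-hat)
exp_info = n / (p_hat * (1 - p_hat))                               # I(p-hat)
# Identical here: -H(p-hat) reduces algebraically to n/(ybar*(1-ybar)) = I(p-hat).
print(obs_info, exp_info)
```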

2.6 Asymptotic distribution of the MLE

Maximum likelihood estimation is an attractive method of estimation for a number of reasons. It is intuitively sensible and usually reasonably straightforward to carry out. Even when the simultaneous equations we obtain by differentiating the log-likelihood function are impossible to solve directly, solution by numerical methods is usually feasible.

Perhaps the most compelling reason for considering maximum likelihood estimation is the asymptotic behaviour of maximum likelihood estimators.

Suppose that $y_1,\ldots,y_n$ are observations of independent random variables $Y_1,\ldots,Y_n$, whose joint p.d.f.
$$f_Y(y;\theta)=\prod_{i=1}^n f_{Y_i}(y_i;\theta)$$
is completely specified except for the values of an unknown parameter vector $\theta$, and that $\hat\theta$ is the maximum likelihood estimator of $\theta$.

Then, as $n\to\infty$, the distribution of $\hat\theta$ tends to a multivariate normal distribution with mean vector $\theta$ and variance-covariance matrix $I(\theta)^{-1}$.

Where $p=1$ and $\theta=(\theta)$, the distribution of the MLE $\hat\theta$ tends to $N[\theta,\,1/I(\theta)]$.

For ‘large enough $n$’, we can treat the asymptotic distribution of the MLE as an approximation. The fact that $E(\hat\theta)\to\theta$ means that the maximum likelihood estimator is approximately unbiased in large samples. The variance of $\hat\theta$ is approximately $I(\theta)^{-1}$. It is possible to show that this is the smallest possible variance of any unbiased estimator of $\theta$ (this result is the Cramér–Rao lower bound, which we do not prove here). Therefore the MLE is the ‘best possible’ estimator in large samples (and therefore we hope it is also reasonable in small samples, though we should investigate this case by case).
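The asymptotic normality of the MLE can be seen by simulation. The sketch below (Python, hypothetical settings; illustrative only) approximates the sampling distribution of $\hat p=\bar Y$ for Bernoulli data and compares its mean and variance with $p$ and $1/I(p)=p(1-p)/n$:

```python
import numpy as np

rng = np.random.default_rng(4)
p_true, n, reps = 0.3, 100, 10000   # hypothetical settings

# Sampling distribution of the MLE p-hat = Ybar over repeated samples:
# a Binomial(n, p) count divided by n is the mean of n Bernoulli(p) variables.
p_hats = rng.binomial(n, p_true, size=reps) / n

print(p_hats.mean(), p_true)                     # approximately unbiased
print(p_hats.var(), p_true * (1 - p_true) / n)   # variance close to 1/I(p) = p(1-p)/n
```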

2.7 Quantifying uncertainty in parameter estimates

The usefulness of an estimate is always enhanced if some kind of measure of its precision can also be provided. Usually, this will be a standard error, an estimate of the standard deviation of the associated estimator. For the maximum likelihood estimator $\hat\theta$ of a scalar parameter, a standard error is given by
$$\text{s.e.}(\hat\theta)=\frac{1}{I(\hat\theta)^{\frac12}},$$
and for a vector parameter $\theta$,
$$\text{s.e.}(\hat\theta_i)=\left[I(\hat\theta)^{-1}\right]_{ii}^{\frac12},\qquad i=1,\ldots,p.$$

An alternative summary of the information provided by the observed data about the location of a parameter $\theta$ and the associated precision is a confidence interval.

The asymptotic distribution of the maximum likelihood estimator can be used to provide approximate large-sample confidence intervals. Asymptotically, $\hat\theta_i$ has a $N(\theta_i,[I(\theta)^{-1}]_{ii})$ distribution, and we can find $z_{1-\frac\alpha2}$ such that
$$P\left(-z_{1-\frac\alpha2}\le\frac{\hat\theta_i-\theta_i}{[I(\theta)^{-1}]_{ii}^{\frac12}}\le z_{1-\frac\alpha2}\right)=1-\alpha.$$
Therefore
$$P\left(\hat\theta_i-z_{1-\frac\alpha2}[I(\theta)^{-1}]_{ii}^{\frac12}\le\theta_i\le\hat\theta_i+z_{1-\frac\alpha2}[I(\theta)^{-1}]_{ii}^{\frac12}\right)=1-\alpha.$$
The endpoints of this interval cannot be evaluated, because they also depend on the unknown parameter vector $\theta$. However, if we replace $I(\theta)$ by its MLE $I(\hat\theta)$, we obtain the approximate large-sample $100(1-\alpha)\%$ confidence interval
$$\left[\hat\theta_i-z_{1-\frac\alpha2}[I(\hat\theta)^{-1}]_{ii}^{\frac12},\ \hat\theta_i+z_{1-\frac\alpha2}[I(\hat\theta)^{-1}]_{ii}^{\frac12}\right].$$
For $\alpha=0.1,\,0.05,\,0.01$, we have $z_{1-\frac\alpha2}=1.64,\,1.96,\,2.58$ respectively.

Example 2.9 (Bernoulli) If $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. Bernoulli($p$) random variables, then asymptotically $\hat p=\bar y$ has a $N(p,\,p(1-p)/n)$ distribution, and a large-sample 95% confidence interval for $p$ is
$$\left[\hat p-1.96\,I(\hat p)^{-\frac12},\ \hat p+1.96\,I(\hat p)^{-\frac12}\right]
=\left[\hat p-1.96\sqrt{\hat p(1-\hat p)/n},\ \hat p+1.96\sqrt{\hat p(1-\hat p)/n}\right]
=\left[\bar y-1.96\sqrt{\bar y(1-\bar y)/n},\ \bar y+1.96\sqrt{\bar y(1-\bar y)/n}\right].$$
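The sketch below (Python, hypothetical simulated data; illustrative only) computes this approximate 95% confidence interval for $p$:

```python
import numpy as np

rng = np.random.default_rng(5)
p_true, n = 0.6, 150                 # hypothetical settings
y = rng.binomial(1, p_true, size=n)
ybar = y.mean()

se = np.sqrt(ybar * (1 - ybar) / n)  # estimated standard error of p-hat
lower, upper = ybar - 1.96 * se, ybar + 1.96 * se
print(f"95% CI for p: [{lower:.3f}, {upper:.3f}]")
```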

2.8 Comparing statistical models

If we have a set of competing probability models which might have generated the observed data, we may want to determine which of the models is most appropriate. In practice, we proceed by comparing models pairwise. Suppose that we have two competing alternatives, $f_Y^{(0)}$ (model $M_0$) and $f_Y^{(1)}$ (model $M_1$), for $f_Y$, the joint distribution of $Y_1,\ldots,Y_n$. Often $M_0$ and $M_1$ both take the same parametric form $f_Y(y;\theta)$, but with $\theta\in\Theta^{(0)}$ for $M_0$ and $\theta\in\Theta^{(1)}$ for $M_1$, where $\Theta^{(0)}$ and $\Theta^{(1)}$ are alternative sets of possible values for $\theta$. In the regression setting, we are often interested in determining which of a set of explanatory variables have an impact on the distribution of the response.

2.8.1 Hypothesis testing

A hypothesis test provides one mechanism for comparing two competing statistical models. A hypothesis test does not treat the two hypotheses (models) symmetrically. One hypothesis, $H_0$: ‘the data were generated from model $M_0$’, is accorded special status and referred to as the null hypothesis. The null hypothesis is the reference model, and it will be assumed to be appropriate unless the observed data strongly indicate that $H_0$ is inappropriate and that $H_1$: ‘the data were generated from model $M_1$’ (the alternative hypothesis) should be preferred. The fact that a hypothesis test does not reject $H_0$ should not be taken as evidence that $H_0$ is true and $H_1$ is not, or that $H_0$ is better supported by the data than $H_1$; it merely indicates that the data do not provide sufficient evidence to reject $H_0$ in favour of $H_1$.

A hypothesis test is defined by its critical region or rejection region, which we shall denote by $C$. $C$ is a subset of $\mathbb{R}^n$ and is the set of possible $y$ which would lead to rejection of $H_0$ in favour of $H_1$, i.e.

  • If $y\in C$, $H_0$ is rejected in favour of $H_1$;
  • If $y\notin C$, $H_0$ is not rejected.

As $Y$ is a random variable, there remains the possibility that a hypothesis test will produce an erroneous result. We define the size (or significance level) of the test as
$$\alpha=\max_{\theta\in\Theta^{(0)}}P(Y\in C;\theta).$$
This is the maximum probability of erroneously rejecting $H_0$, over all possible distributions for $Y$ implied by $H_0$. We also define the power function
$$\omega(\theta)=P(Y\in C;\theta),$$
which represents the probability of rejecting $H_0$ for a particular value of $\theta$. Clearly we would like to find a test where $\omega(\theta)$ is large for every $\theta\in\Theta^{(1)}\setminus\Theta^{(0)}$, while at the same time avoiding erroneous rejection of $H_0$. In other words, a good test will have small size but large power.

The general hypothesis testing procedure is to fix $\alpha$ at some small value (often 0.05), so that the probability of erroneously rejecting $H_0$ is limited. In doing this, we are giving $H_0$ precedence over $H_1$. Given our specified $\alpha$, we try to choose a test, defined by its rejection region $C$, which makes $\omega(\theta)$ as large as possible for $\theta\in\Theta^{(1)}\setminus\Theta^{(0)}$.

2.8.2 Likelihood ratio tests for nested hypotheses

Suppose that $H_0$ and $H_1$ both take the same parametric form $f_Y(y;\theta)$, with $\theta\in\Theta^{(0)}$ for $H_0$ and $\theta\in\Theta^{(1)}$ for $H_1$, where $\Theta^{(0)}$ and $\Theta^{(1)}$ are alternative sets of possible values for $\theta$. A likelihood ratio test of $H_0$ against $H_1$ has a critical region of the form

$$C=\left\{y:\frac{\max_{\theta\in\Theta^{(1)}}L(\theta)}{\max_{\theta\in\Theta^{(0)}}L(\theta)}>k\right\} \tag{2.3}$$

where $k$ is determined by $\alpha$, the size of the test, so that
$$\max_{\theta\in\Theta^{(0)}}P(Y\in C;\theta)=\alpha.$$
Therefore, we will only reject $H_0$ if $H_1$ offers a distribution for $Y_1,\ldots,Y_n$ which makes the observed data much more probable than any distribution under $H_0$. This is intuitively appealing and tends to produce good tests (large power) across a wide range of examples.

In order to determine $k$ in (2.3), we need to know the distribution of the likelihood ratio, or an equivalent statistic, under $H_0$. In general, this will not be available to us. However, we can make use of an important asymptotic result.

First we notice that, as $\log$ is a strictly increasing function, the rejection region is equivalent to
$$C=\left\{y:2\log\left(\frac{\max_{\theta\in\Theta^{(1)}}L(\theta)}{\max_{\theta\in\Theta^{(0)}}L(\theta)}\right)>k\right\},$$
where $\max_{\theta\in\Theta^{(0)}}P(Y\in C;\theta)=\alpha$. Write
$$L_{01}\equiv2\log\left(\frac{\max_{\theta\in\Theta^{(1)}}L(\theta)}{\max_{\theta\in\Theta^{(0)}}L(\theta)}\right)$$
for the log-likelihood ratio test statistic. Provided that $H_0$ is nested within $H_1$, the following result provides a useful large-$n$ approximation to the distribution of $L_{01}$.

Theorem 2.3 Suppose that $H_0$: $\theta\in\Theta^{(0)}$ and $H_1$: $\theta\in\Theta^{(1)}$, where $\Theta^{(0)}\subset\Theta^{(1)}$. Let $d_0=\dim(\Theta^{(0)})$ and $d_1=\dim(\Theta^{(1)})$. Under $H_0$, the distribution of $L_{01}$ tends towards $\chi^2_{d_1-d_0}$ as $n\to\infty$.

Proof. First we note that, in the case where $\theta$ is one-dimensional and $\theta=(\theta)$, a Taylor series expansion of $\ell(\theta)$ around the MLE $\hat\theta$ gives
$$\ell(\theta)=\ell(\hat\theta)+(\theta-\hat\theta)U(\hat\theta)+\tfrac12(\theta-\hat\theta)^2U'(\hat\theta)+\cdots$$
Now $U(\hat\theta)=0$, and if we approximate $U'(\hat\theta)=H(\hat\theta)$ by $E[H(\theta)]=-I(\theta)$, and also ignore higher-order terms, we obtain
$$2\left[\ell(\hat\theta)-\ell(\theta)\right]=(\theta-\hat\theta)^2I(\theta).$$
As $\hat\theta$ is asymptotically $N[\theta,I(\theta)^{-1}]$, $(\theta-\hat\theta)^2I(\theta)$ is asymptotically $\chi^2_1$, and hence so is $2[\ell(\hat\theta)-\ell(\theta)]$.

Similarly, it can be shown that when $\theta\in\Theta$, a multidimensional parameter space, $2[\ell(\hat\theta)-\ell(\theta)]$ is asymptotically $\chi^2_p$, where $p$ is the dimension of $\Theta$.

Now, suppose that $H_0$ is true, so $\theta\in\Theta^{(0)}$ and therefore $\theta\in\Theta^{(1)}$. Furthermore, suppose that $\ell(\theta)$ is maximised over $\Theta^{(0)}$ by $\hat\theta^{(0)}$ and over $\Theta^{(1)}$ by $\hat\theta^{(1)}$. Then

$$\begin{aligned}
L_{01}&\equiv2\log\left(\frac{\max_{\theta\in\Theta^{(1)}}L(\theta)}{\max_{\theta\in\Theta^{(0)}}L(\theta)}\right)
=2\log L(\hat\theta^{(1)})-2\log L(\hat\theta^{(0)})\\
&=2\left[\log L(\hat\theta^{(1)})-\log L(\theta)\right]-2\left[\log L(\hat\theta^{(0)})-\log L(\theta)\right]
=L_1-L_0.
\end{aligned}$$

Therefore $L_1=L_{01}+L_0$, and we know that, under $H_0$, $L_1$ has an asymptotic $\chi^2_{d_1}$ distribution and $L_0$ has an asymptotic $\chi^2_{d_0}$ distribution. Furthermore, it is possible to show (although we will not do so here) that under $H_0$, $L_{01}$ and $L_0$ are independent. It can also be shown that under $H_0$ the difference $L_1-L_0$ can be expressed as a quadratic form of normal random variables. Therefore, it follows that under $H_0$ the log-likelihood ratio statistic $L_{01}$ has an asymptotic $\chi^2_{d_1-d_0}$ distribution. ◻
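Theorem 2.3 can be illustrated by simulation. The sketch below (Python; the particular hypothesis pair is chosen purely for illustration and is not an example from these notes) tests $H_0$: $\mu=\mu_0$ (with $\sigma^2$ unrestricted) against an unrestricted $N(\mu,\sigma^2)$ alternative, so that $d_1-d_0=1$, and compares the empirical 95% quantile of $L_{01}$ under $H_0$ with the $\chi^2_1$ quantile 3.84:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n, mu0, sigma2, reps = 40, 0.0, 1.5, 20000  # hypothetical settings

# H0: mu = mu0 (sigma^2 unrestricted) vs H1: (mu, sigma^2) unrestricted,
# for i.i.d. N(mu, sigma^2) data, so d1 - d0 = 2 - 1 = 1.
l01 = np.empty(reps)
for r in range(reps):
    y = rng.normal(loc=mu0, scale=np.sqrt(sigma2), size=n)   # simulate under H0
    s2_hat0 = np.mean((y - mu0)**2)        # restricted MLE of sigma^2
    s2_hat1 = np.mean((y - y.mean())**2)   # unrestricted MLE of sigma^2
    l01[r] = n * np.log(s2_hat0 / s2_hat1) # log-likelihood ratio statistic

# The empirical 95% quantile of L01 under H0 should be close to chi^2_1's 3.84.
print(np.quantile(l01, 0.95), chi2.ppf(0.95, df=1))
```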

Example 2.10 $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. Bernoulli($p$) random variables. Suppose that we require a size-$\alpha$ test of the hypothesis $H_0$: $p=p_0$ against the general alternative $H_1$: ‘$p$ is unrestricted’, where $\alpha$ and $p_0$ are specified.

Here $\theta=(p)$, $\Theta^{(0)}=\{p_0\}$ and $\Theta^{(1)}=(0,1)$, and the log-likelihood ratio statistic is
$$L_{01}=2n\bar y\log\left(\frac{\bar y}{p_0}\right)+2n(1-\bar y)\log\left(\frac{1-\bar y}{1-p_0}\right).$$
As $d_1=1$ and $d_0=0$, under $H_0$ the log-likelihood ratio statistic has an asymptotic $\chi^2_1$ distribution. For a log-likelihood ratio test, we only reject $H_0$ in favour of $H_1$ when the test statistic is too large (the observed data are much more probable under model $H_1$ than under model $H_0$), so in this case we reject $H_0$ when the observed value of the test statistic above is ‘too large’ to have come from a $\chi^2_1$ distribution. What we mean by ‘too large’ depends on the significance level $\alpha$ of the test. For example, if $\alpha=0.05$, a common choice, then we should reject $H_0$ if the test statistic is greater than 3.84, the 95% quantile of the $\chi^2_1$ distribution.
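The sketch below (Python, with hypothetical data and a hypothetical null value $p_0$; illustrative only) computes the statistic of Example 2.10 and compares it with the $\chi^2_1$ critical value:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
p0, n = 0.5, 80                       # hypothetical null value and sample size
y = rng.binomial(1, 0.65, size=n)     # hypothetical data (true p = 0.65)
ybar = y.mean()

# Log-likelihood ratio statistic from Example 2.10.
l01 = (2 * n * ybar * np.log(ybar / p0)
       + 2 * n * (1 - ybar) * np.log((1 - ybar) / (1 - p0)))

print(l01, chi2.ppf(0.95, df=1))   # reject H0 at the 5% level if l01 > 3.84
print(chi2.sf(l01, df=1))          # approximate p-value
```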

2.8.3 Information criteria for model comparison

It is more difficult to use the likelihood ratio test of the previous section to compare two models if those models are not nested. An alternative approach is to record some criterion measuring the quality of the model for each of a candidate set of models, then choose the model which is the best according to this criterion.

When we were estimating the unknown parameters θ of a model, we chose the value which maximised the likelihood: that is, the value of θ that maximises the probability of observing the data we actually saw. It is tempting to use a similar system for choosing between two models, and to choose the model which has the greater likelihood, under which the probability of seeing the data we actually observed is maximised. However, if we do this we will always end up choosing complicated models, which fit the observed data very closely, but do not meet our requirement of parsimony.

For a given model depending on parameters $\theta\in\mathbb{R}^p$, let $\hat\ell=\ell(\hat\theta)$ be the log-likelihood function for that model evaluated at the MLE $\hat\theta$. It is not sensible to choose between models by maximising $\hat\ell$ directly; instead, it is common to choose the model which maximises a criterion of the form
$$\hat\ell-\text{penalty},$$
where the penalty term will be large for complex models and small for simple models.

Equivalently, we may choose between models by minimising a criterion of the form $-2\hat\ell+\text{penalty}$. By convention, many commonly used criteria for model comparison take this form. For instance, the Akaike information criterion (AIC) is
$$\text{AIC}=-2\hat\ell+2p,$$
where $p$ is the dimension of the unknown parameter in the candidate model, and the Bayesian information criterion (BIC) is
$$\text{BIC}=-2\hat\ell+p\log(n),$$
where $n$ is the number of observations.
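As an illustration (a Python sketch with hypothetical data and two candidate models chosen only for demonstration), the criteria can be computed for a normal model with $\mu$ fixed at zero ($p=1$ free parameter, $\sigma^2$) and a normal model with $\mu$ and $\sigma^2$ both free ($p=2$); the model with the smaller AIC or BIC is preferred:

```python
import numpy as np

rng = np.random.default_rng(8)
y = rng.normal(loc=0.4, scale=1.0, size=60)   # hypothetical data
n = len(y)

def normal_max_loglik(y, mu=None):
    # Maximised normal log-likelihood; if mu is None it is estimated by ybar.
    m = len(y)
    mu_hat = y.mean() if mu is None else mu
    s2_hat = np.mean((y - mu_hat)**2)
    return -0.5 * m * np.log(2 * np.pi) - 0.5 * m * np.log(s2_hat) - 0.5 * m

# Candidate models: M0 fixes mu = 0 (p = 1), M1 leaves mu free (p = 2).
for label, lhat, p in [("M0", normal_max_loglik(y, mu=0.0), 1),
                       ("M1", normal_max_loglik(y), 2)]:
    aic = -2 * lhat + 2 * p
    bic = -2 * lhat + np.log(n) * p
    print(label, round(aic, 2), round(bic, 2))  # smaller is better
```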