2  Parametric statistical inference

2.1 Introduction

Probability distributions, such as the binomial, Poisson and normal, enable us to calculate probabilities and other quantities of interest (e.g. expectations) for a probability model of a random process. Therefore, given the model, we can make statements about possible outcomes of the process.

Statistical inference is concerned with the inverse problem. Given outcomes of a random process (observed data), what conclusions (inferences) can we draw about the process itself?

We assume that the $n$ observations of the response $y=(y_1,\ldots,y_n)^T$ are observations of random variables $Y=(Y_1,\ldots,Y_n)^T$, which have joint p.d.f. $f_Y$ (joint p.f. for discrete variables). We use the observed data $y$ to make inferences about $f_Y$.

We usually make certain assumptions about $f_Y$. In particular, we often assume that $y_1,\ldots,y_n$ are observations of independent random variables. Hence
$$f_Y(y)=f_{Y_1}(y_1)f_{Y_2}(y_2)\cdots f_{Y_n}(y_n)=\prod_{i=1}^n f_{Y_i}(y_i).$$

In parametric statistical inference, we specify a joint distribution $f_Y$ for $Y$, which is known except for the values of parameters $\theta_1,\theta_2,\ldots,\theta_p$ (sometimes denoted collectively by $\theta$). Then we use the observed data $y$ to make inferences about $\theta_1,\theta_2,\ldots,\theta_p$. In this case, we usually write $f_Y$ as $f_Y(y;\theta)$, to make explicit the dependence on the unknown $\theta$.

2.2 The likelihood function

We often think of the joint density $f_Y(y;\theta)$ as a function of $y$ for fixed $\theta$, which describes the relative probabilities of different possible values of $y$, given a particular set of parameters $\theta$. However, in statistical inference, we have observed $y_1,\ldots,y_n$ (values of $Y_1,\ldots,Y_n$). Knowledge of the probability of alternative possible realisations of $Y$ is largely irrelevant. What we want to know about is $\theta$.

Our only link between the observed data $y_1,\ldots,y_n$ and $\theta$ is through the function $f_Y(y;\theta)$. Therefore, it seems sensible that parametric statistical inference should be based on this function. We can think of $f_Y(y;\theta)$ as a function of $\theta$ for fixed $y$, which describes the relative likelihoods of different possible (sets of) $\theta$, given observed data $y_1,\ldots,y_n$. We write $L(\theta;y)=f_Y(y;\theta)$ for this likelihood, which is a function of the unknown parameter $\theta$. For convenience, we often drop $y$ from the notation and write $L(\theta)$.

The likelihood function is of central importance in parametric statistical inference. It provides a means for comparing different possible values of $\theta$, based on the probabilities (or probability densities) that they assign to the observed data $y_1,\ldots,y_n$.

Notes

  1. Frequently it is more convenient to consider the log-likelihood function $\ell(\theta)=\log L(\theta)$.
  2. Nothing in the definition of the likelihood requires $y_1,\ldots,y_n$ to be observations of independent random variables, although we shall frequently make this assumption.
  3. Any factors which depend on $y_1,\ldots,y_n$ alone (and not on $\theta$) can be ignored when writing down the likelihood. Such factors give no information about the relative likelihoods of different possible values of $\theta$.

Example 2.1 (Bernoulli) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, independent identically distributed (i.i.d.) Bernoulli($p$) random variables. Here $\theta=(p)$ and the likelihood is
$$L(p)=\prod_{i=1}^n p^{y_i}(1-p)^{1-y_i}=p^{\sum_{i=1}^n y_i}(1-p)^{n-\sum_{i=1}^n y_i}.$$
The log-likelihood is
$$\ell(p)=\log L(p)=n\bar y\log p+n(1-\bar y)\log(1-p).$$

Example 2.2 (Normal) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. $N(\mu,\sigma^2)$ random variables. Here $\theta=(\mu,\sigma^2)$ and the likelihood is
$$L(\mu,\sigma^2)=\prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-\mu)^2}{2\sigma^2}\right)=(2\pi\sigma^2)^{-\frac n2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu)^2\right)\propto(\sigma^2)^{-\frac n2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu)^2\right).$$
The log-likelihood is
$$\ell(\mu,\sigma^2)=\log L(\mu,\sigma^2)=-\frac n2\log(2\pi)-\frac n2\log(\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu)^2.$$
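As an informal illustration (a sketch in Python using a small hypothetical data vector; not part of the derivations above), we can evaluate the Bernoulli log-likelihood of Example 2.1 at a few candidate values of $p$ and compare the support the data give to each:

```python
import numpy as np

# Hypothetical Bernoulli data (not from the notes): 10 trials, 7 successes.
y = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])
n, ybar = len(y), y.mean()

def bernoulli_loglik(p):
    # l(p) = n*ybar*log(p) + n*(1 - ybar)*log(1 - p), as in Example 2.1
    return n * ybar * np.log(p) + n * (1 - ybar) * np.log(1 - p)

# Compare the support the observed data give to a few candidate values of p.
for p in (0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f}: log-likelihood = {bernoulli_loglik(p):.3f}")
# The largest value among these occurs at p = 0.7 = ybar, anticipating Example 2.3.
```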

2.3 Maximum likelihood estimation

One of the primary tasks of parametric statistical inference is estimation of the unknown parameters $\theta_1,\ldots,\theta_p$. Consider the value of $\theta$ which maximises the likelihood function. This is the ‘most likely’ value of $\theta$, the one which makes the observed data ‘most probable’. When we are searching for an estimate of $\theta$, this would seem to be a good candidate.

We call the value of $\theta$ which maximises the likelihood $L(\theta)$ the maximum likelihood estimate (MLE) of $\theta$, denoted by $\hat\theta$. The estimate $\hat\theta$ depends on $y$, as different observed data samples lead to different likelihood functions. The corresponding function of $Y$ is called the maximum likelihood estimator and is also denoted by $\hat\theta$.

Note that, as $\theta=(\theta_1,\ldots,\theta_p)^T$, the MLE for any component of $\theta$ is given by the corresponding component of $\hat\theta=(\hat\theta_1,\ldots,\hat\theta_p)^T$. Similarly, the MLE for any function of the parameters $g(\theta)$ is given by $g(\hat\theta)$.

As $\log$ is a strictly increasing function, the value of $\theta$ which maximises $L(\theta)$ also maximises $\ell(\theta)=\log L(\theta)$. It is almost always easier to maximise $\ell(\theta)$. This is achieved in the usual way: finding a stationary point by differentiating $\ell(\theta)$ with respect to $\theta_1,\ldots,\theta_p$ and solving the resulting $p$ simultaneous equations. It should also be checked that the stationary point is a maximum.

Example 2.3 (Bernoulli) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. Bernoulli($p$) random variables. Here $\theta=(p)$ and the log-likelihood is
$$\ell(p)=n\bar y\log p+n(1-\bar y)\log(1-p).$$
Differentiating with respect to $p$,
$$\frac{\partial}{\partial p}\ell(p)=\frac{n\bar y}{p}-\frac{n(1-\bar y)}{1-p},$$
so the MLE $\hat p$ solves
$$\frac{n\bar y}{\hat p}-\frac{n(1-\bar y)}{1-\hat p}=0.$$
Solving this for $\hat p$ gives $\hat p=\bar y$. Note that
$$\frac{\partial^2}{\partial p^2}\ell(p)=-\frac{n\bar y}{p^2}-\frac{n(1-\bar y)}{(1-p)^2}<0$$
everywhere, so the stationary point is clearly a maximum.
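As a quick numerical check (a Python sketch with the same hypothetical data as before; the optimiser and bounds are illustrative choices only), maximising $\ell(p)$ numerically recovers $\hat p=\bar y$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical Bernoulli data: the MLE should equal the sample mean.
y = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])
n, ybar = len(y), y.mean()

def neg_loglik(p):
    # minus the Bernoulli log-likelihood of Example 2.3
    return -(n * ybar * np.log(p) + n * (1 - ybar) * np.log(1 - p))

# Maximise l(p) by minimising -l(p) over (0, 1).
res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, ybar)  # the numerical maximiser agrees with p-hat = ybar
```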

Example 2.4 (Normal) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. $N(\mu,\sigma^2)$ random variables. Here $\theta=(\mu,\sigma^2)$ and the log-likelihood is
$$\ell(\mu,\sigma^2)=-\frac n2\log(2\pi)-\frac n2\log(\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu)^2.$$
Differentiating with respect to $\mu$,
$$\frac{\partial}{\partial\mu}\ell(\mu,\sigma^2)=\frac{1}{\sigma^2}\sum_{i=1}^n(y_i-\mu)=\frac{n(\bar y-\mu)}{\sigma^2},$$
so $(\hat\mu,\hat\sigma^2)$ solve

$$\frac{n(\bar y-\hat\mu)}{\hat\sigma^2}=0. \tag{2.1}$$

Differentiating with respect to $\sigma^2$,
$$\frac{\partial}{\partial\sigma^2}\ell(\mu,\sigma^2)=-\frac{n}{2\sigma^2}+\frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(y_i-\mu)^2,$$
so

$$-\frac{n}{2\hat\sigma^2}+\frac{1}{2(\hat\sigma^2)^2}\sum_{i=1}^n(y_i-\hat\mu)^2=0. \tag{2.2}$$

Solving (2.1) and (2.2), we obtain
$$\hat\mu=\bar y\qquad\text{and}\qquad\hat\sigma^2=\frac1n\sum_{i=1}^n(y_i-\hat\mu)^2=\frac1n\sum_{i=1}^n(y_i-\bar y)^2.$$

Strictly, to show that this stationary point is a maximum, we need to show that the Hessian matrix (the matrix of second derivatives, with elements $[H(\theta)]_{ij}=\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ell(\theta)$) is negative definite at $\theta=\hat\theta$, that is, $a^TH(\hat\theta)a<0$ for every $a\ne0$. Here
$$H(\hat\mu,\hat\sigma^2)=\begin{pmatrix}-\dfrac{n}{\hat\sigma^2}&0\\[4pt]0&-\dfrac{n}{2(\hat\sigma^2)^2}\end{pmatrix},$$
which is clearly negative definite.
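The closed-form results for the normal example can be checked numerically. The sketch below (Python, with hypothetical simulated data; illustrative only) computes $\hat\mu$ and $\hat\sigma^2$ and confirms that the Hessian at the MLE is negative definite by inspecting its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical normal sample (mu = 2, sigma^2 = 4 chosen only for illustration).
y = rng.normal(loc=2.0, scale=2.0, size=50)
n = len(y)

mu_hat = y.mean()                        # mu-hat = ybar
sigma2_hat = np.mean((y - mu_hat)**2)    # sigma^2-hat = (1/n) * sum (y_i - ybar)^2

# Hessian of the log-likelihood evaluated at the MLE (Example 2.4).
H = np.array([
    [-n / sigma2_hat, 0.0],
    [0.0, -n / (2 * sigma2_hat**2)],
])
print(mu_hat, sigma2_hat)
print(np.linalg.eigvalsh(H))  # both eigenvalues negative: H is negative definite
```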

2.4 Score

Let $u_i(\theta)\equiv\frac{\partial}{\partial\theta_i}\ell(\theta)$, $i=1,\ldots,p$, and $u(\theta)\equiv[u_1(\theta),\ldots,u_p(\theta)]^T$. Then we call $u(\theta)$ the vector of scores or score vector. Where $p=1$ and $\theta=(\theta)$, the score is the scalar defined as $u(\theta)\equiv\frac{\partial}{\partial\theta}\ell(\theta)$. The maximum likelihood estimate $\hat\theta$ satisfies $u(\hat\theta)=0$, that is, $u_i(\hat\theta)=0$, $i=1,\ldots,p$. Note that $u(\theta)$ is a function of $\theta$ for fixed (observed) $y$. However, if we replace $y_1,\ldots,y_n$ in $u(\theta)$ by the corresponding random variables $Y_1,\ldots,Y_n$, then we obtain a vector of random variables $U(\theta)\equiv[U_1(\theta),\ldots,U_p(\theta)]^T$.

An important result in likelihood theory is that the expected score at the true (but unknown) value of $\theta$ is zero:

Theorem 2.1 We have $E[U(\theta)]=0$, i.e. $E[U_i(\theta)]=0$, $i=1,\ldots,p$, provided that

  1. The expectation exists.
  2. The sample space for $Y$ does not depend on $\theta$.

Proof. Our proof is for continuous $y$ – in the discrete case, replace the integrals by sums. For each $i=1,\ldots,p$,
$$\begin{aligned}
E[U_i(\theta)]&=\int u_i(\theta)\,f_Y(y;\theta)\,dy
=\int \frac{\partial}{\partial\theta_i}\ell(\theta)\,f_Y(y;\theta)\,dy
=\int \frac{\partial}{\partial\theta_i}\log f_Y(y;\theta)\,f_Y(y;\theta)\,dy\\
&=\int \frac{\frac{\partial}{\partial\theta_i}f_Y(y;\theta)}{f_Y(y;\theta)}\,f_Y(y;\theta)\,dy
=\int \frac{\partial}{\partial\theta_i}f_Y(y;\theta)\,dy
=\frac{\partial}{\partial\theta_i}\int f_Y(y;\theta)\,dy
=\frac{\partial}{\partial\theta_i}1=0,
\end{aligned}$$
as required.

Example 2.5 (Bernoulli) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. Bernoulli($p$) random variables. Here $\theta=(p)$ and
$$u(p)=\frac{n\bar y}{p}-\frac{n(1-\bar y)}{1-p}.$$
Since $E[U(p)]=0$, we must have $E[\bar Y]=p$ (which we already know is correct).

Example 2.6 (Normal) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. $N(\mu,\sigma^2)$ random variables. Here $\theta=(\mu,\sigma^2)$ and
$$u_1(\mu,\sigma^2)=\frac{n(\bar y-\mu)}{\sigma^2},\qquad
u_2(\mu,\sigma^2)=-\frac{n}{2\sigma^2}+\frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(y_i-\mu)^2.$$
Since $E[U(\mu,\sigma^2)]=0$, we must have $E[\bar Y]=\mu$ and $E\left[\frac1n\sum_{i=1}^n(Y_i-\mu)^2\right]=\sigma^2$.
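Theorem 2.1 can be illustrated by simulation. The sketch below (Python, with hypothetical parameter values; illustrative only) draws repeated normal samples, evaluates the score of Example 2.6 at the true $(\mu,\sigma^2)$, and checks that its average is close to zero:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 1.0, 2.0, 30, 20000  # hypothetical true values

u1_vals, u2_vals = [], []
for _ in range(reps):
    y = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=n)
    ybar = y.mean()
    u1 = n * (ybar - mu) / sigma2                              # score for mu
    u2 = -n / (2 * sigma2) + np.sum((y - mu)**2) / (2 * sigma2**2)  # score for sigma^2
    u1_vals.append(u1)
    u2_vals.append(u2)

# Both averages should be close to zero, consistent with Theorem 2.1.
print(np.mean(u1_vals), np.mean(u2_vals))
```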

2.5 Information

Suppose that $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, whose joint p.d.f. $f_Y(y;\theta)$ (and hence likelihood $L(\theta)$) is completely specified except for the values of $p$ unknown parameters $\theta=(\theta_1,\ldots,\theta_p)^T$. Previously, we defined the Hessian matrix $H(\theta)$ to be the matrix with components
$$[H(\theta)]_{ij}\equiv\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ell(\theta),\qquad i=1,\ldots,p;\ j=1,\ldots,p.$$
We call the matrix $-H(\theta)$ the observed information matrix. Where $p=1$ and $\theta=(\theta)$, the observed information is a scalar defined as $-H(\theta)\equiv-\frac{\partial^2}{\partial\theta^2}\ell(\theta)$.

As with the score, if we replace $y_1,\ldots,y_n$ in $H(\theta)$ by the corresponding random variables $Y_1,\ldots,Y_n$, we obtain a matrix of random variables. Then we define the expected information matrix or Fisher information matrix by
$$[I(\theta)]_{ij}=E\left(-[H(\theta)]_{ij}\right),\qquad i=1,\ldots,p;\ j=1,\ldots,p.$$

An important result in likelihood theory is that the variance-covariance matrix of the score vector is equal to the expected information matrix:

Theorem 2.2 We have $\operatorname{Var}[U(\theta)]=I(\theta)$, i.e. $\left[\operatorname{Var}[U(\theta)]\right]_{ij}=[I(\theta)]_{ij}$, $i=1,\ldots,p$, $j=1,\ldots,p$, provided that

  1. The variance exists.
  2. The sample space for $Y$ does not depend on $\theta$.

Proof. Our proof is for continuous $y$ – in the discrete case, replace the integrals by sums.

For each $i=1,\ldots,p$ and $j=1,\ldots,p$, since $E[U(\theta)]=0$ by Theorem 2.1,
$$\begin{aligned}
\left[\operatorname{Var}[U(\theta)]\right]_{ij}&=E[U_i(\theta)U_j(\theta)]
=\int \frac{\partial}{\partial\theta_i}\ell(\theta)\,\frac{\partial}{\partial\theta_j}\ell(\theta)\,f_Y(y;\theta)\,dy\\
&=\int \frac{\partial}{\partial\theta_i}\log f_Y(y;\theta)\,\frac{\partial}{\partial\theta_j}\log f_Y(y;\theta)\,f_Y(y;\theta)\,dy\\
&=\int \frac{\frac{\partial}{\partial\theta_i}f_Y(y;\theta)}{f_Y(y;\theta)}\,\frac{\frac{\partial}{\partial\theta_j}f_Y(y;\theta)}{f_Y(y;\theta)}\,f_Y(y;\theta)\,dy
=\int \frac{1}{f_Y(y;\theta)}\,\frac{\partial}{\partial\theta_i}f_Y(y;\theta)\,\frac{\partial}{\partial\theta_j}f_Y(y;\theta)\,dy.
\end{aligned}$$

Now
$$\begin{aligned}
[I(\theta)]_{ij}&=E\left[-\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ell(\theta)\right]
=-\int \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log f_Y(y;\theta)\,f_Y(y;\theta)\,dy\\
&=-\int \frac{\partial}{\partial\theta_i}\left[\frac{\frac{\partial}{\partial\theta_j}f_Y(y;\theta)}{f_Y(y;\theta)}\right]f_Y(y;\theta)\,dy\\
&=-\int \left[\frac{\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}f_Y(y;\theta)}{f_Y(y;\theta)}-\frac{\frac{\partial}{\partial\theta_i}f_Y(y;\theta)\,\frac{\partial}{\partial\theta_j}f_Y(y;\theta)}{f_Y(y;\theta)^2}\right]f_Y(y;\theta)\,dy\\
&=-\int \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}f_Y(y;\theta)\,dy+\int \frac{1}{f_Y(y;\theta)}\,\frac{\partial}{\partial\theta_i}f_Y(y;\theta)\,\frac{\partial}{\partial\theta_j}f_Y(y;\theta)\,dy
=\left[\operatorname{Var}[U(\theta)]\right]_{ij},
\end{aligned}$$
as required, since the first integral vanishes: $\int \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}f_Y(y;\theta)\,dy=\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\int f_Y(y;\theta)\,dy=\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}1=0$.

Example 2.7 (Bernoulli) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. Bernoulli($p$) random variables. Here $\theta=(p)$ and
$$u(p)=\frac{n\bar y}{p}-\frac{n(1-\bar y)}{1-p},\qquad
H(p)=-\frac{n\bar y}{p^2}-\frac{n(1-\bar y)}{(1-p)^2},\qquad
I(p)=\frac np+\frac{n}{1-p}=\frac{n}{p(1-p)}.$$

Example 2.8 (Normal) $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. $N(\mu,\sigma^2)$ random variables. Here $\theta=(\mu,\sigma^2)$ and
$$u_1(\mu,\sigma^2)=\frac{n(\bar y-\mu)}{\sigma^2},\qquad
u_2(\mu,\sigma^2)=-\frac{n}{2\sigma^2}+\frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(y_i-\mu)^2.$$
Therefore
$$H(\mu,\sigma^2)=\begin{pmatrix}-\dfrac{n}{\sigma^2}&-\dfrac{n(\bar y-\mu)}{(\sigma^2)^2}\\[6pt]-\dfrac{n(\bar y-\mu)}{(\sigma^2)^2}&\dfrac{n}{2(\sigma^2)^2}-\dfrac{1}{(\sigma^2)^3}\displaystyle\sum_{i=1}^n(y_i-\mu)^2\end{pmatrix},\qquad
I(\mu,\sigma^2)=\begin{pmatrix}\dfrac{n}{\sigma^2}&0\\[6pt]0&\dfrac{n}{2(\sigma^2)^2}\end{pmatrix}.$$
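For the Bernoulli example, evaluating the observed information $-H(p)$ at $\hat p=\bar y$ gives $n/\bar y+n/(1-\bar y)=n/(\bar y(1-\bar y))$, which equals $I(\hat p)$. The sketch below (Python, hypothetical simulated data; illustrative only) checks this numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
p_true, n = 0.4, 200                    # hypothetical true value and sample size
y = rng.binomial(1, p_true, size=n)
ybar = y.mean()

p_hat = ybar
obs_info = n * ybar / p_hat**2 + n * (1 - ybar) / (1 - p_hat)**2   # -H(p-hat)
exp_info = n / (p_hat * (1 - p_hat))                               # I(p-hat)
# Identical here: -H(p-hat) reduces algebraically to n/(ybar*(1-ybar)) = I(p-hat).
print(obs_info, exp_info)
```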

2.6 Asymptotic distribution of the MLE

Maximum likelihood estimation is an attractive method of estimation for a number of reasons. It is intuitively sensible and usually reasonably straightforward to carry out. Even when the simultaneous equations we obtain by differentiating the log-likelihood function are impossible to solve directly, solution by numerical methods is usually feasible.

Perhaps the most compelling reason for considering maximum likelihood estimation is the asymptotic behaviour of maximum likelihood estimators.

Suppose that $y_1,\ldots,y_n$ are observations of independent random variables $Y_1,\ldots,Y_n$, whose joint p.d.f.
$$f_Y(y;\theta)=\prod_{i=1}^n f_{Y_i}(y_i;\theta)$$
is completely specified except for the values of an unknown parameter vector $\theta$, and that $\hat\theta$ is the maximum likelihood estimator of $\theta$.

Then, as $n\to\infty$, the distribution of $\hat\theta$ tends to a multivariate normal distribution with mean vector $\theta$ and variance-covariance matrix $I(\theta)^{-1}$.

Where $p=1$ and $\theta=(\theta)$, the distribution of the MLE $\hat\theta$ tends to $N[\theta,\,1/I(\theta)]$.

For ‘large enough $n$’, we can treat the asymptotic distribution of the MLE as an approximation. The fact that $E(\hat\theta)\to\theta$ means that the maximum likelihood estimator is approximately unbiased in large samples. The variance of $\hat\theta$ is approximately $I(\theta)^{-1}$. It is possible to show that this is the smallest possible variance of any unbiased estimator of $\theta$ (this result is the Cramér–Rao lower bound, which we do not prove here). Therefore the MLE is the ‘best possible’ estimator in large samples (and therefore we hope it is also reasonable in small samples, though we should investigate this case by case).
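The asymptotic normality of the MLE can be seen by simulation. The sketch below (Python, hypothetical settings; illustrative only) approximates the sampling distribution of $\hat p=\bar Y$ for Bernoulli data and compares its mean and variance with $p$ and $1/I(p)=p(1-p)/n$:

```python
import numpy as np

rng = np.random.default_rng(4)
p_true, n, reps = 0.3, 100, 10000   # hypothetical settings

# Sampling distribution of the MLE p-hat = Ybar over repeated samples:
# a Binomial(n, p) count divided by n is the mean of n Bernoulli(p) variables.
p_hats = rng.binomial(n, p_true, size=reps) / n

print(p_hats.mean(), p_true)                     # approximately unbiased
print(p_hats.var(), p_true * (1 - p_true) / n)   # variance close to 1/I(p) = p(1-p)/n
```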

2.7 Quantifying uncertainty in parameter estimates

The usefulness of an estimate is always enhanced if some kind of measure of its precision can also be provided. Usually, this will be a standard error, an estimate of the standard deviation of the associated estimator. For the maximum likelihood estimator $\hat\theta$ of a scalar parameter, a standard error is given by
$$\text{s.e.}(\hat\theta)=\frac{1}{I(\hat\theta)^{\frac12}},$$
and for a vector parameter $\theta$,
$$\text{s.e.}(\hat\theta_i)=\left[I(\hat\theta)^{-1}\right]_{ii}^{\frac12},\qquad i=1,\ldots,p.$$

An alternative summary of the information provided by the observed data about the location of a parameter $\theta$ and the associated precision is a confidence interval.

The asymptotic distribution of the maximum likelihood estimator can be used to provide approximate large-sample confidence intervals. Asymptotically, $\hat\theta_i$ has a $N(\theta_i,[I(\theta)^{-1}]_{ii})$ distribution, and we can find $z_{1-\frac\alpha2}$ such that
$$P\left(-z_{1-\frac\alpha2}\le\frac{\hat\theta_i-\theta_i}{[I(\theta)^{-1}]_{ii}^{\frac12}}\le z_{1-\frac\alpha2}\right)=1-\alpha.$$
Therefore
$$P\left(\hat\theta_i-z_{1-\frac\alpha2}[I(\theta)^{-1}]_{ii}^{\frac12}\le\theta_i\le\hat\theta_i+z_{1-\frac\alpha2}[I(\theta)^{-1}]_{ii}^{\frac12}\right)=1-\alpha.$$
The endpoints of this interval cannot be evaluated, because they also depend on the unknown parameter vector $\theta$. However, if we replace $I(\theta)$ by its MLE $I(\hat\theta)$, we obtain the approximate large-sample $100(1-\alpha)\%$ confidence interval
$$\left[\hat\theta_i-z_{1-\frac\alpha2}[I(\hat\theta)^{-1}]_{ii}^{\frac12},\ \hat\theta_i+z_{1-\frac\alpha2}[I(\hat\theta)^{-1}]_{ii}^{\frac12}\right].$$
For $\alpha=0.1,\,0.05,\,0.01$, we have $z_{1-\frac\alpha2}=1.64,\,1.96,\,2.58$ respectively.

Example 2.9 (Bernoulli) If $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. Bernoulli($p$) random variables, then asymptotically $\hat p=\bar y$ has a $N(p,\,p(1-p)/n)$ distribution, and a large-sample 95% confidence interval for $p$ is
$$\left[\hat p-1.96\,I(\hat p)^{-\frac12},\ \hat p+1.96\,I(\hat p)^{-\frac12}\right]
=\left[\hat p-1.96\sqrt{\hat p(1-\hat p)/n},\ \hat p+1.96\sqrt{\hat p(1-\hat p)/n}\right]
=\left[\bar y-1.96\sqrt{\bar y(1-\bar y)/n},\ \bar y+1.96\sqrt{\bar y(1-\bar y)/n}\right].$$
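The sketch below (Python, hypothetical simulated data; illustrative only) computes this approximate 95% confidence interval for $p$:

```python
import numpy as np

rng = np.random.default_rng(5)
p_true, n = 0.6, 150                 # hypothetical settings
y = rng.binomial(1, p_true, size=n)
ybar = y.mean()

se = np.sqrt(ybar * (1 - ybar) / n)  # estimated standard error of p-hat
lower, upper = ybar - 1.96 * se, ybar + 1.96 * se
print(f"95% CI for p: [{lower:.3f}, {upper:.3f}]")
```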

2.8 Comparing statistical models

If we have a set of competing probability models which might have generated the observed data, we may want to determine which of the models is most appropriate. In practice, we proceed by comparing models pairwise. Suppose that we have two competing alternatives, $f_Y^{(0)}$ (model $M_0$) and $f_Y^{(1)}$ (model $M_1$), for $f_Y$, the joint distribution of $Y_1,\ldots,Y_n$. Often $M_0$ and $M_1$ both take the same parametric form $f_Y(y;\theta)$, but with $\theta\in\Theta^{(0)}$ for $M_0$ and $\theta\in\Theta^{(1)}$ for $M_1$, where $\Theta^{(0)}$ and $\Theta^{(1)}$ are alternative sets of possible values for $\theta$. In the regression setting, we are often interested in determining which of a set of explanatory variables have an impact on the distribution of the response.

2.8.1 Hypothesis testing

A hypothesis test provides one mechanism for comparing two competing statistical models. A hypothesis test does not treat the two hypotheses (models) symmetrically. One hypothesis, $H_0$: ‘the data were generated from model $M_0$’, is accorded special status and referred to as the null hypothesis. The null hypothesis is the reference model, and it will be assumed to be appropriate unless the observed data strongly indicate that $H_0$ is inappropriate and that $H_1$: ‘the data were generated from model $M_1$’ (the alternative hypothesis) should be preferred. The fact that a hypothesis test does not reject $H_0$ should not be taken as evidence that $H_0$ is true and $H_1$ is not, or that $H_0$ is better supported by the data than $H_1$; it merely indicates that the data do not provide sufficient evidence to reject $H_0$ in favour of $H_1$.

A hypothesis test is defined by its critical region or rejection region, which we shall denote by $C$. $C$ is a subset of $\mathbb{R}^n$ and is the set of possible $y$ which would lead to rejection of $H_0$ in favour of $H_1$, i.e.

  • If $y\in C$, $H_0$ is rejected in favour of $H_1$;
  • If $y\notin C$, $H_0$ is not rejected.

As $Y$ is a random variable, there remains the possibility that a hypothesis test will produce an erroneous result. We define the size (or significance level) of the test as
$$\alpha=\max_{\theta\in\Theta^{(0)}}P(Y\in C;\theta).$$
This is the maximum probability of erroneously rejecting $H_0$, over all possible distributions for $Y$ implied by $H_0$. We also define the power function
$$\omega(\theta)=P(Y\in C;\theta),$$
which represents the probability of rejecting $H_0$ for a particular value of $\theta$. Clearly we would like to find a test where $\omega(\theta)$ is large for every $\theta\in\Theta^{(1)}\setminus\Theta^{(0)}$, while at the same time avoiding erroneous rejection of $H_0$. In other words, a good test will have small size but large power.

The general hypothesis testing procedure is to fix $\alpha$ at some small value (often 0.05), so that the probability of erroneously rejecting $H_0$ is limited. In doing this, we are giving $H_0$ precedence over $H_1$. Given our specified $\alpha$, we try to choose a test, defined by its rejection region $C$, which makes $\omega(\theta)$ as large as possible for $\theta\in\Theta^{(1)}\setminus\Theta^{(0)}$.

2.8.2 Likelihood ratio tests for nested hypotheses

Suppose that $H_0$ and $H_1$ both take the same parametric form $f_Y(y;\theta)$, with $\theta\in\Theta^{(0)}$ for $H_0$ and $\theta\in\Theta^{(1)}$ for $H_1$, where $\Theta^{(0)}$ and $\Theta^{(1)}$ are alternative sets of possible values for $\theta$. A likelihood ratio test of $H_0$ against $H_1$ has a critical region of the form

$$C=\left\{y:\frac{\max_{\theta\in\Theta^{(1)}}L(\theta)}{\max_{\theta\in\Theta^{(0)}}L(\theta)}>k\right\} \tag{2.3}$$

where $k$ is determined by $\alpha$, the size of the test, so that
$$\max_{\theta\in\Theta^{(0)}}P(Y\in C;\theta)=\alpha.$$
Therefore, we will only reject $H_0$ if $H_1$ offers a distribution for $Y_1,\ldots,Y_n$ which makes the observed data much more probable than any distribution under $H_0$. This is intuitively appealing and tends to produce good tests (large power) across a wide range of examples.

In order to determine $k$ in (2.3), we need to know the distribution of the likelihood ratio, or an equivalent statistic, under $H_0$. In general, this will not be available to us. However, we can make use of an important asymptotic result.

First we notice that, as $\log$ is a strictly increasing function, the rejection region is equivalent to
$$C=\left\{y:2\log\left(\frac{\max_{\theta\in\Theta^{(1)}}L(\theta)}{\max_{\theta\in\Theta^{(0)}}L(\theta)}\right)>k\right\},$$
where $\max_{\theta\in\Theta^{(0)}}P(Y\in C;\theta)=\alpha$. Write
$$L_{01}\equiv2\log\left(\frac{\max_{\theta\in\Theta^{(1)}}L(\theta)}{\max_{\theta\in\Theta^{(0)}}L(\theta)}\right)$$
for the log-likelihood ratio test statistic. Provided that $H_0$ is nested within $H_1$, the following result provides a useful large-$n$ approximation to the distribution of $L_{01}$.

Theorem 2.3 Suppose that $H_0$: $\theta\in\Theta^{(0)}$ and $H_1$: $\theta\in\Theta^{(1)}$, where $\Theta^{(0)}\subset\Theta^{(1)}$. Let $d_0=\dim(\Theta^{(0)})$ and $d_1=\dim(\Theta^{(1)})$. Under $H_0$, the distribution of $L_{01}$ tends towards $\chi^2_{d_1-d_0}$ as $n\to\infty$.

Proof. First we note that, in the case where $\theta$ is one-dimensional and $\theta=(\theta)$, a Taylor series expansion of $\ell(\theta)$ around the MLE $\hat\theta$ gives
$$\ell(\theta)=\ell(\hat\theta)+(\theta-\hat\theta)U(\hat\theta)+\tfrac12(\theta-\hat\theta)^2U'(\hat\theta)+\cdots$$
Now $U(\hat\theta)=0$, and if we approximate $U'(\hat\theta)=H(\hat\theta)$ by $E[H(\theta)]=-I(\theta)$, and also ignore higher-order terms, we obtain
$$2\left[\ell(\hat\theta)-\ell(\theta)\right]=(\theta-\hat\theta)^2I(\theta).$$
As $\hat\theta$ is asymptotically $N[\theta,I(\theta)^{-1}]$, $(\theta-\hat\theta)^2I(\theta)$ is asymptotically $\chi^2_1$, and hence so is $2[\ell(\hat\theta)-\ell(\theta)]$.

Similarly, it can be shown that when $\theta\in\Theta$, a multidimensional parameter space, $2[\ell(\hat\theta)-\ell(\theta)]$ is asymptotically $\chi^2_p$, where $p$ is the dimension of $\Theta$.

Now, suppose that $H_0$ is true, so $\theta\in\Theta^{(0)}$ and therefore $\theta\in\Theta^{(1)}$. Furthermore, suppose that $\ell(\theta)$ is maximised over $\Theta^{(0)}$ by $\hat\theta^{(0)}$ and over $\Theta^{(1)}$ by $\hat\theta^{(1)}$. Then

$$\begin{aligned}
L_{01}&\equiv2\log\left(\frac{\max_{\theta\in\Theta^{(1)}}L(\theta)}{\max_{\theta\in\Theta^{(0)}}L(\theta)}\right)
=2\log L(\hat\theta^{(1)})-2\log L(\hat\theta^{(0)})\\
&=2\left[\log L(\hat\theta^{(1)})-\log L(\theta)\right]-2\left[\log L(\hat\theta^{(0)})-\log L(\theta)\right]
=L_1-L_0.
\end{aligned}$$

Therefore $L_1=L_{01}+L_0$, and we know that, under $H_0$, $L_1$ has an asymptotic $\chi^2_{d_1}$ distribution and $L_0$ has an asymptotic $\chi^2_{d_0}$ distribution. Furthermore, it is possible to show (although we will not do so here) that under $H_0$, $L_{01}$ and $L_0$ are independent. It can also be shown that under $H_0$ the difference $L_1-L_0$ can be expressed as a quadratic form of normal random variables. Therefore, it follows that under $H_0$ the log-likelihood ratio statistic $L_{01}$ has an asymptotic $\chi^2_{d_1-d_0}$ distribution. ◻
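Theorem 2.3 can be illustrated by simulation. The sketch below (Python; the particular hypothesis pair is chosen purely for illustration and is not an example from these notes) tests $H_0$: $\mu=\mu_0$ (with $\sigma^2$ unrestricted) against an unrestricted $N(\mu,\sigma^2)$ alternative, so that $d_1-d_0=1$, and compares the empirical 95% quantile of $L_{01}$ under $H_0$ with the $\chi^2_1$ quantile 3.84:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n, mu0, sigma2, reps = 40, 0.0, 1.5, 20000  # hypothetical settings

# H0: mu = mu0 (sigma^2 unrestricted) vs H1: (mu, sigma^2) unrestricted,
# for i.i.d. N(mu, sigma^2) data, so d1 - d0 = 2 - 1 = 1.
l01 = np.empty(reps)
for r in range(reps):
    y = rng.normal(loc=mu0, scale=np.sqrt(sigma2), size=n)   # simulate under H0
    s2_hat0 = np.mean((y - mu0)**2)        # restricted MLE of sigma^2
    s2_hat1 = np.mean((y - y.mean())**2)   # unrestricted MLE of sigma^2
    l01[r] = n * np.log(s2_hat0 / s2_hat1) # log-likelihood ratio statistic

# The empirical 95% quantile of L01 under H0 should be close to chi^2_1's 3.84.
print(np.quantile(l01, 0.95), chi2.ppf(0.95, df=1))
```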

Example 2.10 $y_1,\ldots,y_n$ are observations of $Y_1,\ldots,Y_n$, i.i.d. Bernoulli($p$) random variables. Suppose that we require a size-$\alpha$ test of the hypothesis $H_0$: $p=p_0$ against the general alternative $H_1$: ‘$p$ is unrestricted’, where $\alpha$ and $p_0$ are specified.

Here $\theta=(p)$, $\Theta^{(0)}=\{p_0\}$ and $\Theta^{(1)}=(0,1)$, and the log-likelihood ratio statistic is
$$L_{01}=2n\bar y\log\left(\frac{\bar y}{p_0}\right)+2n(1-\bar y)\log\left(\frac{1-\bar y}{1-p_0}\right).$$
As $d_1=1$ and $d_0=0$, under $H_0$ the log-likelihood ratio statistic has an asymptotic $\chi^2_1$ distribution. For a log-likelihood ratio test, we only reject $H_0$ in favour of $H_1$ when the test statistic is too large (the observed data are much more probable under model $H_1$ than under model $H_0$), so in this case we reject $H_0$ when the observed value of the test statistic above is ‘too large’ to have come from a $\chi^2_1$ distribution. What we mean by ‘too large’ depends on the significance level $\alpha$ of the test. For example, if $\alpha=0.05$, a common choice, then we should reject $H_0$ if the test statistic is greater than 3.84, the 95% quantile of the $\chi^2_1$ distribution.
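The sketch below (Python, with hypothetical data and a hypothetical null value $p_0$; illustrative only) computes the statistic of Example 2.10 and compares it with the $\chi^2_1$ critical value:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
p0, n = 0.5, 80                       # hypothetical null value and sample size
y = rng.binomial(1, 0.65, size=n)     # hypothetical data (true p = 0.65)
ybar = y.mean()

# Log-likelihood ratio statistic from Example 2.10.
l01 = (2 * n * ybar * np.log(ybar / p0)
       + 2 * n * (1 - ybar) * np.log((1 - ybar) / (1 - p0)))

print(l01, chi2.ppf(0.95, df=1))   # reject H0 at the 5% level if l01 > 3.84
print(chi2.sf(l01, df=1))          # approximate p-value
```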

2.8.3 Information criteria for model comparison

It is more difficult to use the likelihood ratio test of the previous section to compare two models if those models are not nested. An alternative approach is to record some criterion measuring the quality of the model for each of a candidate set of models, then choose the model which is the best according to this criterion.

When we were estimating the unknown parameters θ of a model, we chose the value which maximised the likelihood: that is, the value of θ that maximises the probability of observing the data we actually saw. It is tempting to use a similar system for choosing between two models, and to choose the model which has the greater likelihood, under which the probability of seeing the data we actually observed is maximised. However, if we do this we will always end up choosing complicated models, which fit the observed data very closely, but do not meet our requirement of parsimony.

For a given model depending on parameters $\theta\in\mathbb{R}^p$, let $\hat\ell=\ell(\hat\theta)$ be the log-likelihood function for that model evaluated at the MLE $\hat\theta$. It is not sensible to choose between models by maximising $\hat\ell$ directly; instead, it is common to choose the model which maximises a criterion of the form
$$\hat\ell-\text{penalty},$$
where the penalty term will be large for complex models and small for simple models.

Equivalently, we may choose between models by minimising a criterion of the form $-2\hat\ell+\text{penalty}$. By convention, many commonly used criteria for model comparison take this form. For instance, the Akaike information criterion (AIC) is
$$\text{AIC}=-2\hat\ell+2p,$$
where $p$ is the dimension of the unknown parameter in the candidate model, and the Bayesian information criterion (BIC) is
$$\text{BIC}=-2\hat\ell+p\log(n),$$
where $n$ is the number of observations.
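As an illustration (a Python sketch with hypothetical data and two candidate models chosen only for demonstration), the criteria can be computed for a normal model with $\mu$ fixed at zero ($p=1$ free parameter, $\sigma^2$) and a normal model with $\mu$ and $\sigma^2$ both free ($p=2$); the model with the smaller AIC or BIC is preferred:

```python
import numpy as np

rng = np.random.default_rng(8)
y = rng.normal(loc=0.4, scale=1.0, size=60)   # hypothetical data
n = len(y)

def normal_max_loglik(y, mu=None):
    # Maximised normal log-likelihood; if mu is None it is estimated by ybar.
    m = len(y)
    mu_hat = y.mean() if mu is None else mu
    s2_hat = np.mean((y - mu_hat)**2)
    return -0.5 * m * np.log(2 * np.pi) - 0.5 * m * np.log(s2_hat) - 0.5 * m

# Candidate models: M0 fixes mu = 0 (p = 1), M1 leaves mu free (p = 2).
for label, lhat, p in [("M0", normal_max_loglik(y, mu=0.0), 1),
                       ("M1", normal_max_loglik(y), 2)]:
    aic = -2 * lhat + 2 * p
    bic = -2 * lhat + np.log(n) * p
    print(label, round(aic, 2), round(bic, 2))  # smaller is better
```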