Introduction
Probability distributions, such as the binomial, Poisson and normal, enable us to calculate probabilities and other quantities of interest (e.g. expectations) for a probability model of a random process. Therefore, given the model, we can make statements about possible outcomes of the process.
Statistical inference is concerned with the inverse problem. Given outcomes of a random process (observed data), what conclusions (inferences) can we draw about the process itself?
We assume that the observations $y_1, \ldots, y_n$ of the response are observations of random variables $Y_1, \ldots, Y_n$, which have joint p.d.f. $f_{\mathbf{Y}}(y_1, \ldots, y_n)$ (joint p.f. for discrete variables). We use the observed data to make inferences about $f_{\mathbf{Y}}$.
We usually make certain assumptions about $f_{\mathbf{Y}}$. In particular, we often assume that $y_1, \ldots, y_n$ are observations of independent random variables. Hence
$$f_{\mathbf{Y}}(y_1, \ldots, y_n) = \prod_{i=1}^n f_{Y_i}(y_i).$$
In parametric statistical inference, we specify a joint distribution $f_{\mathbf{Y}}$ for $Y_1, \ldots, Y_n$, which is known except for the values of $p$ parameters $\theta_1, \ldots, \theta_p$ (sometimes denoted by the vector $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)^T$). Then we use the observed data to make inferences about $\theta_1, \ldots, \theta_p$. In this case, we usually write $f_{\mathbf{Y}}(\mathbf{y})$ as $f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})$, to make explicit the dependence on the unknown $\boldsymbol{\theta}$.
The likelihood function
We often think of the joint density $f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})$ as a function of $\mathbf{y}$ for fixed $\boldsymbol{\theta}$, which describes the relative probabilities of different possible values of $\mathbf{y}$, given a particular set of parameters $\boldsymbol{\theta}$. However, in statistical inference, we have observed data $\mathbf{y}$ (values of $Y_1, \ldots, Y_n$). Knowledge of the probability of alternative possible realisations of $\mathbf{Y}$ is largely irrelevant. What we want to know about is $\boldsymbol{\theta}$.
Our only link between the observed data $\mathbf{y}$ and $\boldsymbol{\theta}$ is through the function $f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})$. Therefore, it seems sensible that parametric statistical inference should be based on this function. We can think of $f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})$ as a function of $\boldsymbol{\theta}$ for fixed $\mathbf{y}$, which describes the relative likelihoods of different possible (sets of) $\boldsymbol{\theta}$, given observed data $\mathbf{y}$. We write $L(\boldsymbol{\theta}; \mathbf{y}) = f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})$ for this likelihood, which is a function of the unknown parameter $\boldsymbol{\theta}$. For convenience, we often drop $\mathbf{y}$ from the notation, and write $L(\boldsymbol{\theta})$.
The likelihood function is of central importance in parametric statistical inference. It provides a means for comparing different possible values of $\boldsymbol{\theta}$, based on the probabilities (or probability densities) that they assign to the observed data $\mathbf{y}$.
Notes
- Frequently it is more convenient to consider the log-likelihood function $\ell(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta})$.
- Nothing in the definition of the likelihood requires $y_1, \ldots, y_n$ to be observations of independent random variables, although we shall frequently make this assumption.
- Any factors which depend on $\mathbf{y}$ alone (and not on $\boldsymbol{\theta}$) can be ignored when writing down the likelihood. Such factors give no information about the relative likelihoods of different possible values of $\boldsymbol{\theta}$.
Example 2.1 (Bernoulli) $y_1, \ldots, y_n$ are observations of $Y_1, \ldots, Y_n$, independent identically distributed (i.i.d.) Bernoulli($\theta$) random variables. Here
$$f_{Y_i}(y_i; \theta) = \theta^{y_i}(1-\theta)^{1-y_i}, \quad y_i \in \{0, 1\},$$
and the likelihood is
$$L(\theta) = \prod_{i=1}^n \theta^{y_i}(1-\theta)^{1-y_i} = \theta^{\sum_{i=1}^n y_i}(1-\theta)^{n - \sum_{i=1}^n y_i}.$$
The log-likelihood is
$$\ell(\theta) = \sum_{i=1}^n y_i \log\theta + \left(n - \sum_{i=1}^n y_i\right)\log(1-\theta).$$
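For illustration, here is a minimal Python sketch that evaluates this Bernoulli log-likelihood for a few candidate values of $\theta$; the data vector `y` is invented for the example.

```python
import numpy as np

def bernoulli_loglik(theta, y):
    """ell(theta) = sum(y)*log(theta) + (n - sum(y))*log(1 - theta)."""
    y = np.asarray(y)
    n, s = y.size, y.sum()
    return s * np.log(theta) + (n - s) * np.log(1 - theta)

# Hypothetical data: 10 Bernoulli observations
y = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])

# Compare the relative support for a few candidate values of theta
for theta in [0.3, 0.5, 0.6, 0.8]:
    print(f"theta = {theta:.1f}, log-likelihood = {bernoulli_loglik(theta, y):.3f}")
```

The candidate value closest to the sample mean $\bar{y}$ gives the largest log-likelihood, anticipating the result of Example 2.3.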
Example 2.2 (Normal) $y_1, \ldots, y_n$ are observations of $Y_1, \ldots, Y_n$, i.i.d. $N(\mu, \sigma^2)$ random variables. Here $\boldsymbol{\theta} = (\mu, \sigma^2)$ and
$$f_{Y_i}(y_i; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(y_i - \mu)^2\right\},$$
so the likelihood is
$$L(\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(y_i - \mu)^2\right\} = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \mu)^2\right\}.$$
The log-likelihood is
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \mu)^2.$$
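As a quick check of this formula, the following sketch (with invented data and arbitrary candidate values of $\mu$ and $\sigma^2$) evaluates the normal log-likelihood directly and compares it with the sum of log-densities returned by `scipy.stats.norm.logpdf`.

```python
import numpy as np
from scipy.stats import norm

def normal_loglik(mu, sigma2, y):
    """ell(mu, sigma^2) = -n/2 log(2 pi) - n/2 log(sigma^2) - sum((y - mu)^2) / (2 sigma^2)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * n * np.log(sigma2)
            - np.sum((y - mu) ** 2) / (2 * sigma2))

y = np.array([4.2, 5.1, 3.8, 6.0, 5.5])          # hypothetical observations
mu, sigma2 = 5.0, 1.5                            # candidate parameter values

# The two calculations agree: the joint log-density is the sum of the
# independent log-densities.
print(normal_loglik(mu, sigma2, y))
print(norm.logpdf(y, loc=mu, scale=np.sqrt(sigma2)).sum())
```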
Maximum likelihood estimation
One of the primary tasks of parametric statistical inference is estimation of the unknown parameters $\theta_1, \ldots, \theta_p$. Consider the value of $\boldsymbol{\theta}$ which maximises the likelihood function. This is the ‘most likely’ value of $\boldsymbol{\theta}$, the one which makes the observed data ‘most probable’. When we are searching for an estimate of $\boldsymbol{\theta}$, this would seem to be a good candidate.
We call the value of $\boldsymbol{\theta}$ which maximises the likelihood the maximum likelihood estimate (MLE) of $\boldsymbol{\theta}$, denoted by $\hat{\boldsymbol{\theta}}(\mathbf{y})$. $\hat{\boldsymbol{\theta}}(\mathbf{y})$ depends on $\mathbf{y}$, as different observed data samples lead to different likelihood functions. The corresponding function of the random variables $Y_1, \ldots, Y_n$ is called the maximum likelihood estimator and is also denoted by $\hat{\boldsymbol{\theta}}$.
Note that, as $\hat{\boldsymbol{\theta}} = (\hat\theta_1, \ldots, \hat\theta_p)^T$, the MLE for any component of $\boldsymbol{\theta}$ is given by the corresponding component of $\hat{\boldsymbol{\theta}}$. Similarly, the MLE for any function of the parameters $\phi = g(\boldsymbol{\theta})$ is given by $\hat\phi = g(\hat{\boldsymbol{\theta}})$.
As $\log$ is a strictly increasing function, the value of $\boldsymbol{\theta}$ which maximises $L(\boldsymbol{\theta})$ also maximises $\ell(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta})$. It is almost always easier to maximise $\ell(\boldsymbol{\theta})$. This is achieved in the usual way: finding a stationary point by differentiating $\ell(\boldsymbol{\theta})$ with respect to $\theta_1, \ldots, \theta_p$, and solving the resulting $p$ simultaneous equations. It should also be checked that the stationary point is a maximum.
Example 2.3 (Bernoulli) $y_1, \ldots, y_n$ are observations of $Y_1, \ldots, Y_n$, i.i.d. Bernoulli($\theta$) random variables. Here the log-likelihood is
$$\ell(\theta) = \sum_{i=1}^n y_i \log\theta + \left(n - \sum_{i=1}^n y_i\right)\log(1-\theta).$$
Differentiating with respect to $\theta$,
$$\frac{d\ell}{d\theta} = \frac{\sum_{i=1}^n y_i}{\theta} - \frac{n - \sum_{i=1}^n y_i}{1-\theta},$$
so the MLE $\hat\theta$ solves
$$\frac{\sum_{i=1}^n y_i}{\hat\theta} - \frac{n - \sum_{i=1}^n y_i}{1-\hat\theta} = 0.$$
Solving this for $\hat\theta$ gives $\hat\theta = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y}$. Note that
$$\frac{d^2\ell}{d\theta^2} = -\frac{\sum_{i=1}^n y_i}{\theta^2} - \frac{n - \sum_{i=1}^n y_i}{(1-\theta)^2} < 0$$
everywhere, so the stationary point is clearly a maximum.
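The closed-form result $\hat\theta = \bar{y}$ can be cross-checked numerically. The sketch below (with invented data) maximises the log-likelihood by minimising its negative using `scipy.optimize.minimize_scalar`.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_loglik(theta, y):
    """Negative Bernoulli log-likelihood, to be minimised."""
    s, n = y.sum(), y.size
    return -(s * np.log(theta) + (n - s) * np.log(1 - theta))

y = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])     # hypothetical data

# Maximise the log-likelihood by minimising its negative over (0, 1)
res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6),
                      method="bounded", args=(y,))
print("numerical MLE:", res.x)
print("sample mean  :", y.mean())                # agrees with the MLE
```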
Example 2.4 (Normal) $y_1, \ldots, y_n$ are observations of $Y_1, \ldots, Y_n$, i.i.d. $N(\mu, \sigma^2)$ random variables. Here $\boldsymbol{\theta} = (\mu, \sigma^2)$ and the log-likelihood is
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \mu)^2.$$
Differentiating with respect to $\mu$,
$$\frac{\partial\ell}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n(y_i - \mu),$$
so we solve
$$\frac{1}{\hat\sigma^2}\sum_{i=1}^n(y_i - \hat\mu) = 0. \tag{2.1}$$
Differentiating with respect to $\sigma^2$,
$$\frac{\partial\ell}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n(y_i - \mu)^2,$$
so
$$-\frac{n}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\sum_{i=1}^n(y_i - \hat\mu)^2 = 0. \tag{2.2}$$
Solving (Equation 2.1) and (Equation 2.2), we obtain
$$\hat\mu = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y} \quad\text{and}\quad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n(y_i - \bar{y})^2.$$
Strictly, to show that this stationary point is a maximum, we need to show that the Hessian matrix $H$ (the matrix of second derivatives with elements $[H]_{jk} = \frac{\partial^2}{\partial\theta_j\partial\theta_k}\ell(\boldsymbol{\theta})$) is negative definite at $\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}$, that is $\mathbf{x}^T H(\hat{\boldsymbol{\theta}})\mathbf{x} < 0$ for every $\mathbf{x} \neq \mathbf{0}$. Here
$$H(\hat\mu, \hat\sigma^2) = \begin{pmatrix} -\dfrac{n}{\hat\sigma^2} & 0 \\[1ex] 0 & -\dfrac{n}{2\hat\sigma^4} \end{pmatrix},$$
which is clearly negative definite.
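A short numerical sketch of these closed-form estimates, using invented data; note that the MLE $\hat\sigma^2$ divides by $n$, which is what `numpy.var` does by default (`ddof=0`).

```python
import numpy as np

y = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9])    # hypothetical observations

mu_hat = y.mean()                                # MLE of mu: the sample mean
sigma2_hat = np.mean((y - mu_hat) ** 2)          # MLE of sigma^2: divide by n, not n - 1

print(mu_hat, sigma2_hat)
print(np.var(y))                                 # same as sigma2_hat (ddof=0 by default)
```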
Score
Let
$$u_j(\boldsymbol{\theta}) \equiv \frac{\partial}{\partial\theta_j}\ell(\boldsymbol{\theta}; \mathbf{y}), \quad j = 1, \ldots, p,$$
and $\mathbf{u}(\boldsymbol{\theta}) = (u_1(\boldsymbol{\theta}), \ldots, u_p(\boldsymbol{\theta}))^T$. Then we call $\mathbf{u}(\boldsymbol{\theta})$ the vector of scores or score vector. Where $p = 1$ and $\boldsymbol{\theta} = \theta$, the score is the scalar defined as
$$u(\theta) = \frac{d}{d\theta}\ell(\theta; \mathbf{y}).$$
The maximum likelihood estimate $\hat{\boldsymbol{\theta}}$ satisfies
$$\mathbf{u}(\hat{\boldsymbol{\theta}}) = \mathbf{0},$$
that is,
$$\frac{\partial}{\partial\theta_j}\ell(\boldsymbol{\theta}; \mathbf{y})\Big|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = 0, \quad j = 1, \ldots, p.$$
Note that $\mathbf{u}(\boldsymbol{\theta})$ is a function of $\boldsymbol{\theta}$ for fixed (observed) $\mathbf{y}$. However, if we replace $\mathbf{y}$ in $\mathbf{u}(\boldsymbol{\theta}; \mathbf{y})$ by the corresponding random variables $\mathbf{Y} = (Y_1, \ldots, Y_n)$, then we obtain a vector of random variables $\mathbf{U}(\boldsymbol{\theta}) = \mathbf{u}(\boldsymbol{\theta}; \mathbf{Y})$.
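To make the definition concrete, the sketch below (invented data) computes the Bernoulli score from the derivative obtained in Example 2.3 and also by numerical differentiation of the log-likelihood, and confirms that the score vanishes at the MLE $\hat\theta = \bar{y}$.

```python
import numpy as np

def loglik(theta, y):
    s, n = y.sum(), y.size
    return s * np.log(theta) + (n - s) * np.log(1 - theta)

def score(theta, y):
    """u(theta) = sum(y)/theta - (n - sum(y))/(1 - theta)."""
    s, n = y.sum(), y.size
    return s / theta - (n - s) / (1 - theta)

y = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])     # hypothetical data
theta_hat = y.mean()

eps = 1e-6
for theta in [0.4, theta_hat, 0.8]:
    numeric = (loglik(theta + eps, y) - loglik(theta - eps, y)) / (2 * eps)
    print(f"theta = {theta:.2f}: analytic score = {score(theta, y):.4f}, "
          f"numeric = {numeric:.4f}")
# At theta = theta_hat the score is (numerically) zero, as expected.
```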
An important result in likelihood theory is that the expected score at the true (but unknown) value of $\boldsymbol{\theta}$ is zero:
Theorem 2.1 We have $E[\mathbf{U}(\boldsymbol{\theta})] = \mathbf{0}$, i.e.
$$E\left[\frac{\partial}{\partial\theta_j}\ell(\boldsymbol{\theta}; \mathbf{Y})\right] = 0, \quad j = 1, \ldots, p,$$
provided that
- The expectation $E[\mathbf{U}(\boldsymbol{\theta})]$ exists.
- The sample space for $\mathbf{Y}$ does not depend on $\boldsymbol{\theta}$.
Proof. Our proof is for continuous $\mathbf{Y}$ – in the discrete case, replace $\int$ by $\sum$. For each $j = 1, \ldots, p$,
$$E\left[\frac{\partial}{\partial\theta_j}\ell(\boldsymbol{\theta}; \mathbf{Y})\right]
= \int \frac{\partial}{\partial\theta_j}\log f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) \, f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) \, d\mathbf{y}
= \int \frac{\partial}{\partial\theta_j} f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) \, d\mathbf{y}
= \frac{\partial}{\partial\theta_j}\int f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) \, d\mathbf{y}
= \frac{\partial}{\partial\theta_j} 1 = 0,$$
as required.
Example 2.5 (Bernoulli) $y_1, \ldots, y_n$ are observations of $Y_1, \ldots, Y_n$, i.i.d. Bernoulli($\theta$) random variables. Here
$$U(\theta) = \frac{\sum_{i=1}^n Y_i}{\theta} - \frac{n - \sum_{i=1}^n Y_i}{1-\theta}$$
and
$$E[U(\theta)] = \frac{n E[Y_1]}{\theta} - \frac{n - n E[Y_1]}{1-\theta}.$$
Since $E[U(\theta)] = 0$, we must have $E[Y_1] = \theta$ (which we already know is correct).
Example 2.6 (Normal) $y_1, \ldots, y_n$ are observations of $Y_1, \ldots, Y_n$, i.i.d. $N(\mu, \sigma^2)$ random variables. Here
$$\mathbf{U}(\mu, \sigma^2) = \begin{pmatrix} \dfrac{1}{\sigma^2}\sum_{i=1}^n(Y_i - \mu) \\[1ex] -\dfrac{n}{2\sigma^2} + \dfrac{1}{2\sigma^4}\sum_{i=1}^n(Y_i - \mu)^2 \end{pmatrix}
\quad\text{and}\quad
E[\mathbf{U}(\mu, \sigma^2)] = \begin{pmatrix} \dfrac{n\left(E[Y_1] - \mu\right)}{\sigma^2} \\[1ex] -\dfrac{n}{2\sigma^2} + \dfrac{n E\left[(Y_1 - \mu)^2\right]}{2\sigma^4} \end{pmatrix}.$$
Since $E[\mathbf{U}(\mu, \sigma^2)] = \mathbf{0}$, we must have $E[Y_1] = \mu$ and $E[(Y_1 - \mu)^2] = \sigma^2$.
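Theorem 2.1 can also be illustrated by simulation. The sketch below (with an arbitrarily chosen true $\theta$ and sample size) draws many Bernoulli samples and averages the score evaluated at the true parameter; the Monte Carlo average is close to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, n, reps = 0.3, 50, 20000

def score(theta, y):
    """Bernoulli score: u(theta) = sum(y)/theta - (n - sum(y))/(1 - theta)."""
    s = y.sum()
    return s / theta - (y.size - s) / (1 - theta)

samples = rng.binomial(1, theta_true, size=(reps, n))
scores = np.array([score(theta_true, y) for y in samples])

print("Monte Carlo estimate of E[U(theta)]:", scores.mean())   # close to 0
```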
Information
Suppose that $y_1, \ldots, y_n$ are observations of $Y_1, \ldots, Y_n$, whose joint p.d.f. $f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})$ is completely specified except for the values of $p$ unknown parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)^T$. Previously, we defined the Hessian matrix $H(\boldsymbol{\theta})$ to be the matrix with components
$$[H(\boldsymbol{\theta})]_{jk} = \frac{\partial^2}{\partial\theta_j\partial\theta_k}\ell(\boldsymbol{\theta}; \mathbf{y}), \quad j, k = 1, \ldots, p.$$
We call the matrix
$$J(\boldsymbol{\theta}) \equiv -H(\boldsymbol{\theta})$$
the observed information matrix. Where $p = 1$ and $\boldsymbol{\theta} = \theta$, the observed information is a scalar defined as
$$J(\theta) = -\frac{d^2}{d\theta^2}\ell(\theta; \mathbf{y}).$$
As with the score, if we replace $\mathbf{y}$ in $J(\boldsymbol{\theta}; \mathbf{y})$ by the corresponding random variables $\mathbf{Y}$, we obtain a matrix of random variables. Then, we define the expected information matrix or Fisher information matrix
$$\mathcal{I}(\boldsymbol{\theta}) \equiv E[J(\boldsymbol{\theta}; \mathbf{Y})], \quad\text{i.e.}\quad [\mathcal{I}(\boldsymbol{\theta})]_{jk} = E\left[-\frac{\partial^2}{\partial\theta_j\partial\theta_k}\ell(\boldsymbol{\theta}; \mathbf{Y})\right].$$
An important result in likelihood theory is that the variance-covariance matrix of the score vector is equal to the expected information matrix:
Theorem 2.2 We have $\mathrm{Var}[\mathbf{U}(\boldsymbol{\theta})] = \mathcal{I}(\boldsymbol{\theta})$, i.e.
$$\mathrm{Cov}[U_j(\boldsymbol{\theta}), U_k(\boldsymbol{\theta})] = E\left[-\frac{\partial^2}{\partial\theta_j\partial\theta_k}\ell(\boldsymbol{\theta}; \mathbf{Y})\right], \quad j, k = 1, \ldots, p,$$
provided that
- The variance $\mathrm{Var}[\mathbf{U}(\boldsymbol{\theta})]$ exists.
- The sample space for $\mathbf{Y}$ does not depend on $\boldsymbol{\theta}$.
Proof. Our proof is for continuous $\mathbf{Y}$ – in the discrete case, replace $\int$ by $\sum$.
For each $j$ and $k$,
$$\mathrm{Cov}[U_j(\boldsymbol{\theta}), U_k(\boldsymbol{\theta})] = E[U_j(\boldsymbol{\theta})\,U_k(\boldsymbol{\theta})]
= E\left[\frac{\partial}{\partial\theta_j}\log f_{\mathbf{Y}}(\mathbf{Y}; \boldsymbol{\theta})\,\frac{\partial}{\partial\theta_k}\log f_{\mathbf{Y}}(\mathbf{Y}; \boldsymbol{\theta})\right],$$
since $E[U_j(\boldsymbol{\theta})] = E[U_k(\boldsymbol{\theta})] = 0$ by Theorem 2.1.
Now
$$\frac{\partial^2}{\partial\theta_j\partial\theta_k}\log f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})
= \frac{\frac{\partial^2}{\partial\theta_j\partial\theta_k} f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})}{f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})}
- \frac{\partial}{\partial\theta_j}\log f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})\,\frac{\partial}{\partial\theta_k}\log f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}),$$
so
$$E\left[\frac{\partial^2}{\partial\theta_j\partial\theta_k}\ell(\boldsymbol{\theta}; \mathbf{Y})\right]
= \frac{\partial^2}{\partial\theta_j\partial\theta_k}\int f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})\,d\mathbf{y}
- \mathrm{Cov}[U_j(\boldsymbol{\theta}), U_k(\boldsymbol{\theta})]
= -\mathrm{Cov}[U_j(\boldsymbol{\theta}), U_k(\boldsymbol{\theta})],$$
and hence
$$\mathrm{Cov}[U_j(\boldsymbol{\theta}), U_k(\boldsymbol{\theta})] = E\left[-\frac{\partial^2}{\partial\theta_j\partial\theta_k}\ell(\boldsymbol{\theta}; \mathbf{Y})\right] = [\mathcal{I}(\boldsymbol{\theta})]_{jk},$$
as required.
Example 2.7 (Bernoulli) $y_1, \ldots, y_n$ are observations of $Y_1, \ldots, Y_n$, i.i.d. Bernoulli($\theta$) random variables. Here
$$J(\theta) = -\frac{d^2\ell}{d\theta^2} = \frac{\sum_{i=1}^n y_i}{\theta^2} + \frac{n - \sum_{i=1}^n y_i}{(1-\theta)^2}$$
and
$$\mathcal{I}(\theta) = E[J(\theta; \mathbf{Y})] = \frac{n\theta}{\theta^2} + \frac{n - n\theta}{(1-\theta)^2} = \frac{n}{\theta} + \frac{n}{1-\theta} = \frac{n}{\theta(1-\theta)}.$$
Example 2.8 (Normal) $y_1, \ldots, y_n$ are observations of $Y_1, \ldots, Y_n$, i.i.d. $N(\mu, \sigma^2)$ random variables. Here
$$J(\mu, \sigma^2) = -H(\mu, \sigma^2) = \begin{pmatrix} \dfrac{n}{\sigma^2} & \dfrac{1}{\sigma^4}\sum_{i=1}^n(y_i - \mu) \\[1ex] \dfrac{1}{\sigma^4}\sum_{i=1}^n(y_i - \mu) & -\dfrac{n}{2\sigma^4} + \dfrac{1}{\sigma^6}\sum_{i=1}^n(y_i - \mu)^2 \end{pmatrix}$$
and $E\left[\sum_{i=1}^n(Y_i - \mu)\right] = 0$, $E\left[\sum_{i=1}^n(Y_i - \mu)^2\right] = n\sigma^2$. Therefore
$$\mathcal{I}(\mu, \sigma^2) = \begin{pmatrix} \dfrac{n}{\sigma^2} & 0 \\[1ex] 0 & \dfrac{n}{2\sigma^4} \end{pmatrix}.$$
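Theorem 2.2 can be checked by simulation in the Bernoulli case, where $\mathcal{I}(\theta) = n/\{\theta(1-\theta)\}$. The sketch below uses an arbitrarily chosen $\theta$ and $n$.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 0.3, 50, 20000

samples = rng.binomial(1, theta, size=(reps, n))
s = samples.sum(axis=1)                          # sum of each simulated sample
scores = s / theta - (n - s) / (1 - theta)       # score evaluated at the true theta

print("Var[U(theta)] (simulated)   :", scores.var())
print("I(theta) = n/(theta(1-theta)):", n / (theta * (1 - theta)))
```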
Asymptotic distribution of the MLE
Maximum likelihood estimation is an attractive method of estimation for a number of reasons. It is intuitively sensible and usually reasonably straightforward to carry out. Even when the simultaneous equations we obtain by differentiating the log-likelihood function are impossible to solve directly, solution by numerical methods is usually feasible.
Perhaps the most compelling reason for considering maximum likelihood estimation is the asymptotic behaviour of maximum likelihood estimators.
Suppose that $y_1, \ldots, y_n$ are observations of independent random variables $Y_1, \ldots, Y_n$, whose joint p.d.f. $f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})$ is completely specified except for the values of an unknown parameter vector $\boldsymbol{\theta}$, and that $\hat{\boldsymbol{\theta}}$ is the maximum likelihood estimator of $\boldsymbol{\theta}$.
Then, as $n \to \infty$, the distribution of $\hat{\boldsymbol{\theta}}$ tends to a multivariate normal distribution with mean vector $\boldsymbol{\theta}$ and variance-covariance matrix $\mathcal{I}(\boldsymbol{\theta})^{-1}$.
Where $p = 1$ and $\boldsymbol{\theta} = \theta$, the distribution of the MLE $\hat\theta$ tends to $N(\theta, \mathcal{I}(\theta)^{-1})$.
For ‘large enough $n$’, we can treat the asymptotic distribution of the MLE as an approximation. The fact that $E[\hat{\boldsymbol{\theta}}] \approx \boldsymbol{\theta}$ means that the maximum likelihood estimator is approximately unbiased for large samples. The variance of $\hat{\boldsymbol{\theta}}$ is approximately $\mathcal{I}(\boldsymbol{\theta})^{-1}$. It is possible to show that this is the smallest possible variance of any unbiased estimator of $\boldsymbol{\theta}$ (this result is called the Cramér–Rao lower bound, which we do not prove here). Therefore the MLE is the ‘best possible’ estimator in large samples (and therefore we hope also reasonable in small samples, though we should investigate this case by case).
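The quality of the asymptotic approximation can be explored by simulation. The sketch below (arbitrary true $\theta$, $n$ and number of replicates) compares the empirical mean and variance of the Bernoulli MLE $\hat\theta = \bar{Y}$ across simulated samples with the asymptotic values $\theta$ and $\mathcal{I}(\theta)^{-1} = \theta(1-\theta)/n$.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 0.3, 200, 20000

# MLE (the sample mean) in each simulated replicate
theta_hats = rng.binomial(1, theta, size=(reps, n)).mean(axis=1)

print("mean of MLEs    :", theta_hats.mean(), "(true theta =", theta, ")")
print("variance of MLEs:", theta_hats.var())
print("1/I(theta)      :", theta * (1 - theta) / n)
```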
Quantifying uncertainty in parameter estimates
The usefulness of an estimate is always enhanced if some kind of measure of its precision can also be provided. Usually, this will be a standard error, an estimate of the standard deviation of the associated estimator. For the maximum likelihood estimator $\hat\theta$ of a scalar parameter $\theta$, a standard error is given by
$$\mathrm{s.e.}(\hat\theta) = \mathcal{I}(\hat\theta)^{-1/2},$$
and for a vector parameter $\boldsymbol{\theta}$,
$$\mathrm{s.e.}(\hat\theta_j) = \left[\mathcal{I}(\hat{\boldsymbol{\theta}})^{-1}\right]_{jj}^{1/2}, \quad j = 1, \ldots, p.$$
An alternative summary of the information provided by the observed data about the location of a parameter and the associated precision is a confidence interval.
The asymptotic distribution of the maximum likelihood estimator can be used to provide approximate large sample confidence intervals. Asymptotically, $\hat\theta_j$ has a $N\left(\theta_j, \left[\mathcal{I}(\boldsymbol{\theta})^{-1}\right]_{jj}\right)$ distribution and we can find $z_{1-\alpha/2}$ such that
$$P\left(\theta_j - z_{1-\alpha/2}\left[\mathcal{I}(\boldsymbol{\theta})^{-1}\right]_{jj}^{1/2} \le \hat\theta_j \le \theta_j + z_{1-\alpha/2}\left[\mathcal{I}(\boldsymbol{\theta})^{-1}\right]_{jj}^{1/2}\right) = 1 - \alpha.$$
Therefore
$$P\left(\hat\theta_j - z_{1-\alpha/2}\left[\mathcal{I}(\boldsymbol{\theta})^{-1}\right]_{jj}^{1/2} \le \theta_j \le \hat\theta_j + z_{1-\alpha/2}\left[\mathcal{I}(\boldsymbol{\theta})^{-1}\right]_{jj}^{1/2}\right) = 1 - \alpha.$$
The endpoints of this interval cannot be evaluated because they also depend on the unknown parameter vector $\boldsymbol{\theta}$. However, if we replace $\boldsymbol{\theta}$ by its MLE $\hat{\boldsymbol{\theta}}$, we obtain the approximate large sample $100(1-\alpha)\%$ confidence interval
$$\hat\theta_j \pm z_{1-\alpha/2}\left[\mathcal{I}(\hat{\boldsymbol{\theta}})^{-1}\right]_{jj}^{1/2}.$$
For $\alpha = 0.05$, $z_{1-\alpha/2} = 1.96$.
Example 2.9 (Bernoulli) If $y_1, \ldots, y_n$ are observations of $Y_1, \ldots, Y_n$, i.i.d. Bernoulli($\theta$) random variables, then asymptotically $\hat\theta = \bar{Y}$ has a $N\left(\theta, \theta(1-\theta)/n\right)$ distribution, and a large sample 95% confidence interval for $\theta$ is
$$\hat\theta \pm 1.96\sqrt{\frac{\hat\theta(1-\hat\theta)}{n}}.$$
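A minimal sketch computing this interval for an invented data set:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
n = y.size
theta_hat = y.mean()
se = np.sqrt(theta_hat * (1 - theta_hat) / n)    # estimated standard error of the MLE

lower, upper = theta_hat - 1.96 * se, theta_hat + 1.96 * se
print(f"MLE = {theta_hat:.3f}, approximate 95% CI = ({lower:.3f}, {upper:.3f})")
```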
Comparing statistical models
If we have a set of competing probability models which might have generated the observed data, we may want to determine which of the models is most appropriate. In practice, we proceed by comparing models pairwise. Suppose that we have two competing alternatives, $H_0$ (model 0) and $H_1$ (model 1), for the joint distribution of $Y_1, \ldots, Y_n$. Often $H_0$ and $H_1$ both take the same parametric form $f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})$, but with $\boldsymbol{\theta} \in \Theta_0$ for $H_0$ and $\boldsymbol{\theta} \in \Theta_1$ for $H_1$, where $\Theta_0$ and $\Theta_1$ are alternative sets of possible values for $\boldsymbol{\theta}$. In the regression setting, we are often interested in determining which of a set of explanatory variables have an impact on the distribution of the response.
Hypothesis testing
A hypothesis test provides one mechanism for comparing two competing statistical models. A hypothesis test does not treat the two hypotheses (models) symmetrically. One hypothesis, $H_0$, is accorded special status, and referred to as the null hypothesis. The null hypothesis is the reference model, and will be assumed to be appropriate unless the observed data strongly indicate that $H_0$ is inappropriate, and that $H_1$ (the alternative hypothesis) should be preferred. The fact that a hypothesis test does not reject $H_0$ should not be taken as evidence that $H_0$ is true and $H_1$ is not, or that $H_0$ is better supported by the data than $H_1$, merely that the data does not provide sufficient evidence to reject $H_0$ in favour of $H_1$.
A hypothesis test is defined by its critical region or rejection region, which we shall denote by $C$. $C$ is a subset of $\mathbb{R}^n$ and is the set of possible $\mathbf{y}$ which would lead to rejection of $H_0$ in favour of $H_1$, i.e.
- If $\mathbf{y} \in C$, $H_0$ is rejected in favour of $H_1$;
- If $\mathbf{y} \notin C$, $H_0$ is not rejected.
As $\mathbf{Y}$ is a random variable, there remains the possibility that a hypothesis test will produce an erroneous result. We define the size (or significance level) of the test
$$\alpha = \max_{\boldsymbol{\theta} \in \Theta_0} P(\mathbf{Y} \in C; \boldsymbol{\theta}).$$
This is the maximum probability of erroneously rejecting $H_0$, over all possible distributions for $\mathbf{Y}$ implied by $H_0$. We also define the power function
$$w(\boldsymbol{\theta}) = P(\mathbf{Y} \in C; \boldsymbol{\theta}).$$
It represents the probability of rejecting $H_0$ for a particular value of $\boldsymbol{\theta}$. Clearly we would like to find a test where $w(\boldsymbol{\theta})$ is large for every $\boldsymbol{\theta} \in \Theta_1$, while at the same time avoiding erroneous rejection of $H_0$. In other words, a good test will have small size, but large power.
The general hypothesis testing procedure is to fix $\alpha$ to be some small value (often 0.05), so that the probability of erroneous rejection of $H_0$ is limited. In doing this, we are giving $H_0$ precedence over $H_1$. Given our specified $\alpha$, we try to choose a test, defined by its rejection region $C$, to make $w(\boldsymbol{\theta})$ as large as possible for $\boldsymbol{\theta} \in \Theta_1$.
Likelihood ratio tests for nested hypotheses
Suppose that $H_0$ and $H_1$ both take the same parametric form $f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})$, with $\boldsymbol{\theta} \in \Theta_0$ for $H_0$ and $\boldsymbol{\theta} \in \Theta_1$ for $H_1$, where $\Theta_0$ and $\Theta_1$ are alternative sets of possible values for $\boldsymbol{\theta}$. A likelihood ratio test of $H_0$ against $H_1$ has a critical region of the form
$$C = \left\{\mathbf{y} : \frac{\max_{\boldsymbol{\theta} \in \Theta_1} L(\boldsymbol{\theta}; \mathbf{y})}{\max_{\boldsymbol{\theta} \in \Theta_0} L(\boldsymbol{\theta}; \mathbf{y})} > k\right\}, \tag{2.3}$$
where $k$ is determined by $\alpha$, the size of the test, so that
$$\max_{\boldsymbol{\theta} \in \Theta_0} P(\mathbf{Y} \in C; \boldsymbol{\theta}) = \alpha.$$
Therefore, we will only reject $H_0$ if $H_1$ offers a distribution for $\mathbf{Y}$ which makes the observed data much more probable than any distribution under $H_0$. This is intuitively appealing and tends to produce good tests (large power) across a wide range of examples.
In order to determine $k$ in Equation 2.3, we need to know the distribution of the likelihood ratio, or an equivalent statistic, under $H_0$. In general, this will not be available to us. However, we can make use of an important asymptotic result.
First we notice that, as $\log$ is a strictly increasing function, the rejection region is equivalent to
$$C = \left\{\mathbf{y} : L_{01}(\mathbf{y}) > k^*\right\},$$
where
$$L_{01}(\mathbf{y}) = 2\log\left(\frac{\max_{\boldsymbol{\theta} \in \Theta_1} L(\boldsymbol{\theta}; \mathbf{y})}{\max_{\boldsymbol{\theta} \in \Theta_0} L(\boldsymbol{\theta}; \mathbf{y})}\right) = 2\left(\max_{\boldsymbol{\theta} \in \Theta_1}\ell(\boldsymbol{\theta}; \mathbf{y}) - \max_{\boldsymbol{\theta} \in \Theta_0}\ell(\boldsymbol{\theta}; \mathbf{y})\right)$$
and $k^* = 2\log k$. Write $L_{01}$ for the log-likelihood ratio test statistic. Provided that $H_0$ is nested within $H_1$, the following result provides a useful large-$n$ approximation to the distribution of $L_{01}$ under $H_0$.
Theorem 2.3 Suppose that $H_0$: $\boldsymbol{\theta} \in \Theta_0$ and $H_1$: $\boldsymbol{\theta} \in \Theta_1$, where $\Theta_0 \subset \Theta_1$. Let $d_0 = \dim(\Theta_0)$ and $d_1 = \dim(\Theta_1)$. Under $H_0$, the distribution of $L_{01}$ tends towards $\chi^2_{d_1 - d_0}$ as $n \to \infty$.
Proof. First we note that, in the case where $\theta$ is one-dimensional and $\Theta = \mathbb{R}$, a Taylor series expansion of $\ell(\theta)$ around the MLE $\hat\theta$ gives
$$\ell(\theta) = \ell(\hat\theta) + (\theta - \hat\theta)\,u(\hat\theta) + \frac{1}{2}(\theta - \hat\theta)^2\,\ell''(\hat\theta) + \cdots.$$
Now, $u(\hat\theta) = 0$, and if we approximate $-\ell''(\hat\theta) = J(\hat\theta)$ by $\mathcal{I}(\theta)$, and also ignore higher order terms, we obtain
$$2\left[\ell(\hat\theta) - \ell(\theta)\right] \approx \mathcal{I}(\theta)\,(\hat\theta - \theta)^2.$$
As $\hat\theta$ is asymptotically $N(\theta, \mathcal{I}(\theta)^{-1})$, $\mathcal{I}(\theta)(\hat\theta - \theta)^2$ is asymptotically $\chi^2_1$, and hence so is $2[\ell(\hat\theta) - \ell(\theta)]$.
Similarly it can be shown that when $\boldsymbol{\theta} \in \Theta$, a multidimensional space, $2[\ell(\hat{\boldsymbol{\theta}}) - \ell(\boldsymbol{\theta})]$ is asymptotically $\chi^2_d$, where $d$ is the dimension of $\Theta$.
Now, suppose that $H_0$ is true and $\boldsymbol{\theta} \in \Theta_0$, and therefore $\boldsymbol{\theta} \in \Theta_1$. Furthermore, suppose that $\ell(\boldsymbol{\theta})$ is maximised in $\Theta_0$ by $\hat{\boldsymbol{\theta}}_0$ and is maximised in $\Theta_1$ by $\hat{\boldsymbol{\theta}}_1$. Then
$$L_{01} = 2\left[\ell(\hat{\boldsymbol{\theta}}_1) - \ell(\hat{\boldsymbol{\theta}}_0)\right] = 2\left[\ell(\hat{\boldsymbol{\theta}}_1) - \ell(\boldsymbol{\theta})\right] - 2\left[\ell(\hat{\boldsymbol{\theta}}_0) - \ell(\boldsymbol{\theta})\right].$$
Therefore
$$2\left[\ell(\hat{\boldsymbol{\theta}}_1) - \ell(\boldsymbol{\theta})\right] = L_{01} + 2\left[\ell(\hat{\boldsymbol{\theta}}_0) - \ell(\boldsymbol{\theta})\right],$$
and we know that, under $H_0$, $2[\ell(\hat{\boldsymbol{\theta}}_1) - \ell(\boldsymbol{\theta})]$ has a $\chi^2_{d_1}$ distribution and $2[\ell(\hat{\boldsymbol{\theta}}_0) - \ell(\boldsymbol{\theta})]$ has a $\chi^2_{d_0}$ distribution. Furthermore, it is possible to show (although we will not do so here) that, under $H_0$, $L_{01}$ and $2[\ell(\hat{\boldsymbol{\theta}}_0) - \ell(\boldsymbol{\theta})]$ are independent. It can also be shown that under $H_0$ the difference $L_{01}$ can be expressed as a quadratic form of normal random variables, and so has a $\chi^2$ distribution. Therefore, it follows that under $H_0$, the log likelihood ratio statistic $L_{01}$ has a $\chi^2_{d_1 - d_0}$ distribution. ◻
Example 2.10 $y_1, \ldots, y_n$ are observations of $Y_1, \ldots, Y_n$, i.i.d. Bernoulli($\theta$) random variables. Suppose that we require a size $\alpha$ test of the hypothesis $H_0$: $\theta = \theta_0$ against the general alternative $H_1$: ‘$\theta$ is unrestricted’, where $\theta_0$ and $\alpha$ are specified.
Here $\Theta_0 = \{\theta_0\}$, $\Theta_1 = [0, 1]$, and the unrestricted MLE is $\hat\theta = \bar{y}$, so the log likelihood ratio statistic is
$$L_{01} = 2\left[\ell(\hat\theta) - \ell(\theta_0)\right] = 2\left[\sum_{i=1}^n y_i \log\frac{\hat\theta}{\theta_0} + \left(n - \sum_{i=1}^n y_i\right)\log\frac{1-\hat\theta}{1-\theta_0}\right].$$
As $d_1 = 1$ and $d_0 = 0$, under $H_0$, the log likelihood ratio statistic has an asymptotic $\chi^2_1$ distribution. For a log likelihood ratio test, we only reject $H_0$ in favour of $H_1$ when the test statistic is too large (observed data are much more probable under model $H_1$ than under model $H_0$), so in this case we reject $H_0$ when the observed value of the test statistic above is ‘too large’ to have come from a $\chi^2_1$ distribution. What we mean by ‘too large’ depends on the significance level of the test. For example, if $\alpha = 0.05$, a common choice, then we should reject $H_0$ if the test statistic is greater than 3.84, the 95% quantile of the $\chi^2_1$ distribution.
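A short sketch (invented data and an arbitrarily chosen $\theta_0$) that computes this log likelihood ratio statistic and compares it with the $\chi^2_1$ distribution using `scipy.stats.chi2`:

```python
import numpy as np
from scipy.stats import chi2

def loglik(theta, y):
    s, n = y.sum(), y.size
    return s * np.log(theta) + (n - s) * np.log(1 - theta)

y = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
theta0 = 0.5                                     # hypothesised value under H0
theta_hat = y.mean()                             # unrestricted MLE under H1

lrt = 2 * (loglik(theta_hat, y) - loglik(theta0, y))
crit = chi2.ppf(0.95, df=1)                      # 3.84, the 95% quantile of chi^2_1
p_value = chi2.sf(lrt, df=1)

print(f"LRT statistic = {lrt:.3f}, critical value = {crit:.3f}, p-value = {p_value:.3f}")
print("Reject H0 at the 5% level" if lrt > crit else "Do not reject H0")
```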