\( \newcommand{\bm}[1]{\boldsymbol{\mathbf{#1}}} \)

Chapter 10 Parameter estimation

10.1 Estimators and estimates

Suppose that \(Y_1, \ldots, Y_n\) are independent identically distributed random variables, each with a distribution depending on unknown parameter(s) \(\theta\). In the continuous case, we know the density function \(f(y_i; \theta)\) up to the unknown \(\theta\), and in the discrete case we have probability function \(p(y_i; \theta)\). We might be interested in estimating \(\theta\) based on \(Y_1, \ldots, Y_n\).

An estimator \(T = T(Y_1, ...,Y_n)\) is just some suitable function (a “statistic”) of \(Y_1, Y_2, \ldots, Y_n\), which is used to estimate a parameter. It must not contain any unknown parameters or any unobservable quantities. For example, the sample mean \(\bar Y\) is an obvious estimator for the population mean.

Note that an estimator is a function of random variables and hence can itself be treated as a random variable. Once a set of data has been collected (i.e. the observed values \(y_1, y_2, \ldots ,y_n\) of \(Y_1, Y_2, \ldots, Y_n\) are available), then \(T(y_1, ...,y_n)\) is the corresponding estimate. When deriving properties of \(T\), one should always work with the estimator.

You have already seen several desirable properties of estimators in MATH1024, which we now review.

10.2 Bias

Definition 10.1 The bias of an estimator \(T = T(Y_1, \ldots, Y_n)\) of a parameter \(\theta\) is: \[B(T; \theta) = E(T) - \theta.\] \(T\) is said to be an unbiased estimator of \(\theta\) if \[E(T) - \theta = 0\] for all possible values of \(\theta\).

Example 10.1 (Bias of estimator of Bernoulli success probability) Let \(Y_1, Y_2, \ldots, Y_n\) be independent and identically distributed, where each \(Y_i \sim \text{Bernoulli}(\theta)\).

We know that \(E(Y_i) = \theta\), so a natural estimator of \(\theta\) is \(T = \bar Y = \frac{1}{n} \sum_{i=1}^n Y_i\), the sample mean.

Here \(T\) is an unbiased estimator of \(\theta\), as \[E(T) = E(\bar Y) = \frac{1}{n} \sum_{i=1}^n E(Y_i) = \frac{1}{n} n \theta = \theta.\]

Example 10.2 (Bias of estimator of exponential rate parameter) Let \(Y_1, Y_2, \ldots, Y_n\) be independent and identically distributed, where each \(Y_i \sim \text{exponential}(\theta)\).

We know that \(E(Y_i) = 1/\theta,\) so one possible estimator for \(\theta\) is \(T = 1 / \bar Y\).

Since \(Y_i \sim \text{gamma}(1, \theta)\), by Propositions 6.4 and 6.5, \[\bar Y \sim \text{gamma}(n, n \theta).\]

Recall from Proposition 6.3 that \(E(X) = \alpha/ \beta\) if \(X \sim \text{gamma}(\alpha, \beta)\). So \(\bar Y\) is an unbiased estimator of \(1/\theta\), as \[E(\bar Y) = \frac{n}{n \theta} = \frac{1}{\theta}.\]

Given this, an obvious choice for an estimator of \(\theta\) is \(T = 1 / \bar Y\). Since \(\bar Y \sim \text{gamma}(n, n \theta)\), we have \[\begin{align*} E(1 / \bar Y) &= \int_0^\infty \frac{1}{y} \frac{(n \theta)^n}{\Gamma(n)} y^{n - 1} e^{- n \theta y} dy \\ &= \frac{(n \theta)^n}{\Gamma(n)} \int_0^\infty y^{n-2} e^{- n \theta y} dy \\ &= \frac{(n \theta)^n}{\Gamma(n)} \frac{\Gamma(n-1)}{(n \theta)^{n-1}} \int_0^\infty \frac{(n \theta)^{n-1}}{\Gamma(n-1)} y^{n-2} e^{- n \theta y} dy \\ &\quad \text{(integrating the $\text{gamma}(n - 1, n \theta)$ pdf)} \\ &= \frac{\Gamma(n-1)}{\Gamma(n)} n \theta \cdot 1 \\ &= \frac{1}{n-1} n \theta, \end{align*}\] since \(\Gamma(n) = (n-1) \Gamma(n - 1)\), by Proposition 6.1. So the bias of \(1/\bar Y\) as an estimator of \(\theta\) is \[B(1/\bar Y; \theta) = \frac{n}{n-1} \theta - \theta = \frac{1}{n-1} \theta \not = 0.\] \(1/\bar Y\) is not an unbiased estimator of \(\theta\), but the bias shrinks with \(n\).
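
As a quick numerical sanity check of this bias formula, the following sketch simulates many exponential samples and compares the Monte Carlo average of \(1/\bar Y\) with the theoretical value \(n\theta/(n-1)\). It assumes NumPy is available; the values of \(\theta\), \(n\) and the number of replications are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 2.0, 10, 200_000   # arbitrary illustrative values

# NumPy parametrises the exponential by its scale, which is 1/theta here
samples = rng.exponential(scale=1 / theta, size=(reps, n))
t = 1 / samples.mean(axis=1)        # one value of T = 1 / Ybar per replication

print("Monte Carlo E(1/Ybar):", t.mean())
print("Theory n*theta/(n-1) :", n * theta / (n - 1))
print("Monte Carlo bias     :", t.mean() - theta)
print("Theory theta/(n-1)   :", theta / (n - 1))
```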

10.3 Consistency

Definition 10.2 An estimator \(T = T(Y_1, \ldots,Y_n)\) of a parameter \(\theta\) is said to be a consistent estimator of \(\theta\) if:

  1. it is asymptotically unbiased: \(B(T; \theta) \rightarrow 0\) as \(n \rightarrow \infty\); and
  2. \(\text{Var}(T) \rightarrow 0\) as \(n \rightarrow \infty\),
for all possible values of \(\theta\).

Note that a consistent estimator can be biased, and an unbiased estimator can be inconsistent (i.e. not consistent). So consistency and unbiasedness are different types of property.

Example 10.3 (Consistency of estimator of Bernoulli success probability) We continue Example 10.1, with \(Y_i \sim \text{Bernoulli}(\theta)\) and \(T = \bar Y\). \(T\) is a consistent estimator of \(\theta\), as:

  1. \(E(T) = E(\bar Y) = \theta\), so \(B(T; \theta) = 0\). Therefore \(B(T; \theta) \rightarrow 0\) as \(n \rightarrow \infty\) (any unbiased estimator is asymptotically unbiased).
  2. The variance is \[\begin{align*} \text{Var}(T) &= \text{Var}(\bar Y) = \text{Var}\left(\frac{1}{n} \sum_{i=1}^n Y_i \right) \\ &= \frac{1}{n^2} \sum_{i=1}^n \text{Var}(Y_i) \quad \text{by independence} \\ &= \frac{1}{n^2} n \theta (1 - \theta) \\ &\rightarrow 0 \quad \text{as $n \rightarrow \infty$, for all $\theta \in [0, 1]$.} \end{align*}\]
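
A small simulation can illustrate the second condition: the variance of \(\bar Y\) shrinks as \(n\) grows, in line with \(\theta(1-\theta)/n\). The sketch below assumes NumPy is available; the value of \(\theta\) and the grid of sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, reps = 0.3, 100_000          # arbitrary illustrative values

for n in (10, 100, 1000):
    # each row is a Bernoulli(theta) sample of size n; take its sample mean
    ybar = rng.binomial(1, theta, size=(reps, n)).mean(axis=1)
    print(f"n = {n:4d}   simulated Var(Ybar) = {ybar.var():.6f}"
          f"   theta(1-theta)/n = {theta * (1 - theta) / n:.6f}")
```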

Example 10.4 (Consistency of estimator of exponential rate parameter) We continue Example 10.2, with \(Y_i \sim \text{exponential}(\theta)\) and \(T = \frac{1}{\bar Y}\). \(T\) is a consistent estimator of \(\theta\), as:

  1. The bias of \(T\) is \[B(T; \theta) = \frac{1}{n-1} \theta \rightarrow 0\] as \(n \rightarrow \infty\).
  2. The variance of \(T\) is \(\text{Var}(T) = E(T^2) - [E(T)]^{2}\). Recall \(\bar Y \sim \text{gamma}(n, n \theta)\), so \[\begin{align*} E(T^2) &= E({\bar Y}^{-2}) = \int_0^\infty \frac{1}{y^2} \frac{(n \theta)^n}{\Gamma(n)} y^{n - 1} e^{- n \theta y} dy \\ &= \frac{(n \theta)^n}{\Gamma(n)} \int_0^\infty y^{n-3} e^{- n \theta y} dy \\ &= \frac{(n \theta)^n}{\Gamma(n)} \frac{\Gamma(n-2)}{(n \theta)^{n-2}} \int_0^\infty \frac{(n \theta)^{n-2}}{\Gamma(n-2)} y^{n-3} e^{- n \theta y} dy \\ &\quad \text{(integrating the $\text{gamma}(n - 2, n \theta)$ pdf)} \\ &= \frac{\Gamma(n-2)}{\Gamma(n)} (n \theta)^2 \cdot 1 \\ &= \frac{1}{(n-1)(n-2)} (n \theta)^2, \end{align*}\] since \(\Gamma(n) = (n-1) \Gamma(n - 1) = (n-1) (n-2) \Gamma(n-2)\), by Proposition 6.1. So \[\begin{align*} \text{Var}(T) &= E(T^2) - [E(T)]^{2} \\ &= \frac{1}{(n-1)(n-2)} (n \theta)^2 - \frac{1}{(n-1)^2} (n \theta)^2 \\ &= \frac{(n \theta)^2}{(n-1)^2} \left[\frac{n-1}{n-2} - 1 \right] \\ &\rightarrow 0 \quad \text{as $n \rightarrow \infty$, for all $\theta>0$.} \end{align*}\]

10.4 Mean squared error

If we have two unbiased estimators, \(T_1\) and \(T_2\), of \(\theta\) then typically we prefer the one with the smaller variance. However, having a small variance on its own is not sufficient if the estimator is biased.

Definition 10.3 The mean squared error (or MSE) of an estimator \(T\) of \(\theta\) is \[\text{MSE}(T; \theta) = E\{(T - \theta)^2\},\] the mean of the squared distance of \(T\) from its target value \(\theta\).

We can use the MSE to choose between competing estimators whether they are unbiased or not.

Proposition 10.1 For any estimator \(T\) of \(\theta\), \[\text{MSE}(T; \theta) = \text{Var}(T) + \left[B(T; \theta)\right]^2.\]
Proof. We have \[\begin{align*} \text{MSE}(T; \theta) &= E\{(T - \theta)^2\} \\ &= E\{(T - E(T) + E(T) - \theta)^2\} \\ &= E\left\{(T - E(T))^2 + 2(T - E(T))(E(T) - \theta) + (E(T) - \theta)^2 \right\} \\ &= E\left\{(T - E(T))^2 \right\} + 2(E(T) - \theta) E\left\{T - E(T) \right\} +(E(T) - \theta)^2 \\ &= \text{Var}(T) + 2(E(T) - \theta) \cdot 0 + \left[B(T; \theta)\right]^2 \\ &= \text{Var}(T) + \left[B(T; \theta)\right]^2, \end{align*}\] as required.

An immediate consequence of Proposition 10.1 is that if \(T\) is an unbiased estimator of \(\theta\), then \(\text{MSE}(T; \theta) = \text{Var}(T)\).
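
The decomposition in Proposition 10.1 is easy to check numerically for a biased estimator. The sketch below (NumPy assumed; \(\theta\), \(n\) and the number of replications are arbitrary) estimates \(E\{(T - \theta)^2\}\) directly for \(T = 1/\bar Y\) from Example 10.2, and compares it with \(\text{Var}(T) + [B(T; \theta)]^2\) computed from the closed forms derived in Examples 10.2 and 10.4.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 1.5, 20, 200_000   # arbitrary illustrative values

samples = rng.exponential(scale=1 / theta, size=(reps, n))
t = 1 / samples.mean(axis=1)        # T = 1 / Ybar, one value per replication

mse_mc = np.mean((t - theta) ** 2)  # direct Monte Carlo estimate of E{(T - theta)^2}
bias = theta / (n - 1)              # closed-form bias from Example 10.2
var_t = (n * theta) ** 2 * (1 / ((n - 1) * (n - 2)) - 1 / (n - 1) ** 2)  # Example 10.4

print("Monte Carlo MSE         :", mse_mc)
print("Var(T) + bias^2 (theory):", var_t + bias ** 2)
```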

Note that it is not possible to find a uniformly minimum MSE estimator, that is, an estimator which has a lower MSE than any other estimator for all values of \(\theta\). For instance, consider the estimator \(T^* = 1\), which takes the simple (but usually bad!) strategy of estimating \(\theta\) as \(1\), irrespective of the observed data. This has zero mean squared error at \(\theta = 1\), so will beat any other estimator at \(\theta = 1\). However, it will be a very bad estimator for other values of \(\theta\).

It is sometimes possible to find a uniformly minimum MSE estimator within a class of “sensible” estimators.

Example 10.5 (Comparing estimators of the normal variance parameter) Suppose that \(Y_1, Y_2, \ldots, Y_n\) are independent, with each \(Y_i \sim N(\mu, \sigma^2)\), and that we wish to estimate \(\sigma^2\). Two possible choices are: \(S^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar Y)^2\) and \(\frac{1}{n} \sum_{i=1}^n (Y_i - \bar Y)^2.\) You can think of these as special cases of estimators of the form \[T_c = c \sum_{i=1}^n (Y_i - \bar Y)^2,\] where \(c\) is some positive constant.

Can we find a value of \(c\) such that \(\text{MSE}(T_c; \sigma^2)\) is minimised for all \(\sigma^2 > 0\)? From Theorem 6.1, \(\frac{(n-1) S^2}{\sigma^2} \sim \chi^2_{n-1}\), so \[\frac{T_c}{c \sigma^2} = \frac{\sum_{i=1}^n (Y_i - \bar Y)^2}{\sigma^2} \sim \chi^2_{n-1}.\] We may use this to find \(\text{MSE}(T_c; \sigma^2)\) for any \(c\), and then minimise it with respect to \(c\).

By Proposition 10.1, \[\text{MSE}(T_c; \sigma^2) = \text{Var}(T_c) + \left[B(T_c; \sigma^2)\right]^2.\] Let \(X = \frac{T_c}{c \sigma^2} \sim \chi^2_{n-1}\). Recall from Section 6.3 that \(E(X) = n - 1\) and \(\text{Var}(X) = 2(n-1)\). We have \[E(T_c) = E(c \sigma^2 X) = c \sigma^2 E(X) = c \sigma^2 (n-1),\] so \[B(T_c; \sigma^2) = \sigma^2 [c (n-1) - 1].\] The variance of \(T_c\) is \[\text{Var}(T_c) = \text{Var}(c \sigma^2 X) = c^2 \sigma^4 \text{Var}(X) = c^2 \sigma^4 2(n-1),\] so the mean squared error is \[\begin{align*} \text{MSE}(T_c; \sigma^2) &= c^2 \sigma^4 2(n-1) + \sigma^4[c (n-1) - 1]^2 \\ &= \sigma^4 \left[2c^2 (n-1) + c^2 (n-1)^2 - 2c(n-1) + 1\right]. \end{align*}\] To minimise this, for any \(\sigma^2\), we need to find \(c\) to minimise \[g(c) = 2c^2 (n-1) + c^2 (n-1)^2 - 2c(n-1) + 1.\] Differentiating, we have \[g^\prime(c) = 4c(n-1) + 2c(n-1)^2 - 2(n-1),\] so we want to find \(c^*\) such that \(g^\prime(c^*) = 0\), or \[4c^*(n-1) + 2c^*(n-1)^2 - 2(n-1) = 0.\] Dividing by \(2(n - 1)\) and rearranging gives \(c^*(2 + n - 1) = 1,\) so \(c^* = \frac{1}{n+1}\). Since \(g(c)\) is a quadratic in \(c\) with positive leading coefficient \(2(n-1) + (n-1)^2\), this stationary point is a minimum. So \[T_{c^*} = \frac{1}{n+1} \sum_{i=1}^n (Y_i - \bar Y)^2\] has minimum mean squared error among all estimators in this class, for any value of \(\sigma^2\).
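
The minimisation above can be confirmed numerically: the sketch below (NumPy assumed; the values of \(n\) and \(\sigma^2\) are arbitrary) evaluates the closed-form \(\text{MSE}(T_c; \sigma^2)\) on a fine grid of \(c\) and compares the three natural choices \(c = 1/(n-1)\), \(1/n\) and \(1/(n+1)\).

```python
import numpy as np

n, sigma2 = 10, 4.0                  # arbitrary illustrative values

def mse(c, n, sigma2):
    # MSE(T_c; sigma^2) = sigma^4 [ 2 c^2 (n-1) + (c(n-1) - 1)^2 ]
    return sigma2 ** 2 * (2 * c ** 2 * (n - 1) + (c * (n - 1) - 1) ** 2)

grid = np.linspace(0.01, 0.3, 10_000)
print("grid minimiser:", grid[np.argmin(mse(grid, n, sigma2))])
print("1/(n+1)       :", 1 / (n + 1))

# compare S^2 (c = 1/(n-1)), the divide-by-n version, and the minimum-MSE version
for c in (1 / (n - 1), 1 / n, 1 / (n + 1)):
    print(f"c = {c:.4f}   MSE = {mse(c, n, sigma2):.4f}")
```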

Once we have come up with an estimator, we can check whether it has good properties, such as consistency or low mean squared error. However, it is not yet clear how we should go about finding an estimator for any given model, and it would be useful to have some general methods that will produce estimators. We will consider two general recipes for finding estimators: the method of moments and maximum likelihood estimation.

10.5 Method of moments estimation

This approach essentially formalises an argument we have met before – “if I have data from a population with mean \(\mu\), it is natural to estimate \(\mu\) by the corresponding sample mean \(\bar Y\)”.

Suppose we have a sample of independent and identically distributed random variables \(Y_1, Y_2, \ldots,Y_n\), whose probability function or probability density function depends on \(k\) unknown parameters \((\theta_1, \theta_2, \ldots, \theta_k)\). Then we may write the \(r\)th population moment about the origin as a function of the parameters: \[\mu_r^\prime = \mu_r^\prime(\theta_1, \theta_2, \ldots, \theta_k), \quad \text{for $r = 1, 2, 3, \ldots$.}\] We write \[m_r^\prime = \frac{1}{n} \sum_{i=1}^n Y_i^r\] for the \(r\)th sample moment, for \(r = 1, 2, 3, \ldots\).

Then (assuming each of the first \(k\) moments involves at least one parameter) we find values \(\tilde \theta_1, \ldots, \tilde \theta_k\) such that the first \(k\) sample moments match the first \(k\) population moments: \[m_r^\prime = \mu_r^\prime(\tilde \theta_1, \tilde \theta_2, \ldots, \tilde \theta_k) \quad \text{for } r = 1, 2, \ldots, k.\] We call \(\tilde \theta_1, \ldots, \tilde \theta_k\) the method of moments estimators of \(\theta_1, \ldots, \theta_k\).

Sometimes one of the first \(k\) expressions does not involve any parameters, in which case we need to add an extra simultaneous equation \[m_{k+1}^\prime = \mu_{k+1}^\prime(\tilde \theta_1, \tilde \theta_2, \ldots, \tilde \theta_k)\] in order to be able to solve the system of simultaneous equations for the \(k\) unknowns \(\tilde \theta_1, \ldots, \tilde \theta_k\). We always need at least \(k\) simultaneous equations to solve for the \(k\) unknowns: if any of these equations provides no new information, we match higher-order population and sample moments to generate more equations, until the system can be solved.

This will become clearer with some examples. We first begin with some simple one-parameter examples to illustrate the method.

Example 10.6 (Bernoulli) Suppose \(Y_i \sim \text{Bernoulli}(\theta)\). We have \(\mu_1^\prime = \mu_1^\prime(\theta) = \theta\) and \(m_1^\prime = \frac{1}{n}\sum_{i=1}^n Y_i = \bar Y\). So the method of moments estimator is \(\tilde \theta = \bar Y\).
Example 10.7 (Normal mean) Suppose \(Y_i \sim N(\theta, 1)\). We have \(\mu_1^\prime = \mu_1^\prime(\theta) = \theta\) and \(m_1^\prime = \frac{1}{n}\sum_{i=1}^n Y_i = \bar Y\). So the method of moments estimator is \(\tilde \theta = \bar Y\).
Example 10.8 (Exponential) Suppose \(Y_i \sim \text{exponential}(\theta)\). We have \(\mu_1^\prime = \mu_1^\prime(\theta) = 1/\theta\) and \(m_1^\prime = \frac{1}{n}\sum_{i=1}^n Y_i = \bar Y\). So the method of moments estimator is \(\tilde \theta = 1/ \bar Y\).
Example 10.9 (Normal variance) Suppose \(Y_i \sim N(0, \theta)\). We know that \(\mu_1^\prime = \mu_1^\prime(\theta) = 0\), which does not involve the parameter of interest. But \(\mu_2^\prime = \mu_2^\prime(\theta) = \theta\), and \(m_2^\prime = \frac{1}{n}\sum_{i=1}^n Y_i^2,\) so \(\tilde \theta = \frac{1}{n} \sum_{i=1}^n Y_i^2.\)

We can also use method of moments estimation for models with more than one unknown parameter.

Example 10.10 (Normal mean and variance) Suppose \(Y_i \sim N(\theta_1, \theta_2)\) (i.e. \(\theta_1 = \mu\), \(\theta_2 = \sigma^2\) in the usual notation). The first two population moments are \(\mu_1^\prime(\theta_1, \theta_2) = \theta_1\) and \[\mu_2^\prime(\theta_1, \theta_2) = \text{Var}(Y) + [E(Y)]^2 = \theta_2 + \theta_1^2.\] The first two sample moments are \(m_1^\prime = \bar Y\) and \(m_2^\prime = \frac{1}{n}\sum_{i=1}^n Y_i^2\). So we choose \(\tilde \theta_1\) and \(\tilde \theta_2\) to solve \(\mu_1^\prime(\tilde \theta_1, \tilde \theta_2) = m_1^\prime\) and \(\mu_2^\prime(\tilde \theta_1, \tilde \theta_2) = m_2^\prime\). So \[\tilde \theta_1 = \bar Y \quad \text{and} \quad \tilde \theta_2 + \tilde \theta_1^2 = \frac{1}{n}\sum_{i=1}^n Y_i^2,\] so \[\tilde \theta_2 = \frac{1}{n}\sum_{i=1}^n Y_i^2 - \bar Y^2.\]

Example 10.11 Suppose \(Y_i \sim \text{gamma}(\theta_1, \theta_2)\). In this notation the shape parameter is \(\alpha = \theta_1\) and the rate parameter is \(\beta = \theta_2\).

By Proposition 6.3, we have \[\mu_1^\prime(\theta_1, \theta_2) = \frac{\theta_1}{\theta_2}\] and \[\begin{align*} \mu_2^\prime(\theta_1, \theta_2) &= \text{Var}(Y) + [E(Y)]^2 \\ &= \frac{\theta_1}{\theta_2^2} + \frac{\theta_1^2}{\theta_2^2} \\ &= \frac{\theta_1(1 + \theta_1)}{\theta_2^2}. \end{align*}\]

The first two sample moments are \(m_1^\prime = \bar Y\) and \(m_2^\prime = \frac{1}{n}\sum_{i=1}^n Y_i^2\).

So we choose \(\tilde \theta_1\) and \(\tilde \theta_2\) to solve \(\mu_1^\prime(\tilde \theta_1, \tilde \theta_2) = m_1^\prime\) and \(\mu_2^\prime(\tilde \theta_1, \tilde \theta_2) = m_2^\prime\). So \[\frac{\tilde \theta_1}{\tilde \theta_2} = \bar Y \quad \text{and} \quad \frac{\tilde \theta_1(1 + \tilde \theta_1)}{\tilde \theta_2^2} = \frac{1}{n}\sum_{i=1}^n Y_i^2.\] So \(\tilde \theta_1 = \bar Y \tilde \theta_2\), and \[\frac{\bar Y \tilde \theta_2 (1 + \bar Y \tilde \theta_2)}{\tilde \theta_2^2} = \frac{1}{n}\sum_{i=1}^n Y_i^2,\] or \[\bar Y \tilde{\theta}_2^{-1} + {\bar Y}^2 = \frac{1}{n}\sum_{i=1}^n Y_i^2.\] Rearranging, we get \[\tilde \theta_2 = \frac{\bar Y}{\frac{1}{n}\sum_{i=1}^n Y_i^2 - {\bar Y}^2},\] so \[\tilde \theta_1 = \bar Y \tilde \theta_2 = \frac{\bar Y^2}{\frac{1}{n}\sum_{i=1}^n Y_i^2 - {\bar Y}^2}.\]
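
A short numerical illustration of these estimators (NumPy assumed; the "true" shape and rate values are arbitrary): simulate a large gamma sample and compare \(\tilde \theta_1\) and \(\tilde \theta_2\) with the values used to generate the data. Note that NumPy's gamma generator is parametrised by shape and scale \(= 1/\text{rate}\).

```python
import numpy as np

rng = np.random.default_rng(4)
shape, rate, n = 3.0, 2.0, 100_000   # arbitrary "true" values of theta_1, theta_2

y = rng.gamma(shape, scale=1 / rate, size=n)

m1 = y.mean()                        # first sample moment
m2 = np.mean(y ** 2)                 # second sample moment

theta2_tilde = m1 / (m2 - m1 ** 2)   # method of moments estimate of the rate
theta1_tilde = m1 * theta2_tilde     # method of moments estimate of the shape

print("theta_1 (shape): true", shape, " estimate", theta1_tilde)
print("theta_2 (rate) : true", rate, " estimate", theta2_tilde)
```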

10.6 Maximum likelihood estimation

Maximum likelihood estimation is a versatile method – it is the standard method in modern (frequentist) statistics – and it can be shown (see MATH3044) that maximum likelihood estimators (MLEs) have some nice optimality properties.

Maximum likelihood estimation can be applied in complex situations, but in this module we will stick to the situation where \(Y_1, Y_2, \ldots, Y_n\) are independent and identically distributed random variables.

The likelihood function is a function of the unknown parameters \(\theta\), which gives the joint probability (or joint probability density) of seeing the observed data \(y_1, \ldots y_n\), assuming the data were generated from the model with parameter value \(\theta\):
Definition 10.4 Suppose \(Y_1, Y_2, \ldots, Y_n\) are independent and identically distributed random variables, with distribution depending on a vector of unknown parameters \(\theta\). The likelihood function is \[L(\theta; y_1, \ldots, y_n) = f(y_1 ; \theta) \times f(y_2 ; \theta) \times \ldots \times f(y_n ; \theta),\] if each \(Y_i\) has a continuous distribution, with probability density function \(f(y; \theta)\), or \[L(\theta; y_1, \ldots, y_n) = p(y_1 ; \theta) \times p(y_2 ; \theta) \times \ldots \times p(y_n ; \theta),\] if each \(Y_i\) has a discrete distribution, with probability function \(p(y; \theta)\).

The values of the observations \(y_1, \ldots, y_n\) have been observed, so these are known, and \(L(\cdot)\) may be regarded as a function of the unknown \(\theta\).

If \(L(\theta_1 ; y_1 , \ldots, y_n) > L(\theta_2 ; y_1 , \ldots, y_n )\), we should prefer \(\theta_1\) to \(\theta_2\), because the observed data are more likely to occur under \(\theta_1\) than under \(\theta_2\). It follows that we should maximise the likelihood to find the value of \(\theta\) which is most likely to have given rise to the data that were actually observed.

The maximum likelihood estimate \(\hat \theta = \hat \theta(y_1, \ldots, y_n)\) of \(\theta\) is the value of \(\theta\) which maximises the likelihood: \[\hat \theta = \arg \max_{\theta}L(\theta; y_1, \ldots, y_n).\] The corresponding maximum likelihood estimator is \(\hat \theta = \hat \theta(Y_1, \ldots, Y_n)\).

Since we assume that the \(Y_i\) are independent and identically distributed, the likelihood is just a product of pdf or probability function terms. Maximising a sum is usually easier than maximising a product, so we often work with the log-likelihood \[\log L(\theta; y_1, \ldots, y_n) = \sum_{i=1}^n \log f(y_i; \theta).\]

The value of \(\theta\) which maximises \(L(\cdot)\) is the same as that which maximises \(\log L(\cdot)\), as \(L(\theta_1 ) > L(\theta_2)\) if and only if \(\log L(\theta_1) > \log L(\theta_2)\). So we may find the MLE by maximising \(\log L(\cdot)\): \[\hat \theta = \arg \max_{\theta}\left\{\log L(\theta; y_1, \ldots, y_n)\right\}.\]

Example 10.12 (Bernoulli) Suppose \(Y_i \sim \text{Bernoulli}(\theta)\). The likelihood is \[L(\theta; y_1, \ldots, y_n) = \prod_{i=1}^n p(y_i; \theta) = \prod_{i=1}^n \theta^{y_i} (1 - \theta)^{1- y_i}.\] Hence the log-likelihood is \[\begin{align*} \log L(\theta; y_1, \ldots, y_n) &= \log\left(\prod_{i=1}^n \theta^{y_i} (1 - \theta)^{1- y_i}\right)\\ &= \sum_{i=1}^n \log\left(\theta^{y_i} (1 - \theta)^{1- y_i}\right) \\ &= \sum_{i=1}^n \log \theta^{y_i} + \log(1 - \theta)^{1-y_i} \\ &= \sum_{i=1}^n y_i \log \theta + (1 - y_i) \log (1 - \theta) \\ &= \left(\sum_{i=1}^n y_i\right) \log \theta + \left(n - \sum_{i=1}^n y_i\right) \log (1 - \theta). \end{align*}\]

We now maximise \(\log L(\cdot)\) as a function of \(\theta\). We attempt to find a stationary point of \(\log L(\cdot)\) by differentiating and setting to zero. We have \[\frac{d}{d \theta}\log L(\theta; y_1, \ldots, y_n) = \frac{\sum_{i=1}^n y_i}{\theta} - \frac{n - \sum_{i=1}^n y_i}{1 - \theta},\] and we want to find \(\hat \theta\) such that \[\frac{d}{d \theta}\log L(\theta; y_1, \ldots, y_n)|_{\theta = \hat \theta} = 0.\] We have \[\frac{\sum_{i=1}^n y_i}{\hat \theta} - \frac{n - \sum_{i=1}^n y_i}{1 - \hat \theta} = 0,\] and rearranging gives that \[\hat \theta = \frac{\sum_{i=1}^n y_i}{n} = \bar y\] is a stationary point of the log-likelihood, provided that \(\sum_{i=1}^n y_i > 0\) and \(\sum_{i=1}^n y_i < n\). If \(\sum_{i=1}^n y_i = 0\), then \(\log L(\theta) = n \log (1 - \theta)\), which is decreasing with \(\theta\), so \(\hat \theta = 0\) is the MLE. Similarly, if \(\sum_{i=1}^n y_i = n\), then \(\log L(\theta) = n \log \theta\), which is increasing with \(\theta\), so \(\hat \theta = 1\) is the MLE.

If \(0 < \sum_{i=1}^n y_i < n\), we have shown that \(\hat \theta = \bar y\) is a stationary point of the log-likelihood, but we still need to show that it is a maximum. To do this, we need to show that \[\frac{d^2}{d\theta^2} \log L(\theta;y_1, \ldots, y_n)|_{\theta = \hat \theta} < 0.\] We have \[\frac{d^2}{d\theta^2} \log L(\theta;y_1, \ldots, y_n) = -\frac{\sum_{i=1}^n y_i}{\theta^2} - \frac{n - \sum_{i=1}^n y_i}{(1 - \theta)^2},\] so \[\frac{d^2}{d\theta^2} \log L(\theta;y_1, \ldots, y_n)|_{\theta = \hat \theta} = -\frac{\sum_{i=1}^n y_i}{\hat \theta^2} - \frac{n - \sum_{i=1}^n y_i}{(1 - \hat \theta)^2}.\] We know \(\hat \theta = \bar y\), so \(0 < \hat \theta < 1\), as we are considering the case \(0 < \sum_{i=1}^n y_i < n\). So \(\hat \theta>0\) and \(1 - \hat \theta > 0\), so \[\frac{d^2}{d\theta^2} \log L(\theta;y_1, \ldots, y_n)|_{\theta = \hat \theta} < 0,\] and \(\hat \theta = \bar y\) is the maximum likelihood estimate.

In this case, the MLE \(\hat \theta\) is the same as the method of moments estimate \(\tilde \theta\) we found in Example 10.6. We have already studied the properties of this estimator, so we know that \(\hat \theta\) is an unbiased (see Example 10.1) and consistent (see Example 10.3) estimator of \(\theta\).

Example 10.13 (Exponential) Suppose the \(Y_i \sim \text{exponential}(\theta)\), so that each \(Y_i\) has pdf \(f(y; \theta) = \theta e^{- \theta y}\) for \(y > 0\).

The likelihood is \[L(\theta; y_1, \ldots, y_n) = \prod_{i=1}^n f(y_i; \theta) = \prod_{i=1}^n \theta e^{-\theta y_i},\] since we know \(y_i > 0\) for all \(i\).

The log-likelihood is \[\begin{align*} \log L(\theta; y_1, \ldots, y_n) &= \sum_{i=1}^n \log\left(\theta e^{-\theta y_i}\right) \\ &= \sum_{i=1}^n \left\{\log \theta + \log\left(e^{-\theta y_i}\right) \right\} \\ &= n \log \theta - \theta \sum_{i=1}^n y_i. \end{align*}\]

Differentiating, we have \[\frac{d}{d \theta} \log L(\theta; y_1, \ldots, y_n) = \frac{n}{\theta} - \sum_{i=1}^n y_i,\] so a stationary point of the log-likelihood \(\hat \theta\) satisfies \[\frac{n}{\hat \theta} - \sum_{i=1}^n y_i = 0.\] Rearranging this, we get \[\hat \theta = \frac{n}{\sum_{i=1}^n y_i} = \frac{1}{\bar y}.\]

Differentiating again, we have \[\frac{d^2}{d \theta^2} \log L(\theta; y_1, \ldots, y_n) = - \frac{n}{\theta^2} < 0 \quad \text{for all $\theta > 0$},\] so \(\hat \theta\) is a maximum of \(\log L(\cdot)\), and hence \(\hat \theta = 1/\bar y\) is the MLE.

In this case, the MLE \(\hat \theta\) is the same as the method of moments estimate \(\tilde \theta\) we found in Example 10.8.
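
Although the MLE has the closed form \(1/\bar y\) here, the example is a convenient template for the general numerical recipe: minimise the negative log-likelihood. The sketch below (NumPy and SciPy assumed; the simulated data and the helper name neg_log_lik are purely illustrative) does this with scipy.optimize.minimize_scalar and compares the answer with \(1/\bar y\).

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
theta_true, n = 2.5, 200                       # arbitrary illustrative values
y = rng.exponential(scale=1 / theta_true, size=n)

def neg_log_lik(theta):
    # -log L(theta) = -( n log(theta) - theta * sum(y_i) ), valid for theta > 0
    return -(n * np.log(theta) - theta * y.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100.0), method="bounded")

print("numerical MLE     :", res.x)
print("closed form 1/ybar:", 1 / y.mean())
```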

Example 10.14 (Gamma) Suppose \(Y_i \sim \text{gamma}(\theta, 1)\).

To find the method of moments estimate of \(\theta\), we have \(\mu_1^\prime = E(Y_i) = \theta\) and \(m_1^\prime = \bar Y\), so \(\tilde \theta = \bar Y\).

To find the maximum likelihood estimate, we have \[f(y; \theta) = \frac{1}{\Gamma(\theta)} y^{\theta - 1} e^{-y}, \quad y > 0,\] so the likelihood is \[L(\theta; y_1, \ldots, y_n) = \prod_{i=1}^n \frac{1}{\Gamma(\theta)} y_i^{\theta - 1} e^{-y_i},\] since all \(y_i > 0\).

The log-likelihood is \[\begin{align*} \log L(\theta; y_1, \ldots, y_n) &= \sum_{i=1}^n \log \left(\frac{1}{\Gamma(\theta)} y_i^{\theta - 1} e^{-y_i} \right) \\ &= \sum_{i=1}^n \log\left(\frac{1}{\Gamma(\theta)}\right) + (\theta - 1) \sum_{i=1}^n \log y_i - \sum_{i=1}^n y_i. \end{align*}\]

Differentiating, we have \[\frac{d}{d\theta} \log L(\theta; y_1, \ldots, y_n) = -n \psi(\theta) + \sum_{i=1}^n \log y_i,\] where \(\psi(\theta) = \frac{d}{d\theta} \log \Gamma(\theta)\) is called the digamma function.

So \(\hat \theta\) satisfies \[-n \psi(\hat \theta) + \sum_{i=1}^n \log y_i = 0,\] or \[\begin{equation} \psi(\hat \theta) = \frac{\sum_{i=1}^n \log y_i}{n} \tag{10.1} \end{equation}\] which has no closed-form solution for \(\hat \theta\). We could use numerical methods to find a root of (10.1), or to directly maximise the log-likelihood function: see MATH3044 (Statistical Inference) for details. It is very common in practice that it is not possible to write down a closed-form expression for the MLE, so we must use numerical methods.
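
As a sketch of one such numerical approach (NumPy and SciPy assumed; the simulated data are purely illustrative), we can solve (10.1) using scipy.special.digamma for \(\psi\) and a bracketing root-finder such as scipy.optimize.brentq:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma

rng = np.random.default_rng(6)
theta_true, n = 4.0, 500                        # arbitrary illustrative values
y = rng.gamma(theta_true, scale=1.0, size=n)    # gamma(theta, 1), as in Example 10.14

target = np.mean(np.log(y))                     # right-hand side of (10.1)

# the digamma function is increasing, so bracket the root widely and solve
theta_hat = brentq(lambda t: digamma(t) - target, 1e-6, 1e6)

print("maximum likelihood estimate:", theta_hat)
print("method of moments estimate :", y.mean())
```
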
Example 10.15 (Normal mean and variance) Suppose \(Y_i \sim N(\mu, \sigma^2)\), where \(\mu\) and \(\sigma^2\) are both unknown parameters. We have \[f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left\{-\frac{1}{2\sigma^2} (y - \mu)^2\right\},\] so the likelihood is \[L(\mu, \sigma^2; y_1, \ldots, y_n) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left\{-\frac{1}{2\sigma^2} (y_i - \mu)^2\right\}\] and the log-likelihood is \[\begin{align*} \log L(\mu, \sigma^2; y_1, \ldots, y_n) &= \sum_{i=1}^n \log\left[ \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left\{-\frac{1}{2\sigma^2} (y_i - \mu)^2\right\}\right] \\ &= -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu)^2. \end{align*}\] Differentiating, we obtain \[\frac{\partial}{\partial \mu} \log L(\mu, \sigma^2; y_1, \ldots, y_n) = -\frac{1}{2 \sigma^2} \sum_{i=1}^n \left\{- 2(y_i - \mu)\right\} = \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \mu)\] and \[\frac{\partial}{\partial \sigma^2} \log L(\mu, \sigma^2; y_1, \ldots, y_n) = -\frac{n}{2 \sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (y_i - \mu)^2.\]

So stationary points \(\hat \mu\) and \(\hat \sigma^2\) solve \[\frac{1}{\hat \sigma^2} \sum_{i=1}^n (y_i - \hat \mu) = 0\] and \[ -\frac{n}{2 \hat \sigma^2} + \frac{1}{2(\hat \sigma^2)^2} \sum_{i=1}^n (y_i - \hat \mu)^2 = 0.\] From the first equation, we obtain \(\hat \mu = \bar y\), so \[ -\frac{n}{2 \hat \sigma^2} + \frac{1}{2(\hat \sigma^2)^2} \sum_{i=1}^n (y_i - \bar y)^2 = 0,\] and multiplying through by \(2 (\hat \sigma^2)^2\) gives \[ -n \hat \sigma^2 + \sum_{i=1}^n (y_i - \bar y)^2 = 0,\] so \(\hat \sigma^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \bar y)^2\). In the exercises, you can check that \((\hat \mu, \hat \sigma^2)\) is a maximum of the log-likelihood, so it is the MLE.
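
As a final sanity check (NumPy and SciPy assumed; the simulated data and starting values are illustrative), the sketch below maximises the normal log-likelihood numerically over \((\mu, \sigma^2)\) and compares the result with the closed forms \(\bar y\) and \(\frac{1}{n}\sum_{i=1}^n (y_i - \bar y)^2\).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
mu_true, sigma2_true, n = 1.0, 2.0, 500         # arbitrary illustrative values
y = rng.normal(mu_true, np.sqrt(sigma2_true), size=n)

def neg_log_lik(params):
    mu, log_sigma2 = params                     # optimise log(sigma^2) so sigma^2 > 0
    sigma2 = np.exp(log_sigma2)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((y - mu) ** 2) / (2 * sigma2)

res = minimize(neg_log_lik, x0=[0.0, 0.0])

print("numerical (mu, sigma^2):", res.x[0], np.exp(res.x[1]))
print("closed form            :", y.mean(), np.mean((y - y.mean()) ** 2))
```

Optimising over \(\log \sigma^2\) rather than \(\sigma^2\) is simply a convenience that keeps the variance positive during the search; it does not change the maximiser.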