$ \newcommand{\bm}[1]{\boldsymbol{\mathbf{#1}}} $

Chapter 12 Bayesian inference

12.1 Frequentist and Bayesian inference

In frequentist inference, uncertainty about a parameter value is usually expressed through a confidence interval for that parameter: an interval $[L(\bm y), U(\bm y)]$ such that \[P(L(\bm Y) \leq \theta \leq U(\bm Y)) = 1- \alpha.\] We treat $\theta$ as fixed (but unknown), and the probabilities are in terms of the random variables $\bm Y = (Y_1, \ldots Y_n)^T$. For instance, if we find $(1.1, 2.3)$ is a $95 \%$ confidence interval for $\theta$, this does not mean that $P(1.1 \leq \theta \leq 2.3) = 0.95$, as we treat $\theta$ as a fixed value.

By contrast, in Bayesian inference, we treat $\theta$ as a random variable, and construct a probability distribution which summarises our belief about the likely value of a parameter $\theta$. Our belief about which values of $\theta$ are likely (the “posterior” distribution for $\theta$) is influenced by two factors: how likely the observed data $y_1, \ldots, y_n$ were to be generated using that value of $\theta$ (the likelihood $L(\theta; y_1, \ldots, y_n)$); and how likely we thought each value $\theta$ was before conducting the experiment (the “prior” distribution for $\theta$).

We will give a very brief overview of Bayesian inference: see MATH3044 for more details.

12.2 Prior and posterior distributions

The first step of Bayesian inference is to express our beliefs about $\theta$ before conducting the experiment. We specify these beliefs through a probability distribution, which is called the prior distribution. Typically $\theta$ is a continuous random variable, so we specify the distribution through a density $\pi(\theta)$. This idea will become clearer later on, when we consider an example.

The posterior distribution is the probability distribution for a parameter $\theta$, conditional on the event $\{Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n\}.$

To find this distribution, we will use Bayes’ Theorem, which you have already seen in MATH1024:

Theorem 12.1 (Bayes’ Theorem) For two events $A$ and $B$, such that $P(B) > 0$, \[P(A | B) = \frac{P(B | A) P(A)}{P(B)}\]

Proof. By definition, \[P(B|A) = \frac{P(A \cap B)}{P(A)}, \quad \text{if $P(A) > 0$},\] so $P(A \cap B) = P(B | A) P(A)$, which holds even if $P(A) = 0$.

So \[\begin{align*} P(A|B) &= \frac{P(A \cap B)}{P(B)}, \quad \text{since $P(B) > 0$} \\ &= \frac{P(B | A) P(A)}{P(B)} \end{align*}\] as required.

We use a continuous version of Bayes’ Theorem to construct the probability distribution for the parameters $\theta$, given $Y_1 = y_1, \ldots, Y_n = y_n$.

If $Y_1, \ldots, Y_n$ have discrete distribution, the probability density for $\theta$, given the event $B = \{Y_1 = y_1, \ldots, Y_n = y_n\}$ is \[\begin{align*} \pi(\theta| y_1, \ldots, y_n) &= \frac{P(Y_1 = y_1, \ldots, Y_n = y_n| \theta) \pi(\theta)}{P(Y_1 = y_1, \ldots, Y_n = y_n)} \\ &= \frac{L(\theta; y_1, \ldots, y_n) \pi(\theta)}{P(Y_1 = y_1, \ldots, Y_n = y_n)} \end{align*}\]

where \[P(Y_1 = y_1, \ldots, Y_n = y_n) = \int L(\theta; y_1, \ldots, y_n) \pi(\theta) d \theta,\] because \[\int \pi(\theta|y_1, \ldots, y_n) d\theta = \frac{\int L(\theta; y_1, \ldots, y_n) \pi(\theta) d\theta}{P(Y_1 = y_1, \ldots, Y_n = y_n)} = 1,\] as $\pi(\theta|y_1, \ldots, y_n)$ is a probability density function.

The denominator $P(Y_1 = y_1, \ldots Y_n = y_n)$ does not depend on $\theta$, so usually we write \[\pi(\theta| y_1, \ldots, y_n) \propto L(\theta; y_1, \ldots, y_n) \pi(\theta),\] and if necessary find the constant of proportionality to make sure that $\pi(\theta|y_1, \ldots y_n)$ integrates to $1$. Sometimes we recognise the pdf of a known distribution, and do not need to compute the constant.

If $Y_1, \ldots Y_n$ have continuous distribution, with p.d.f. $f(y; \theta)$, the posterior density has the same form \[\pi(\theta | y_1, \ldots, y_n) \propto L(\theta; y_1, \ldots, y_n) \pi(\theta). \]

Example 12.1 (Bernoulli) Suppose that $Y_1, \ldots Y_n$ are independent and identically distributed, with each $Y_i \sim \text{Bernoulli}(\theta)$ where $\theta$ is an unknown parameter.

To be able to conduct Bayesian inference, we need to write down a prior distribution for $\theta$. In this case, it is convenient to choose a beta distribution $\theta \sim \text{beta}(m_0, n_0)$ for the prior, so that \[\pi(\theta) \propto \theta^{m_0-1}(1 - \theta)^{n_0 - 1}.\] We will see that if we make this choice, then the posterior distribution will also be a beta distribution.

The likelihood function for $\theta$ is \[L(\theta; y_1, \ldots, y_n) \prod_{i=1}^n \theta^{y_i} (1 - \theta)^{1- y_i} = \theta^{\sum_{i=1}^n y_i} (1 - \theta)^{n - \sum_{i=1}^n y_i}.\] The posterior distribution is \[\begin{align*} \pi(\theta|y_1, \ldots, y_n) &\propto L(\theta; y_1, \ldots, y_n) \pi(\theta) \\ &\propto \theta^{\sum_{i=1}^n y_i} (1 - \theta)^{n - \sum_{i=1}^n y_i} \theta^{m_0-1}(1 - \theta)^{n_0 - 1} \\ &= \theta^{\sum_{i=1}^n y_i + n_0 - 1} (1 - \theta)^{n - \sum_{i=1}^n y_i + n_0 - 1}, \end{align*}\]

which is proportional to the pdf of a $\text{beta}\left(\sum_{i=1}^n y_i + n_0, n - \sum_{i=1}^n y_i + n_0\right)$ distribution, so this is the posterior distribution. Writing $s = \sum_{i=1}^n y_i$ for the number of observed “successes”, and $f = n - \sum_{i=1}^n y_i$ for the number of observed “failures”, we have \[\theta|y_1, \ldots, y_n \sim \text{beta}(s + n_0, f + m_0).\] This means that we may interpret the prior distribution as equivalent to the information we would gain by seeing $m_0$ successes and $n_0$ failures.

In reality, in order to choose a sensible prior we need to know more information about the type of process we are modelling. For instance, suppose that our data are the outcomes of $n$ tennis matches between two friends, Alex and Bob, where $Y_i = 1$ denotes a victory for Alex, and $Y_i = 0$ a victory for Bob. Suppose that Alex is 25 and healthy, whereas Bob is 52 and slightly overweight. Before the matches are played, you have some prior belief about $\theta$, the probability that Alex will win a game of tennis against Bob. The prior distribution reflects your personal beliefs: your prior may well look quite different to another person’s prior. In this example, we might suppose $\theta \sim \text{beta}(3, 2)$. If we then observe two matches, both won by Bob ($s = 0$, $f = 2$), the posterior distribution will be $\theta|\bm Y \sim \text{beta}(3, 4)$. In this case, the smaller values of $\theta$ are given a higher probability in the posterior than in the prior, because of the what we have learnt from the data:

curve(dbeta(x, 3, 4), 0, 1, ylab = "pi(theta)", xlab = "theta")
curve(dbeta(x, 3, 2), 0, 1, lty = 2, add = TRUE)
legend("topleft", lty = c(2, 1), c("Prior", "Posterior"))

12.3 The posterior predictive distribution

Suppose we wanted to predict the outcome of a new random variables $Y_{n+1}$, assumed to have the same distribution as $Y_1, \ldots, Y_n$.

If the $Y_i$ are discrete random variables, the posterior predictive distribution is \[P(Y_{n+1} = y | y_1, \ldots, y_n) = \int_\theta p(y; \theta) \pi(\theta | y_1, \ldots, y_n) d \theta. \]

Example 12.2 (Bernoulli) Continuing Example 12.1, with $Y_i \sim \text{Bernoulli}(\theta)$ and prior $\theta \sim \text{beta}(m_0, n_0)$, recall that $\theta|y_1, \ldots, y_n \sim \text{beta}(s + m_0, f + n_0),$ where $s = \sum_{i=1}^n y_i$ and $f = n - \sum_{i=1}^n y_i$. We have \[\pi(\theta|y_1, \ldots, y_n) = \frac{1}{B(s + m_0, f + n_0)} \theta^{s + m_0 - 1} (1 - \theta)^{f + n_0 - 1}, \quad 0 < \theta < 1,\] and \[p(y; \theta) = \theta^y (1 - \theta)^{1 - y}, \quad y \in \{0, 1\}.\] We have \[\begin{align*} P(Y_{n+1} = 1 | y_1 \ldots y_n) &= \int_0^1 \theta \frac{1}{B(s + m_0, f + n_0)} \theta^{s + m_0 - 1} (1 - \theta)^{f + n_0 - 1} d\theta \\ &= \frac{1}{B(s + m_0, f + n_0)} \int_0^1 \theta^{s + m_0} (1 - \theta)^{f + n_0 - 1} d \theta \\ &= \frac{B(s + m_0 + 1, f + n_0)}{B(s + m_0, f + n_0)} \int_0^1 h(\theta) d \theta \\ & \quad \quad \text{where $h(\theta)$ the $\text{beta}(s + m_0 + 1, f + n_0)$ pdf} \\ &= \frac{\Gamma(s + m_0 + 1) \Gamma(f + n_0)}{\Gamma(s + m_0 + 1 + f + n_0)} \cdot \frac{\Gamma(s + m_0 + f + n_0)}{\Gamma(s + m_0)\Gamma(f + n_0)} \cdot 1 \\ &= \frac{\Gamma(n + m_0 + n_0)}{\Gamma(n + m_0 + n_0 + 1)} \frac{\Gamma(s + m_0 + 1)}{\Gamma(s + m_0)} \\ &= \frac{s+m_0}{n + m_0 + n_0}. \end{align*}\] In our tennis match example, with $m_0 = 3$, $n_0 = 2$, after observing two matches both won by Bob ($s = 0$, $f = 2$), our predicted probability that Alex will win the next match is \[P(Y_3 = 1 | y_1, y_2) = \frac{0 + 3}{2 + 3 + 2}= \frac{3}{7}.\]