1.4 A Bayesian perspective on model selection
In a parametric model, the data \(y\) are assumed to be a realisation of \(Y\sim f(y;\theta)\), where \(\theta\in\Omega_\theta\).
Separate from the data, we have prior information about the parameter \(\theta\), summarised in a prior density \(\pi(\theta)\). The model for the data is \(f(y\mid\theta)\equiv f(y;\theta)\). The posterior density for \(\theta\) is given by Bayes’ theorem: \[\pi(\theta\mid y ) = \frac{\pi(\theta) f(y\mid\theta)}{\int \pi(\theta) f(y\mid\theta)\, d\theta}.\] Here \(\pi(\theta\mid y)\) contains all the information about \(\theta\), conditional on the observed data \(y\). If \(\theta=(\psi,\lambda)\), then inference for \(\psi\) is based on the marginal posterior density \[\pi(\psi\mid y) = \int \pi(\theta\mid y )\, d\lambda. \]
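For concreteness, here is a minimal numerical sketch of this updating using a grid approximation to the posterior; the Bernoulli likelihood, the Beta(2, 2) prior and the data are illustrative assumptions rather than anything from these notes.

```python
import numpy as np
from scipy import stats

# Grid approximation to Bayes' theorem for a scalar parameter theta.
# Illustrative assumptions: y are Bernoulli observations with success
# probability theta, and the prior pi(theta) is Beta(2, 2).
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])               # made-up data
theta = np.linspace(0.001, 0.999, 1000)              # grid over Omega_theta = (0, 1)
dtheta = theta[1] - theta[0]

prior = stats.beta.pdf(theta, 2, 2)                  # pi(theta)
lik = theta**y.sum() * (1 - theta)**(len(y) - y.sum())   # f(y | theta)

unnorm = prior * lik                                 # numerator pi(theta) f(y | theta)
posterior = unnorm / (unnorm.sum() * dtheta)         # divide by int pi(theta) f(y|theta) dtheta

post_mean = (theta * posterior).sum() * dtheta       # e.g. E(theta | y)
```

The same normalise-the-numerator pattern applies whenever the prior and likelihood can be evaluated over a grid covering \(\Omega_\theta\).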
Suppose we have \(M\) alternative models for the data, with respective parameters \(\theta_1\in\Omega_{\theta_1},\ldots, \theta_M\in\Omega_{\theta_M}\). Typically the parameter spaces \(\Omega_{\theta_m}\) have different dimensions.
We enlarge the parameter space to give an encompassing model with parameter \[\theta=(m,\theta_m)\in \Omega = \bigcup_{m=1}^M \{m\}\times \Omega_{\theta_m}.\] Thus we need priors \(\pi(\theta_m\mid m)\) for the parameters of each model, plus a prior \(\pi(m)\) giving the pre-data probability of each model. Overall, we have \[\pi(m,\theta_m) = \pi(\theta_m\mid m)\pi(m) = \pi_m(\theta_m)\pi_m,\] say.
Inference about model choice is based on the marginal posterior density \[\pi(m\mid y) = \frac{\int f(y\mid\theta_m)\pi_m(\theta_m)\pi_m\, d\theta_m} {\sum_{m'=1}^M \int f(y\mid\theta_{m'})\pi_{m'}(\theta_{m'})\pi_{m'}\, d\theta_{m'}} = \frac{\pi_m f(y\mid m)} {\sum_{m'=1}^M \pi_{m'} f(y\mid{m'})}.\]
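Once the marginal likelihoods \(f(y\mid m)\) are available, computing \(\pi(m\mid y)\) is just a normalisation. A small sketch with hypothetical numbers, working on the log scale for numerical stability:

```python
import numpy as np

# Posterior model probabilities pi(m | y) from log marginal likelihoods and
# prior model probabilities; all numbers here are hypothetical.
log_fy_m = np.array([-104.2, -101.7, -103.9])   # log f(y | m) for m = 1, 2, 3
prior_m = np.array([1/3, 1/3, 1/3])             # pi(m)

log_post = np.log(prior_m) + log_fy_m           # log of pi_m f(y | m)
log_post -= log_post.max()                      # guard against underflow
post_m = np.exp(log_post)
post_m /= post_m.sum()                          # pi(m | y); sums to 1
```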
We can write \[ \pi(m,\theta_m\mid y) =\pi(\theta_m\mid y, m)\pi(m\mid y),\] so Bayesian updating corresponds to \[\pi(\theta_m\mid m)\pi(m) \mapsto \pi(\theta_m\mid y, m)\pi(m\mid y) \] and for each model \(m=1,\ldots, M\) we need
- the posterior probability \(\pi(m\mid y)\), which involves the marginal likelihood \(f(y\mid m) = \int f(y\mid\theta_m,m)\pi(\theta_m\mid m)\, d\theta_m\); and
- the posterior density \(\pi(\theta_m\mid y, m)\).
If there are just two models, we can write \[\frac{\pi(1\mid y)}{\pi(2\mid y)} = \frac{\pi_1}{\pi_2} \frac{f(y\mid 1)}{f(y\mid 2)},\] so the posterior odds on model 1 equal the prior odds on model 1 multiplied by the Bayes factor \(B_{12}={f(y\mid 1)/ f(y\mid 2)}\).
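The integrals \(f(y\mid m)\) rarely have closed forms. One crude approximation, shown purely to make the definitions concrete, is "prior" Monte Carlo: draw \(\theta_m\) from its prior and average the likelihood. The sketch below assumes two toy models for normal data (the models, priors and data are all made up) and then forms the Bayes factor and posterior odds.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(0.3, 1.0, size=20)                    # made-up data

def log_marginal(loglik, prior_draws):
    """Monte Carlo estimate of log f(y | m) = log E_prior[ f(y | theta, m) ]."""
    ll = np.array([loglik(th) for th in prior_draws])
    return ll.max() + np.log(np.mean(np.exp(ll - ll.max())))   # log-mean-exp

# Model 1: y_i ~ N(mu, 1) with prior mu ~ N(0, 1).
loglik1 = lambda mu: stats.norm.logpdf(y, loc=mu, scale=1.0).sum()
log_f1 = log_marginal(loglik1, rng.normal(0.0, 1.0, size=5000))

# Model 2: y_i ~ N(0, 1), no free parameter, so f(y | 2) needs no integration.
log_f2 = stats.norm.logpdf(y, loc=0.0, scale=1.0).sum()

B12 = np.exp(log_f1 - log_f2)                        # Bayes factor f(y|1)/f(y|2)
posterior_odds = (0.5 / 0.5) * B12                   # prior odds pi_1/pi_2 times B12
```

This estimator can have very high variance, so it is offered only as an illustration of the definition of \(f(y\mid m)\).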
Suppose the prior for each \(\theta_m\) is \(N(0, \sigma^2 I_{d_m})\), where \(d_m=\dim(\theta_m)\). Then, dropping the \(m\) subscript for clarity, \[\begin{align*} f(y\mid m)&= \sigma^{-d} (2\pi)^{-d/2} \int f(y \mid m, \theta) \prod_r \exp \left\{-{\theta_{r}^2/(2\sigma^2)}\right\} d\theta_r\\ & \approx \sigma^{-d} (2\pi)^{-d/2} \int f(y \mid m, \theta) \prod_r d\theta_r, \end{align*}\] for a highly diffuse prior distribution (large \(\sigma^2\)), since each factor \(\exp\{-\theta_r^2/(2\sigma^2)\}\approx 1\) over the region where the likelihood is appreciable.
The Bayes factor for comparing the models is approximately \[\frac{f(y\mid 1)}{f(y\mid 2)}\approx \sigma^{d_2-d_1} g(y), \] where \(g(y)\) depends on the two likelihoods but is independent of \(\sigma^2\). Hence, whatever the data tell us about the relative merits of the two models, the Bayes factor in favour of the simpler model can be made arbitrarily large by increasing \(\sigma\).
This illustrates Lindley’s paradox, and implies that we must be careful in specifying prior dispersion parameters when comparing models.
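A quick numerical illustration of the paradox, under assumptions chosen only so that the marginal likelihoods are available in closed form (normal data with known variance; all numbers made up):

```python
import numpy as np
from scipy import stats

# Model 1: y_i ~ N(mu, 1) with prior mu ~ N(0, sigma^2);  Model 2: y_i ~ N(0, 1).
# Both marginal likelihoods follow from ybar alone, since ybar ~ N(mu, 1/n).
n, ybar = 50, 0.28                      # hypothetical sample size and sample mean

for sigma in [1.0, 10.0, 100.0, 1000.0]:
    # Marginal density of ybar: N(0, sigma^2 + 1/n) under model 1, N(0, 1/n) under model 2.
    f1 = stats.norm.pdf(ybar, 0.0, np.sqrt(sigma**2 + 1.0/n))
    f2 = stats.norm.pdf(ybar, 0.0, np.sqrt(1.0/n))
    print(f"sigma = {sigma:7.1f}   B21 = f2/f1 = {f2/f1:10.2f}")
```

The data do not change, yet the Bayes factor in favour of the simpler model grows roughly in proportion to \(\sigma\), matching the \(\sigma^{d_2-d_1}\) factor above.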
If a quantity \(Z\) has the same interpretation under all models, it may be necessary to allow for model uncertainty when making statements about it. In prediction, for example, each model may be merely a vehicle for producing a future value, and not of interest in itself.
The predictive distribution for \(Z\) may be written \[f(z\mid y) = \sum_{m=1}^M f(z\mid y, m)\,\pi(m\mid y),\] where \[\pi(m\mid y) = \frac{f(y\mid m)\pi(m)}{\sum_{m'=1}^M f(y\mid m')\pi(m')}.\]
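A sketch of this model-averaged prediction; the per-model predictive densities and the posterior model probabilities below are hypothetical placeholders for quantities computed as in the earlier sketches.

```python
import numpy as np
from scipy import stats

# Model-averaged predictive density f(z | y) = sum_m f(z | y, m) pi(m | y).
z = np.linspace(-4, 6, 500)

pred_given_m = [
    stats.norm.pdf(z, loc=0.2, scale=1.1),    # f(z | y, m = 1), illustrative
    stats.t.pdf(z, df=5, loc=0.5, scale=1.3), # f(z | y, m = 2), illustrative
]
post_m = np.array([0.7, 0.3])                 # pi(m | y), from the marginal likelihoods

f_z_given_y = sum(w * p for w, p in zip(post_m, pred_given_m))  # mixture density on the grid
```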