Calibrated Bayes for X-ray spectra

Johannes Buchner (MPE)

johannes.buchner.acad [ät] gmx.com

http://arxiv.org/abs/1402.0004

About me

models with 3-30 parameters, numerical
Author of
- APEMoST (PTMCMC)
- PyMultiNest
- BXA (+50 other projects on github)
Bayesian inference since 7 years
MCMC/nested sampling techniques
before: optical/radio, now: X-ray AGN (spectra, LF)

Overview

Introduction to model selection (graphically)
Methods for computing model selection
Calibrated Bayes, application to X-ray spectra
Current project: field inference

Bayes theorem

$A\smallfrown B = B\smallfrown A \\ p(A\smallfrown B) = p(B\smallfrown A) \\ p(A|B)p(B) = p(B|A)p(A) \\ p(A|B) = \frac{p(B|A)p(A)}{p(B)} \\ p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)}$

Is this interpretation allowed?

The likelihood function $p(D|\theta) = \cal{L}$

propensity/tendency for the process to produce a spectific outcome D?
In the long run: frequency of D occurring
Example: Gaussian process $$ \cal{L} \sim \exp\left\{ -\frac{1}{2}\left(\frac{d-\mu}{\sigma}\right)^{2}\right\} $$

The problem with likelihoods

Data: squeaky noise

Cat
- Meows
- walks around on stairs
- walks around on floor
Poltergeist
- makes squeaky noise

Can construct a likelihood that produces always and only the observed data $\cal{L} = 1$
But process not very probable - combine
Interpretation of the data in context of other information (Discussion & Results)

BI as update rule

$ p(\theta|D) \sim p(\theta) \prod_{i}\exp\left\{ -\frac{1}{2}\left(\frac{d_{i}-\mu}{\sigma}\right)^{2}\right\} $

More data: add terms to the right
First term: absence of data, starting point for update rule

Discrete introduction to BI:

evaluate likelihood at every point
- how prone is the process to produce the observed data
Compute relative importance: $$\cal{L} / \overline{{\cal L}}$$
Grab those that
make up 90% of $\sum{\cal L}$
$Z = \overline{{\cal L}}$ "evidence" is average likelihood

How to place grid points

Result is dependent on placement
Equal spacing in $\theta_1$ or in $\log \theta_1$.
Choice of spacing is called "prior"
coin = investment in computing there, put coins where it is worthwhile

Two models

Compare two parameter spaces by $$\left.\sum{{\cal L}}\right|_{M1} / \left.\sum{{\cal L}}\right|_{M2} $$
How many coins to put in M1, M2?
model prior

Parameter Estimation vs. Model Comparison

prior is measure, rule of averaging, deformation of space to "natural variables", investment in/weighting of sub-regions
most common priors: uniform, log-uniform.
model priors are relative size of spaces

Remove coins contributing less than 10%.
Under Bayesian inference, same problem:
- comparing bags of hypotheses

Prior transforms

$\int_a^b{f(x)g(x) dx} = \int_0^1{f(G^{-1}(u))~ du}$

where $G$ is the cumulative distribution of $g$ :

$G(x) = \int_{-\infty}^{x}{g(x') dx'}, G(a) = 0, G(b) = 1$

transformation through inverse of CDF
compresses unimportant regions
"native units": $P(\theta|D)\propto \int_0^1{P(D|u) du}$

Exploration

for dim≤3, grids are best
otherwise,
- sparse grids or
- randomized methods (MCMC, nested sampling)

MCMC Exploration

Missing ingredient: transition kernel
tune to the problems
Fraction of visits ~ converges to ~ probability of hypothesis
Where does chain spend 90% of its visits

Comparing models

With MCMC: RJMCMC
Does not work: too difficult to make a good transition kernel
How else? Just compute that integral for every space.

MCMC problems

Need transition kernel that can overcome separated maxima in likelihood
Solutions: PTMCMC
high-dim: acceptance can be very small, good region is tiny volume
Solutions: HMCMC

nested sampling idea

MCMC: only consider likelihood ratios. Integration by vertical slices
nested sampling: compute geometric size at various likelihood thresholds
orthogonal, unique re-ordering of volume by likelihood

$$\sum\underbrace{\text{Shrinkage}\times\text{Likelihood}}_{\text{Importance of shell}}=Z$$

nested sampling algorithm

Start with volume 1, draw randomly uniformly 200 points
remove one, volume shrinks by 1/200.
draw a new one excluding the removed volume
Unique ordering of space required: via likelihood
draw a new uniformly random point, with higher likelihood
(the crux of nested sampling)
Scanning up vertically, done at some point
- converges (flat at highest likelihood)

Missing ingredients

Insert tuned transition kernel into MCMC
Insert constrained drawing algorithm into nested sampling
General solutions: MultiNest, MCMC, HMCMC, Galilean, RadFriends
- Buchner (2014, submitted) - Statistical tests for nested sampling and the RadFriends algorithm

RadFriends / MultiNest

Use existing points to guess contour
Expand contour a little bit
Draw uniformly from contour
Reject points below likelihood threshold
RadFriends: Compute distance at which every point has a neighbor. Bootstrap (Leave out) for safety.
MultiNest clusters and uses ellipses

RadFriends / MultiNest

RadFriends: Compute distance at which every point has a neighbor. Bootstrap (Leave out) for safety.

MultiNest clusters and uses ellipses

What to do with Z

Z1, Z2 $$ \frac{p(M1|D)}{p(M2|D)}=\frac{Z1\cdot p(M1)}{Z2\cdot p(M2)} $$ $$ \frac{p(M_{1}|D)}{\sum p(M_{i}|D)}=\frac{Z_{1}\cdot p(M_{i})}{\sum_{i}Z_{i}\cdot p(M_{i})} $$
model priors: leave to reader or motivated by theory
Discard highly improbable model or marginalise
Does $\frac{p(M1|D)}{p(M2|D)} = 3/1$ mean M2 is correct in a quarter of the cases?

Method summary

Problems with

Multiple maxima
low information state
peculiar shapes in likelihood
numerical likelihood

Parameter estimation:
- low-dim:
  Nested sampling - MultiNest
- high-dim: HMCMC with multiple chains - Stan
Model selection:
- low-dim:
  Nested sampling - MultiNest
- high-dim: fund me! NS + some MCMC variants -- experimental work needed!

Alternative methods for model selection

As $n\rightarrow\infty$, information well constrained in a

single
prior-independent
Gaussian-shaped maximum
inside the parameter space

$\Delta \left(- 2\log\cal{L}\right) \sim \chi^2$ distributed
LaPlace Approximation, Wilks' theorem, F-test, LR-test, BIC, AIC
Unification: Watanabe (2014), WBIC/WAIC.
"Wilks's theorem should not apply — YET it works!!"
Narrow field of validity. General solution is computing Z.

II: Calibrated Bayes

BI is theoretically motivated. Is it any good?
Any application has approximations and assumptions in
- model, distributions, data extraction, prior
Motivation for Bayes factor/level to cut at (free parameter?)