Tutorial: fitting a line

In this tutorial you will learn:

  • How to fit a line to data with error bars

  • How to obtain the intrinsic scatter and its uncertainties

  • How to quantify the numerical error of the sampling

  • How to compare empirical models

Lets say we want to fit a line to some data points.

Here is our data: measurements of three observables (The bulge mass of galaxies, the velocity dispersion and the mass of the black hole.

[1]:
import numpy as np
# Black hole data from Kormendy & Ho (2014) https://arxiv.org/abs/1304.7762 https://arxiv.org/abs/1308.6483
# Bulge mass and error (log Msun)
mB = np.array([9.05, 11.84, 11.27, 10.65, 11.5, 11.74, 11.33, 10.26, 11.06, 11.61, 10.5, 10.91, 11.26, 11.01, 11.77, 11.65, 10.85, 11.62, 11.51, 10.88, 11.84, 10.85, 11.72, 9.64, 11.64, 10.97, 11.16, 12.09, 11.28, 11.05, 11.65, 11.6, 11.0, 10.57, 11.69, 11.25, 11.61, 11.65, 11.75, 11.6, 11.81, 11.78])
mBerr = np.array([0.1, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.1, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.1, 0.09, 0.09, 0.09, 0.1, 0.09])
# Black hole mass, errors, and magnitude
mBH = np.log10([2.45e+06, 1.69e+08, 1.47e+09, 5.90e+08, 8.81e+08, 4.65e+09, 3.87e+09,
 1.45e+07, 1.08e+07, 3.72e+09, 1.78e+08, 4.16e+08, 1.37e+08, 4.65e+08,
 9.09e+09, 5.29e+08, 9.78e+08, 9.25e+08, 1.30e+07, 6.96e+07, 2.54e+09,
 9.00e+07, 6.15e+09, 6.00e+08, 4.72e+09, 2.02e+08, 1.71e+09, 2.08e+10,
 8.55e+08, 5.69e+07, 4.75e+09, 3.69e+09, 2.73e+08, 4.87e+08, 3.74e+09,
 2.10e+09, 3.96e+08, 2.30e+09, 1.34e+09, 2.48e+09, 3.74e+09, 1.30e+09])
mBHlo = np.log10([1.43e+06, 1.39e+08, 1.27e+09, 5.39e+08, 4.35e+08, 4.24e+09, 3.16e+09,
 0.10e+07, 1.03e+07, 3.21e+09, 8.50e+07, 3.12e+08, 9.00e+07, 3.66e+08,
 6.28e+09, 4.21e+08, 6.70e+08, 8.38e+08, 0.10e+07, 5.62e+07, 2.44e+09,
 4.50e+07, 5.78e+09, 4.00e+08, 3.67e+09, 1.52e+08, 1.52e+09, 4.90e+09,
 4.07e+08, 4.65e+07, 2.81e+09, 2.65e+09, 1.94e+08, 3.34e+08, 2.59e+09,
 2.00e+09, 2.40e+08, 2.19e+09, 9.30e+08, 2.29e+09, 3.22e+09, 1.11e+09])
mBHhi = np.log10([3.460e+06, 1.970e+08, 1.680e+09, 6.510e+08, 1.781e+09, 5.380e+09,
 4.480e+09, 2.910e+07, 1.120e+07, 3.830e+09, 2.720e+08, 5.200e+08,
 1.820e+08, 5.640e+08, 1.143e+10, 6.360e+08, 1.286e+09, 1.023e+09,
 2.240e+08, 8.290e+07, 3.120e+09, 1.350e+08, 6.530e+09, 9.000e+08,
 5.760e+09, 2.530e+08, 1.810e+09, 3.660e+10, 1.293e+09, 6.730e+07,
 5.630e+09, 3.790e+09, 3.410e+08, 6.400e+08, 5.500e+09, 2.730e+09,
 6.720e+08, 3.450e+09, 1.850e+09, 2.960e+09, 4.160e+09, 1.500e+09])

# Velocity dispersion and error (km/s)
sigma = np.array([77.0, 226.0, 328.0, 167.0, 315.0, 276.0, 270.0, 175.0, 166.0, 297.0, 145.0, 206.0, 229.0, 182.0, 270.0, 315.0, 242.0, 296.0, 182.0, 167.0, 300.0, 190.0, 324.0, 185.0, 380.0, 177.0, 355.0, 347.0, 222.0, 150.0, 333.0, 328.0, 183.0, 239.0, 318.0, 389.0, 266.0, 292.0, 257.0, 331.0, 288.0, 322.0])
sigmaerr = np.array([3.0, 9.0, 9.0, 3.0, 3.0, 2.0, 10.0, 8.0, 16.0, 12.0, 7.0, 10.0, 11.0, 9.0, 27.0, 15.0, 12.0, 14.0, 5.0, 8.0, 7.0, 9.0, 28.0, 9.0, 19.0, 8.0, 14.0, 5.0, 11.0, 7.0, 2.0, 11.0, 9.0, 11.0, 2.0, 3.0, 13.0, 5.0, 26.0, 5.0, 14.0, 16.0])

n_data = len(mBerr)

Visualise the data

Lets plot the data first to see what is going on:

[2]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure()
xlabel = 'Bulge mass [log, $M_\odot$]'
ylabel = 'Velocity dispersion [km/s]'
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.errorbar(x=mB, xerr=mBerr, y=sigma, yerr=sigmaerr,
             marker='o', ls=' ', color='orange')
plt.yscale('log')
_images/example-line_3_0.svg

Data properties

This scatter plot shows:

  • error bars in both x and y

  • asymmetric error bars

  • intrinsic scatter

Resampling the data

We could also represent each data point by a cloud of samples. Each point represents a possible true solution of that galaxy.

[3]:
samples = []

plt.figure()
for i in range(n_data):
    # draw normal random points

    # scale according to error bars and values
    samples_mBi = np.random.normal(mB[i], mBerr[i], size=400)

    # same for sigma
    samples_sigmai = np.random.normal(sigma[i], sigmaerr[i], size=400)

    # we will work in log-sigma:
    samples_logsigmai = np.log10(samples_sigmai)

    samples.append([samples_mBi, samples_logsigmai])

    # for each galaxy, plot alittle cloud with its own colors
    plt.scatter(samples_mBi, samples_logsigmai, s=2, marker='x')

samples = np.array(samples)
xlabel = 'Bulge mass [log, $M_\odot$]'
ylabel = 'Velocity dispersion [log, km/s]'
plt.xlabel(xlabel)
plt.ylabel(ylabel)

[3]:
Text(0, 0.5, 'Velocity dispersion [log, km/s]')
_images/example-line_5_1.svg
[4]:
samples.shape
[4]:
(42, 2, 400)

Model

Lets fit a line model with intrinsic, gaussian scatter.

\[y \sim \mathrm{Normal}(x \times alpha + beta, sigma)\]

The model has three unknown parameters:

  • the slope \(\alpha\)

  • the offset \(\beta\)

  • the scatter \(\sigma\)

Lets write down prior ranges for these parameters:

[5]:
parameters = ['slope', 'offset', 'scatter']

def prior_transform(cube):
    # the argument, cube, consists of values from 0 to 1
    # we have to convert them to physical scales

    params = cube.copy()
    # let slope go from -3 to +3
    lo = -3
    hi = +3
    params[0] = cube[0] * (hi - lo) + lo
    # let offset go from 10 to 1000 km/s -- use log
    lo = np.log10(10)
    hi = np.log10(1000)
    params[1] = cube[1] * (hi - lo) + lo
    # let scatter go from 0.001 to 10
    lo = np.log10(0.001)
    hi = np.log10(10)
    params[2] = 10**(cube[2] * (hi - lo) + lo)
    return params

Define the likelihood, which measures how far the data are from the model predictions. More precisely, how often the parameters would arise under the given parameters.

[6]:
import scipy.stats

def log_likelihood(params):
    # unpack the current parameters:
    slope, offset, scatter = params

    # compute for each x point, where it should lie in y
    y_expected = (samples[:,0] - 10) * slope + offset
    # compute the probability of each sample
    probs_samples = scipy.stats.norm(y_expected, scatter).pdf(samples[:,1])
    # average over each galaxy, because we assume one of the points is the correct one (logical OR)
    probs_objects = probs_samples.mean(axis=1)
    assert len(probs_objects) == n_data
    # multiply over the galaxies, because we assume our model holds true for all objects (logical AND)
    # for numerical stability, we work in log and avoid zeros
    loglike = np.log(probs_objects + 1e-100).sum()
    return loglike

Implicitly, this model assumes that the velocity dispersion is predicted by the bulge mass. Alternatively, one could flip the axes, or define the scatter orthogonally. But lets stick with our approach for now.

Solving the problem

[7]:
import ultranest

sampler = ultranest.ReactiveNestedSampler(parameters, log_likelihood, prior_transform)

Lets first try with relatively poor sampling:

[8]:
result = sampler.run(min_num_live_points=50, min_ess=100) # you can increase these numbers later
[ultranest] Sampling 50 live points from prior ...
[ultranest] Explored until L=4e+01
[ultranest] Likelihood function evaluations: 2697
[ultranest]   logZ = 29.81 +- 0.2828
[ultranest] Effective samples strategy satisfied (ESS = 242.7, need >100)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.44+-0.24 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.81, need <0.5)
[ultranest]   logZ error budget: single: 0.44 bs:0.28 tail:0.00 total:0.28 required:<0.50
[ultranest] done iterating.
[9]:
from ultranest.plot import cornerplot
cornerplot(sampler.results)
_images/example-line_16_0.svg
[10]:
plt.figure()
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.errorbar(x=mB, xerr=mBerr, y=sigma, yerr=sigmaerr,
             marker='o', ls=' ', color='orange')

from ultranest.plot import PredictionBand
x = np.linspace(9, 12.5, 400)
band = PredictionBand(x)
band_lo = PredictionBand(x)
band_hi = PredictionBand(x)

for params in sampler.results['samples'][:40]:
    slope, offset, scatter = params
    y = (x - 10) * slope + offset
    band.add(10**y)

    # indicate intrinsic scatter
    band_hi.add(10**(y + scatter))
    band_lo.add(10**(y - scatter))

band.shade(color='k', alpha=0.1)
band.line(color='k')
band_lo.line(color='r', ls='--')
band_hi.line(color='r', ls='--')

plt.yscale('log');

_images/example-line_17_0.svg

Understanding the uncertainties

Lets focus on the scatter parameter.

We want to understand how well we can constrain it.

We can make a histogram of the posterior samples:

[11]:
scatter_samples = result['weighted_samples']['points'][:,2]
weights = result['weighted_samples']['weights']

plt.figure(figsize=(8, 2.5))

bins=np.linspace(0.01, 0.2, 100)
plt.hist(scatter_samples, weights=weights, bins=bins, density=True, histtype='step')

plt.xlabel('sigma')
plt.ylabel("Posterior probability")

[11]:
Text(0, 0.5, 'Posterior probability')
_images/example-line_19_1.svg

Quantifying the distribution tails

But how well do we really know this probability distribution? We put uncertainties on it, because UltraNest provides bootstrapped weights, which emulates multiple (30) runs with a different number of live points.

[12]:
result['weighted_samples']['bootstrapped_weights'].shape
[12]:
(899, 30)
[13]:
from fastkde import fastKDE
[14]:
plt.figure()
bins=np.linspace(0.01, 0.2, 64+1)
scatter_samples = result['samples'][:,2]

pdf, _ = fastKDE.pdf(scatter_samples, axes=(bins,))
plt.plot(bins, pdf, color='k')

from ultranest.plot import PredictionBand
from ultranest.integrator import resample_equal
band = PredictionBand(bins)

for weights in result['weighted_samples']['bootstrapped_weights'].transpose():
    scatter_samples = resample_equal(result['weighted_samples']['points'][:,2], weights)
    pdf, _ = fastKDE.pdf(scatter_samples, axes=(bins,))
    band.add(pdf)

band.line(ls='--', color='r', alpha=0.5)
band.shade(0.49, color='r', alpha=0.1)


plt.xlabel('$\sigma$')
plt.ylabel("Posterior probability")
#plt.yscale('log')
plt.ylim(1e-3, 50);

/home/user/.local/lib/python3.6/site-packages/fastkde/fastKDE.py:464: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  *(1+sqrt(1-ecfThresh/ecfSq[iCalcPhi]))
/home/user/.local/lib/python3.6/site-packages/fastkde/fastKDE.py:472: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  self.phiSC[iCalcPhi] = self.ECF[iCalcPhi]*kappaSC[iCalcPhi]
/home/user/.local/lib/python3.6/site-packages/fastkde/fastKDE.py:464: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  *(1+sqrt(1-ecfThresh/ecfSq[iCalcPhi]))
/home/user/.local/lib/python3.6/site-packages/fastkde/fastKDE.py:472: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  self.phiSC[iCalcPhi] = self.ECF[iCalcPhi]*kappaSC[iCalcPhi]
_images/example-line_23_1.svg

title

Task for you

Edit this notebook. Try adjusting the number of live points (min_num_live_points) and effective sample size (min_ess) parameters above to decrease the uncertainties.

Comparing empirical models

We are using an ad-hoc/empirical model function, and it does not have well-defined priors. Because of that, doing Bayesian model comparison with Bayes factors does not make sense.

Instead, we can compare models based on their information content, and their prediction power.

Lets see how much better the line model is to a constant.

[15]:
parameters0 = ['offset', 'scatter']

def prior_transform0(cube):
    params = cube.copy()
    # let offset go from 10 to 1000 km/s -- use log
    lo = np.log10(10)
    hi = np.log10(1000)
    params[0] = cube[0] * (hi - lo) + lo
    # let scatter go from 0.001 to 10
    lo = np.log10(0.001)
    hi = np.log10(10)
    params[1] = 10**(cube[1] * (hi - lo) + lo)
    return params


def log_likelihood0(params):
    # unpack the current parameters:
    offset, scatter = params

    # compute for each x point, where it should lie in y
    y_expected = offset
    # compute the probability of each sample
    probs_samples = scipy.stats.norm(y_expected, scatter).pdf(samples[:,1])
    # average over each galaxy, because we assume one of the points is the correct one (logical OR)
    probs_objects = probs_samples.mean(axis=1)
    assert len(probs_objects) == n_data
    # multiply over the galaxies, because we assume our model holds true for all objects (logical AND)
    # for numerical stability, we work in log and avoid zeros
    loglike = np.log(probs_objects + 1e-100).sum()
    return loglike

[16]:
sampler0 = ultranest.ReactiveNestedSampler(parameters0, log_likelihood0, prior_transform0)
result0 = sampler0.run()
[ultranest] Sampling 400 live points from prior ...
[ultranest] Explored until L=2e+01
[ultranest] Likelihood function evaluations: 6661
[ultranest]   logZ = 14.84 +- 0.08229
[ultranest] Effective samples strategy satisfied (ESS = 1671.5, need >400)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.45+-0.08 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.20, need <0.5)
[ultranest]   logZ error budget: single: 0.12 bs:0.08 tail:0.01 total:0.08 required:<0.50
[ultranest] done iterating.

Model comparison with AIC

Here we compute the Akaike information criterion (AIC). https://en.wikipedia.org/wiki/Akaike_information_criterion

The model with the lowest AIC should be preferred.

[17]:
Lmax0 = result0['weighted_samples']['logl'].max()
AIC0 = -2 * Lmax0 + len(parameters0)

Lmax1 = result['weighted_samples']['logl'].max()
AIC1 = -2 * Lmax1 + len(parameters)

print("AIC of constant model: %d" % AIC0)
print("AIC of line model    : %d" % AIC1)

AIC of constant model: -41
AIC of line model    : -79

The line model is doing better according to the AIC.

Model comparison by prediction power

We can also leave out some data points and see how well the model, trained on the others, predicts the unseen data points.

There are many ways to leave points out (K-Fold, LOO, bootstrapping). Here we use a 5-fold cross-validation.

[18]:
samples_orig = samples.copy()
[19]:
from scipy.special import logsumexp
[20]:
Kpredicts = []

for lo, hi in [(9, 9.5), (9.5, 10), (10, 10.5), (10.5, 11), (11, 11.5), (11.5, 12.2)]:
#for lo, hi in [(9, 10), (10, 11), (11, 12.2)]:
    # leave out samples within that interval
    excluded = np.logical_and(mB >= lo, mB < hi)
    # all the others are allowed
    included = ~excluded
    # set samples (used inside likelihood functions)
    samples = samples_orig[included]
    n_data = len(samples)

    # analyse with line model
    sampler1 = ultranest.ReactiveNestedSampler(parameters, log_likelihood, prior_transform)
    result1 = sampler1.run()

    # analyse with constant model
    sampler0 = ultranest.ReactiveNestedSampler(parameters0, log_likelihood0, prior_transform0)
    result0 = sampler0.run()

    # now set the samples to the withheld data
    samples = samples_orig[excluded]
    n_data = len(samples)
    # get the log of the mean likelihood from each model
    Zpredict0 = logsumexp([log_likelihood0(sample) for sample in result0['samples']])
    Zpredict1 = logsumexp([log_likelihood(sample) for sample in result1['samples']])

    Kpredicts.append(Zpredict1 - Zpredict0)


[ultranest] Sampling 400 live points from prior ...
[ultranest] Explored until L=4e+01
[ultranest] Likelihood function evaluations: 12060
[ultranest]   logZ = 29.07 +- 0.09791
[ultranest] Effective samples strategy satisfied (ESS = 1917.1, need >400)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.46+-0.07 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.26, need <0.5)
[ultranest]   logZ error budget: single: 0.16 bs:0.10 tail:0.01 total:0.10 required:<0.50
[ultranest] done iterating.
[ultranest] Sampling 400 live points from prior ...
[ultranest] Explored until L=3e+01
[ultranest] Likelihood function evaluations: 6644
[ultranest]   logZ = 21.07 +- 0.1109
[ultranest] Effective samples strategy satisfied (ESS = 1591.6, need >400)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.46+-0.07 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.26, need <0.5)
[ultranest]   logZ error budget: single: 0.12 bs:0.11 tail:0.01 total:0.11 required:<0.50
[ultranest] done iterating.
[ultranest] Sampling 400 live points from prior ...
[ultranest] Explored until L=4e+01
[ultranest] Likelihood function evaluations: 12899
[ultranest]   logZ = 30.28 +- 0.1028
[ultranest] Effective samples strategy satisfied (ESS = 1951.0, need >400)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.45+-0.05 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.21, need <0.5)
[ultranest]   logZ error budget: single: 0.16 bs:0.10 tail:0.01 total:0.10 required:<0.50
[ultranest] done iterating.
[ultranest] Sampling 400 live points from prior ...
[ultranest] Explored until L=2e+01
[ultranest] Likelihood function evaluations: 6618
[ultranest]   logZ = 14.29 +- 0.1092
[ultranest] Effective samples strategy satisfied (ESS = 1643.7, need >400)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.46+-0.10 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.24, need <0.5)
[ultranest]   logZ error budget: single: 0.12 bs:0.11 tail:0.01 total:0.11 required:<0.50
[ultranest] done iterating.
[ultranest] Sampling 400 live points from prior ...
[ultranest] Explored until L=4e+01
[ultranest] Likelihood function evaluations: 12624
[ultranest]   logZ = 28.05 +- 0.139
[ultranest] Effective samples strategy satisfied (ESS = 1931.4, need >400)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.46+-0.07 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.30, need <0.5)
[ultranest]   logZ error budget: single: 0.16 bs:0.14 tail:0.01 total:0.14 required:<0.50
[ultranest] done iterating.
[ultranest] Sampling 400 live points from prior ...
Z=-36.2(0.00%) | Like=-30.69..21.40 [-9440.5989..19.6820] | it/evals=1000/1694 eff=77.2798% N=400

So lets look at the prediction quality of line model compared to constant model: Positive values indicate preference for the line model, each entry is a K-fold instance.

[21]:
Kpredicts
[21]:
[6.924367600568345,
 -1.359872939704573,
 1.105160894784154,
 4.803849914056784,
 -0.21356533540923373,
 12.229240813519034]

averaging the results, we get:

[22]:
np.mean(Kpredicts)
[22]:
3.9148634913024183

Again positive, so in terms of prediction, the line model is better.