Tutorial: distribution with outliers

In this tutorial you will learn:

How to deal with outliers and non-gaussian distributions

In the “intrinsic distribution” tutorial, we fitted a gaussian distribution. But maybe we do not want to make such a strong assumption – we may not know the exact distribution. Perhaps it is skewed, has several sub-populations. This can be tested by trying out different models.

Here, we check compare three scenarios:

baseline model: a gaussian distribution
baseline model + outlier model
heavy-tailed distribution

Here is our data:

[1]:

import numpy as np
# velocity dispersions of dwarf galaxies by van Dokkum et al., Nature, 555, 629 https://arxiv.org/abs/1803.10237v1

values = np.array([15, 4, 2, 11, 1, -2, -1, -14, -39, -3])
values_lo = np.array([7, 16, 6, 3, 6, 5, 10, 6, 11, 13])
values_hi = np.array([7, 15, 8, 3, 6, 6, 10, 7, 14, 14])

n_data = len(values)

Visualise the data

Lets plot the data first to see what is going on:

[2]:

%matplotlib inline
import matplotlib.pyplot as plt
np.random.seed(42)

samples = []
for i in range(n_data):
    # draw normal random points
    u = np.random.normal(size=400)
    v = values[i] + np.where(u < 0, u * values_lo[i], u * values_hi[i])
    samples.append(v)

samples = np.array(samples)

plt.figure()
# for each galaxy, plot alittle cloud with its own colors
plt.violinplot(samples.transpose(), vert=False, showextrema=False)

xlabel = 'Velocity [km/s]'
plt.xlabel(xlabel)
plt.xlim(-50, 50);

Models

Gaussian

[3]:

parameters = ['mean', 'scatter']

def prior_transform(cube):
    params = cube.copy()
    params[0] = cube[0] * 200 - 100 # mean from -100 to +100
    params[1] = 10**(cube[1] * 2)   # scatter from 1 to 100
    return params

import scipy.stats

def log_likelihood(params):
    probs_samples = scipy.stats.norm(*params).pdf(samples)
    probs_objects = probs_samples.mean(axis=1)
    loglike = np.log(probs_objects + 1e-100).sum()
    return loglike

import ultranest

sampler = ultranest.ReactiveNestedSampler(parameters, log_likelihood, prior_transform)
result = sampler.run()
sampler.print_results()

[ultranest] Sampling 400 live points from prior ...
[ultranest] Widening roots to 404 live points (have 400 already) ...
[ultranest] Sampling 4 live points from prior ...

[ultranest] Explored until L=-4e+01
[ultranest] Likelihood function evaluations: 5432
[ultranest]   logZ = -43.88 +- 0.0808
[ultranest] Effective samples strategy satisfied (ESS = 1676.6, need >400)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.46+-0.07 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.08, need <0.5)
[ultranest]   logZ error budget: single: 0.09 bs:0.08 tail:0.01 total:0.08 required:<0.50
[ultranest] done iterating.

logZ = -43.882 +- 0.143
  single instance: logZ = -43.882 +- 0.093
  bootstrapped   : logZ = -43.879 +- 0.142
  tail           : logZ = +- 0.010
insert order U test : converged: True correlation: inf iterations

    mean                : -28.7 │ ▁      ▁▁▁▁▁▁▁▁▁▁▁▂▃▄▅▆▇▇▆▄▃▂▁▁▁▁▁▁▁▁ │17.4      0.1 +- 4.0
    scatter             : 1.0   │▁▂▄▆▇▇▆▅▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁  ▁          ▁ │57.2      9.5 +- 4.4

Gaussian + Outliers

[4]:

parameters2 = ['mean', 'scatter', 'w_outlier']

def prior_transform2(cube):
    params = cube.copy()
    params[0] = cube[0] * 200 - 100 # mean from -100 to +100
    params[1] = 10**(cube[1] * 2)   # scatter from 1 to 100
    params[2] = (cube[2] * 0.2)     # fraction of outliers: allow 0-20%
    return params

def log_likelihood2(params):
    mean, scatter, weight = params

    # compute probability if following distribution
    probs_samples = scipy.stats.norm(mean, scatter).pdf(samples)
    # compute probability density, if outlier:
    prob_outlier = scipy.stats.uniform(-100, 200).pdf(samples)

    # combine the two populations with weight:
    probs_samples = probs_samples * (1 - weight) + weight * prob_outlier

    # average within each object (logical OR)
    probs_objects = probs_samples.mean(axis=1)
    # multiply across object (logical AND)
    loglike = np.log(probs_objects + 1e-100).sum()
    return loglike


sampler2 = ultranest.ReactiveNestedSampler(parameters2, log_likelihood2, prior_transform2)
result2 = sampler2.run()
sampler2.print_results()

[ultranest] Sampling 400 live points from prior ...

[ultranest] Explored until L=-4e+01
[ultranest] Likelihood function evaluations: 7615
[ultranest]   logZ = -44.3 +- 0.06916
[ultranest] Effective samples strategy satisfied (ESS = 1844.5, need >400)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.46+-0.06 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.07, need <0.5)
[ultranest]   logZ error budget: single: 0.09 bs:0.07 tail:0.01 total:0.07 required:<0.50
[ultranest] done iterating.

logZ = -44.299 +- 0.129
  single instance: logZ = -44.299 +- 0.090
  bootstrapped   : logZ = -44.305 +- 0.129
  tail           : logZ = +- 0.010
insert order U test : converged: True correlation: inf iterations

    mean                : -44.6 │ ▁        ▁  ▁▁▁▁▁▁▂▂▃▅▇▇▇▄▂▁▁▁▁▁    ▁ │29.4      0.8 +- 4.5
    scatter             : 1.0   │▃▃▄▅▆▆▇▇▆▅▄▄▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁  ▁ ▁ ▁ │39.0      8.7 +- 4.5
    w_outlier           : 0.000 │▇▆▇▇▆▆▆▇▅▅▆▇▅▆▄▅▅▄▃▄▅▄▃▄▅▆▄▄▄▅▄▄▄▄▃▃▃▃▃│0.200     0.085 +- 0.057

Heavy-tailed distribution

[5]:

parameters3 = ['mean', 'scatter', 'dof']

def prior_transform3(cube):
    params = cube.copy()
    params[0] = cube[0] * 200 - 100 # mean from -100 to +100
    params[1] = 10**(cube[1] * 2)   # scatter from 1 to 100
    params[2] = 10**(cube[2])       # degrees of freedom in student-t: 1 (heavy tails)..10 (gaussian)
    return params

def log_likelihood3(params):
    mean, scatter, dof = params
    # compute student-t probability
    probs_samples = scipy.stats.t(loc=mean, scale=scatter, df=dof).pdf(samples)
    # average within each object (logical OR)
    probs_objects = probs_samples.mean(axis=1)
    # multiply across object (logical AND)
    loglike = np.log(probs_objects + 1e-100).sum()
    return loglike

sampler3 = ultranest.ReactiveNestedSampler(parameters3, log_likelihood3, prior_transform3)
sampler3.run()
sampler3.print_results()

[ultranest] Sampling 400 live points from prior ...

[ultranest] Explored until L=-4e+01
[ultranest] Likelihood function evaluations: 7362
[ultranest]   logZ = -44.16 +- 0.06484
[ultranest] Effective samples strategy satisfied (ESS = 1728.7, need >400)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.46+-0.06 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.07, need <0.5)
[ultranest]   logZ error budget: single: 0.09 bs:0.06 tail:0.01 total:0.07 required:<0.50
[ultranest] done iterating.

logZ = -44.168 +- 0.122
  single instance: logZ = -44.168 +- 0.092
  bootstrapped   : logZ = -44.164 +- 0.121
  tail           : logZ = +- 0.010
insert order U test : converged: True correlation: inf iterations

    mean                : -26.8 │ ▁   ▁ ▁▁▁▁▁▁▁▁▁▂▂▃▃▄▆▇▇▇▇▆▄▃▂▂▁▁▁▁▁▁▁ │18.8      0.7 +- 4.3
    scatter             : 1.0   │▃▄▆▇▇▇▆▅▄▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁     ▁ │41.2      7.5 +- 4.2
    dof                 : 1.0   │▇▇▅▆▆▅▅▄▄▄▄▃▂▃▃▃▃▂▂▃▂▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂│10.0      4.1 +- 2.5

Comparing the models

[6]:

plt.figure(figsize=(5, 12))
plt.xlabel(xlabel)

from ultranest.plot import PredictionBand
x = np.linspace(-50, 50, 400)


plt.subplot(3, 1, 1)
band = PredictionBand(x)

for params in sampler.results['samples'][:40]:
    mean, scatter = params
    band.add(scipy.stats.norm(mean, scatter).pdf(x))

band.shade(color='k', alpha=0.1, label="Gaussian")
band.line(color='k')
plt.legend(loc='upper right')
plt.title("Gaussian: lnZ=%.1f" % sampler.results['logz'])

plt.subplot(3, 1, 2)
band = PredictionBand(x)
bando = PredictionBand(x)

for params in sampler2.results['samples'][:40]:
    mean, scatter, outlier_weight = params
    band.add(scipy.stats.norm(mean, scatter).pdf(x) * (1 - outlier_weight))
    bando.add(scipy.stats.norm(-100, 200).pdf(x) * outlier_weight)

band.shade(color='k', alpha=0.1, label="Gaussian")
band.line(color='k');
bando.shade(color='orange', alpha=0.1, label="Outliers")
bando.line(color='orange');
plt.legend(loc='upper right')
plt.title("Gaussian + outliers: lnZ=%.1f" % sampler2.results['logz'])

plt.subplot(3, 1, 3)

for params in sampler3.results['samples'][:40]:
    mean, scatter, df = params
    band.add(scipy.stats.t(loc=mean, scale=scatter, df=df).pdf(x))

band.shade(color='k', alpha=0.1, label='Student-t')
band.line(color='k');
plt.title("Student-t: lnZ=%.1f" % sampler3.results['logz'])
plt.legend(loc='upper right');

The models look very similar in their predictions, and their lnZ values are also quite close. So I would not have a strong preference in any direction.

Lets compare the posterior distributions with getdist:

[7]:

from getdist import MCSamples, plots

samples_g = MCSamples(samples=sampler.results['samples'],
                       names=sampler.results['paramnames'],
                       label='Gaussian',
                       settings=dict(smooth_scale_2D=3), sampler='nested')
samples_o = MCSamples(samples=sampler2.results['samples'][:,:2],
                       names=sampler2.results['paramnames'][:2],
                       label='Gaussian + Outlier',
                       settings=dict(smooth_scale_2D=3), sampler='nested')
samples_t = MCSamples(samples=sampler3.results['samples'][:,:2],
                       names=sampler3.results['paramnames'][:2],
                       label='Student-t',
                       settings=dict(smooth_scale_2D=3), sampler='nested')

mcsamples = [samples_g, samples_o, samples_t]

g = plots.get_subplot_plotter(width_inch=8)
g.settings.num_plot_contours = 1
g.triangle_plot(mcsamples, filled=False, contour_colors=plt.cm.Set1.colors)

Removed no burn in
Removed no burn in
Removed no burn in

Next

Try adding an outlier.
Try adjusting the number of live points (min_num_live_points) and effective sample size (min_ess) parameters above to decrease the uncertainties.
Try varying the prior ranges.
Continue with the other tutorials and explore other UltraNest features