
## Decision making and model comparison with Python

A typical problem in natural science and engineering is to select, among many candidate models, the one that best fits the observed data. The following figures show fits of a linear, a quadratic, and a cubic polynomial to the same test data (the test data was parabolic, with added noise on both axes).

#### The question is: which model best fits the observed data?

A classical approach is the reduced chi-square, which takes into account the degrees of freedom of each fit. In Bayesian statistics, we can instead use the Bayes factor. The aim of the Bayes factor is to quantify the support for one model over another. When we want to choose between two models based on observed data, we calculate the evidence of each model, also called the marginalized likelihood; the Bayes factor is the ratio of the two evidences.

K = P(D|M1) / P(D|M2)

A value of K larger than one means support for model M1. There are, however, established guidelines for interpreting the Bayes factor; Kass and Raftery (1995) presented the following table:

| K         | Strength of evidence                |
|-----------|-------------------------------------|
| 1 to 3    | not worth more than a bare mention  |
| 3 to 20   | positive                            |
| 20 to 150 | strong                              |
| > 150     | very strong                         |

In Python, this can be done in several ways. One option is the parallel-tempering (PT) sampler in EMCEE. Another is PyMultiNest, a Python interface to the MultiNest package, which performs importance nested sampling.

In our example, the PT sampler finds strong support for the parabolic model over the cubic model, and at the same time very strong support for the parabolic model over the linear model. Please note that the Bayes factor does not tell us which model is correct; it only quantifies which model is preferred given the observed data.
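A full PT-sampler evidence calculation is too long for a snippet, but the flavor of the decision can be sketched with the BIC approximation to the Bayes factor, K ≈ exp((BIC₂ − BIC₁)/2). This is a rough substitute for the evidence ratio used above, not the same computation; the data and names below are illustrative:

```python
import numpy as np

# synthetic parabolic data with Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(-10, 10, 100)
y = -0.07 * x**2 + 0.5 * x + 2.0 + rng.normal(0.0, 1.0, x.size)

def bic(deg):
    # least-squares polynomial fit of the given degree
    coeffs = np.polyfit(x, y, deg)
    rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    n, k = x.size, deg + 1
    # Gaussian-likelihood BIC: n*ln(RSS/n) + k*ln(n)
    return n * np.log(rss / n) + k * np.log(n)

bic_lin, bic_quad = bic(1), bic(2)
# BIC approximation to the Bayes factor for quadratic vs linear
K = np.exp((bic_lin - bic_quad) / 2.0)
print(K)  # a large K strongly favors the quadratic model
```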

## Python: comparison of median, Gaussian, and RBF filtering

There are several methods to smooth a noisy signal. In this post I compare three common smoothing methods, namely a median filter, a Gaussian filter, and Radial Basis Function (RBF) smoothing. RBF is a powerful tool not only for multivariate data smoothing, but also for interpolation, regression, etc. The following figure shows the strong performance of RBF compared to the median and Gaussian filters. The synthetic data was modified with Gaussian noise. I have used the 'quintic' kernel in this example.

Comparison of the RBF smoothing with the median and Gaussian filtering in a one-dimensional example.

The core program is short, since RBF smoothing is built into SciPy:

from scipy.interpolate import Rbf
rbf = Rbf(x, y, function='quintic', smooth=0.1)
s = rbf(x)
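A self-contained sketch of the three-way comparison on synthetic data of my own (not the exact data in the figure): median_filter and gaussian_filter1d from scipy.ndimage stand in for the median and Gaussian filters, and the legacy Rbf class does the RBF smoothing (newer SciPy versions also offer RBFInterpolator). The window size, sigma, and smooth values are illustrative choices:

```python
import numpy as np
from scipy.ndimage import median_filter, gaussian_filter1d
from scipy.interpolate import Rbf

# noisy sine wave as test signal
rng = np.random.default_rng(1)
x = np.linspace(0, 4 * np.pi, 200)
clean = np.sin(x)
noisy = clean + rng.normal(0.0, 0.3, x.size)

smoothed = {
    'median':   median_filter(noisy, size=9),          # 9-point moving median
    'gaussian': gaussian_filter1d(noisy, sigma=3),     # Gaussian kernel, sigma in samples
    'rbf':      Rbf(x, noisy, function='quintic', smooth=1.0)(x),
}

# root-mean-square error of each filter against the clean signal
for name, s in smoothed.items():
    print(name, np.sqrt(np.mean((s - clean) ** 2)))
```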

## Python: fit with errors on both axes

In an earlier post, I discussed a least-square fit with errors on the y-axis (for statistical fits, check the PyMC and EMCEE posts). In almost any fit, having an estimate of the fit uncertainty is a must. The better we know the noise characteristics of the experiment, the better we can estimate the uncertainties of the fit parameters. In this post, I show a more serious example in which we have errors on both axes.

Unlike the previous example, we do not use SciPy's curve_fit. Instead, we use scipy.odr, a dedicated module for orthogonal distance regression (ODR). The program, with some comments, is shown below:

import numpy as np
import matplotlib.pyplot as plt
from scipy import odr

# quadratic model; note the (p, x) argument order required by scipy.odr
def func(p, x):
    a, b, c = p
    return a * x * x + b * x + c

# Model object
quad_model = odr.Model(func)

# test data and error
x0 = np.linspace(-10, 10, 100)
y0 = -0.07 * x0 * x0 + 0.5 * x0 + 2.
noise_x = np.random.normal(0.0, 1.0, len(x0))
noise_y = np.random.normal(0.0, 1.0, len(x0))
x = x0 + noise_x
y = y0 + noise_y

# Create a RealData object: sx and sy are the 1-sigma measurement
# errors (here 1.0), not the noise realizations themselves
data = odr.RealData(x, y, sx=1.0, sy=1.0)

# Set up ODR with the model and data.
odr_obj = odr.ODR(data, quad_model, beta0=[0., 1., 1.])

# Run the regression.
out = odr_obj.run()

# print fit parameters and 1-sigma estimates
popt = out.beta
perr = out.sd_beta
print('fit parameter 1-sigma error')
print('-----------------------------------')
for i in range(len(popt)):
    print(str(popt[i]) + ' +- ' + str(perr[i]))

# prepare confidence level curves
nstd = 5.  # to draw 5-sigma intervals
popt_up = popt + nstd * perr
popt_dw = popt - nstd * perr

x_fit = np.linspace(min(x), max(x), 100)
fit = func(popt, x_fit)
fit_up = func(popt_up, x_fit)
fit_dw = func(popt_dw, x_fit)

# plot
fig, ax = plt.subplots(1)
plt.rcParams['font.size'] = 20
plt.errorbar(x, y, yerr=1.0, xerr=1.0, ecolor='k', fmt='none', label='data')
plt.xlabel('x', fontsize=18)
plt.ylabel('y', fontsize=18)
plt.title('fit with errors on both axes', fontsize=18)
plt.plot(x_fit, fit, 'r', lw=2, label='best fit curve')
plt.plot(x0, y0, 'k--', lw=2, label='True curve')
ax.fill_between(x_fit, fit_up, fit_dw, alpha=.25, label='5-sigma interval')
plt.legend(loc='lower right', fontsize=18)
plt.show()

Please note that Python is case sensitive, so keep the upper/lower case of the commands above exactly as written. A general comment is that you can easily replace the second-order function of this example with any desired model function. The uncertainties of the fit parameters are estimated in the standard way, from the square roots of the diagonal elements of the covariance matrix.
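To illustrate swapping in a different model, here is a hedged sketch fitting an exponential with the same scipy.odr machinery; the function, data, and starting values are made up for this example:

```python
import numpy as np
from scipy import odr

# any model works as long as it has the (p, x) signature
def expo(p, x):
    a, b = p
    return a * np.exp(b * x)

# synthetic exponential data with small noise on both axes
rng = np.random.default_rng(2)
x = np.linspace(0, 2, 50) + rng.normal(0.0, 0.02, 50)
y = 3.0 * np.exp(0.8 * x) + rng.normal(0.0, 0.05, 50)

model = odr.Model(expo)
data = odr.RealData(x, y, sx=np.full(50, 0.02), sy=np.full(50, 0.05))
out = odr.ODR(data, model, beta0=[2.0, 1.0]).run()
print(out.beta)  # should be close to the true values (3.0, 0.8)
```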

## Python: fit with error on Y-axis

In some earlier posts, I discussed statistical fits with PyMC and EMCEE. The advantage of statistical methods is that they are not sensitive to the form of the chi-square function. This is important in cases where the merit function does not have a well-defined minimum. The advantage of chi-square methods is that they are generally much faster. In this post, I show a typical example of a least-square fit with measurement errors. As usual, we are interested in estimating the fit parameters as well as their uncertainties.

As you see in the above example, we fit a simple function with measured y-errors, estimate the fit parameters and their uncertainties, and plot a confidence band of a given width. The program is shown below:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * x * x + b * x + c

# test data and error
x = np.linspace(-10, 10, 100)
y0 = -0.07 * x * x + 0.5 * x + 2.
noise = np.random.normal(0.0, 1.0, len(x))
y = y0 + noise

# curve fit [with only y-error]: sigma is the 1-sigma measurement
# error of each point (here 1.0), not the noise realization
popt, pcov = curve_fit(func, x, y, sigma=np.ones_like(x), absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))

# print fit parameters and 1-sigma estimates
print('fit parameter 1-sigma error')
print('-----------------------------------')
for i in range(len(popt)):
    print(str(popt[i]) + ' +- ' + str(perr[i]))

# prepare confidence level curves
nstd = 5.  # to draw 5-sigma intervals
popt_up = popt + nstd * perr
popt_dw = popt - nstd * perr

fit = func(x, *popt)
fit_up = func(x, *popt_up)
fit_dw = func(x, *popt_dw)

# plot
fig, ax = plt.subplots(1)
plt.rcParams['xtick.labelsize'] = 18
plt.rcParams['ytick.labelsize'] = 18
plt.rcParams['font.size'] = 20
plt.errorbar(x, y, yerr=1.0, xerr=0, ecolor='k', fmt='none', label='data')
plt.xlabel('x', fontsize=18)
plt.ylabel('y', fontsize=18)
plt.title('fit with only Y-error', fontsize=18)
plt.plot(x, fit, 'r', lw=2, label='best fit curve')
plt.plot(x, y0, 'k--', lw=2, label='True curve')
ax.fill_between(x, fit_up, fit_dw, alpha=.25, label='5-sigma interval')
plt.legend(loc='lower right', fontsize=18)
plt.show()

Please note that supplying the measurement error is optional. If you do not have y-errors, simply omit the sigma argument in the fit call:

# curve fit [without y-error]
popt, pcov = curve_fit(func, x, y)

You still get an estimate of the uncertainty of the fit parameters, although it is less reliable. In the next post, I show an example of a least-square fit with errors on both axes.
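A detail worth knowing about curve_fit: by default it rescales the returned covariance so that the reduced chi-square equals one, even when you pass sigma; absolute_sigma=True disables that rescaling and treats sigma as the true measurement error. A minimal sketch with synthetic data (names are illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * x**2 + b * x + c

rng = np.random.default_rng(42)
x = np.linspace(-10, 10, 100)
y = -0.07 * x**2 + 0.5 * x + 2.0 + rng.normal(0.0, 1.0, x.size)

# known 1-sigma error: absolute_sigma=True keeps pcov in absolute units
p1, c1 = curve_fit(func, x, y, sigma=np.ones_like(x), absolute_sigma=True)
# no sigma: pcov is rescaled so the reduced chi-square equals one
p2, c2 = curve_fit(func, x, y)

# best-fit parameters agree; the error estimates can differ
print(np.sqrt(np.diag(c1)))
print(np.sqrt(np.diag(c2)))
```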

## Python 2 and Python 3 side by side: conda

Python 3 is gradually replacing Python 2, and in some of the newest Linux distributions, such as Fedora 23, it is installed by default.

So the question is: if you have a library of Python 2.x programs and you want to start learning Python 3 and updating your code, how can you install all the necessary packages (matplotlib, scipy, numpy, etc.) for both versions of Python without messing up the system? This matters because Linux desktops, e.g. GNOME and KDE, use Python themselves.

Using Anaconda or Miniconda is one solution. I tried Miniconda since I did not want the whole Anaconda package set. According to its webpage, “Anaconda is a completely free Python distribution … It includes over 195 of the most popular Python packages for science, math, engineering, data analysis.” Both Anaconda and Miniconda are available for Linux, Mac OS X, and Windows. This way, you install a fresh version of Python in your home directory, so nothing you do to it affects your operating system.

#### Package installation using conda

Assume you have installed Miniconda, so the “conda” command is now available on the command line. We can create two separate environments for Python 2 and Python 3 packages:

$ conda create -n mypy2 python=2 numpy

$ conda create -n mypy3 python=3 numpy

and similarly you can install any other package, e.g. ipython, scipy, etc.

When the environment exists, you can use the normal “install” command, much like pip:

$ conda install numpy python=3

#### How to use it?

First, make sure the Miniconda bin directory is on your PATH:

export PATH="/home/user_name/miniconda/bin:$PATH"

Then in the command line, you can activate the environment you want to work with:

$ source activate mypy2

and you get an output in terminal like this:

prepending /home/user_name/miniconda/envs/mypy2/bin to PATH

When you are done, you can deactivate it:

$ source deactivate

conda has easy options to install, update, or remove a package or a whole environment. For instance, you can remove an entire environment at once:

$ conda remove -n mypy2 --all

Typing “conda” in the terminal prints the help text:

$ conda
usage: conda [-h] [-V] [--debug] command ...

conda is a tool for managing and deploying applications, environments and packages.

Options:

positional arguments:
command
info       Display information about current conda install.
help       Displays a list of available conda commands and their help strings.
list       List linked packages in a conda environment.
search     Search for packages and display their information. The input is a regular expression. To perform a search with a search string that starts with a -, separate the search from the options with --, like 'conda search -- -h'. A * in the results means that package is installed in the current environment. A . means that package is not installed but is cached in the pkgs directory.
create     Create a new conda environment from a list of specified packages.
install    Install a list of packages into a specified conda environment.
update     Update conda packages to the current version.
remove     Remove a list of packages from a specified conda environment.
uninstall  Alias for conda remove. See conda remove --help.
run        Launches an application installed with conda. To include command line options in a command, separate the command from the other options with --, like conda run -- ipython --matplotlib
config     Modify configuration values in .condarc. This is modeled after the git config command. ...
clean      Remove unused packages and caches. ...