
## Python: fit with errors on both axes

In an earlier post, I discussed a least-squares fit with errors on the y-axis (for statistical fits, check the PyMC and emcee posts). In almost any fit, having an estimate of the fit uncertainty is a must. The better we know the noise characteristics of the experiment, the better we can estimate the uncertainties of the fit parameters. In this post, I show a more serious example in which we have errors on both axes.

Unlike the previous example, we do not use the curve_fit function of scipy. Instead, there is another dedicated module, scipy.odr, which performs orthogonal distance regression (ODR). The program, with some comments, is shown below:

import numpy as np
from pylab import *
from scipy import odr

def func(p, x):
    a, b, c = p
    return a * x * x + b * x + c

# Model object
quad_model = odr.Model(func)

# test data and error
x0 = np.linspace(-10, 10, 100)
y0 = -0.07 * x0 * x0 + 0.5 * x0 + 2.
noise_x = np.random.normal(0.0, 1.0, len(x0))
noise_y = np.random.normal(0.0, 1.0, len(x0))
y = y0 + noise_y
x = x0 + noise_x

# Create a RealData object; sx and sy are the 1-sigma standard deviations
# of the measurement errors (here 1.0), not the noise realizations themselves
data = odr.RealData(x, y, sx=np.ones_like(x), sy=np.ones_like(y))

# Set up ODR with the model and data (do not call the variable "odr",
# as that would shadow the scipy.odr module itself)
odr_fit = odr.ODR(data, quad_model, beta0=[0., 1., 1.])

# Run the regression.
out = odr_fit.run()

# print fit parameters and 1-sigma estimates
popt = out.beta
perr = out.sd_beta
print('fit parameter  1-sigma error')
print('-----------------------------------')
for i in range(len(popt)):
    print(str(popt[i]) + ' +- ' + str(perr[i]))

# prepare confidence level curves
nstd = 5.  # to draw 5-sigma intervals
popt_up = popt + nstd * perr
popt_dw = popt - nstd * perr

x_fit = np.linspace(min(x), max(x), 100)
fit = func(popt, x_fit)
fit_up = func(popt_up, x_fit)
fit_dw = func(popt_dw, x_fit)

# plot; the error bars show the known 1-sigma noise level on both axes
fig, ax = subplots(1)
rcParams['font.size'] = 20
errorbar(x, y, yerr=1.0, xerr=1.0, ecolor='k', fmt='none', label='data')
xlabel('x', fontsize=18)
ylabel('y', fontsize=18)
title('fit with errors on both axes', fontsize=18)
plot(x_fit, fit, 'r', lw=2, label='best fit curve')
plot(x0, y0, 'k--', lw=2, label='True curve')
ax.fill_between(x_fit, fit_up, fit_dw, alpha=.25, label='5-sigma interval')
legend(loc='lower right', fontsize=18)
show()
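As a side note, the ODR output object can also pretty-print a summary of the results directly; this assumes the out object from the script above:

out.pprint()   # prints beta, sd_beta, the covariance matrix, and the stopping reason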

Please note that Python is case sensitive, so do not change the upper/lower case of the commands above. A general comment: you can easily replace the second-order function of this example with any desired function. The method we used to estimate the uncertainties of the fit parameters is the standard one, based on the diagonal elements of the covariance matrix.
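For example, here is a minimal sketch of swapping in a different model; the exponential function below is only an illustration of my own, not part of the original script:

def exp_func(p, x):
    # a simple two-parameter exponential model
    a, b = p
    return a * np.exp(b * x)

exp_model = odr.Model(exp_func)
out = odr.ODR(data, exp_model, beta0=[1., 0.1]).run()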

## Python: fit with error on Y-axis

In earlier posts, I discussed statistical fits with PyMC and emcee. The advantage of statistical methods is that they are not sensitive to the form of the chi-square function. This is important in cases where the merit function does not have a well-defined minimum. The advantage of chi-square methods is that they are generally much faster. In this post, I show a typical example of a least-squares fit with measurement errors. As usual, we are interested in estimating the fit parameters as well as their uncertainties.

In this example, we fit a simple function to data with measured y-errors, estimate the fit parameters and their uncertainties, and plot a confidence band for a given range. The program is shown below:

import numpy as np
from pylab import *
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * x * x + b * x + c

# test data and error
x = np.linspace(-10, 10, 100)
y0 = -0.07 * x * x + 0.5 * x + 2.
noise = np.random.normal(0.0, 1.0, len(x))
y = y0 + noise

# curve fit [with only y-error]; sigma takes the standard deviations of the
# y measurements (here 1.0), and absolute_sigma=True treats them as absolute errors
popt, pcov = curve_fit(func, x, y, sigma=np.ones_like(y), absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))

# print fit parameters and 1-sigma estimates
print('fit parameter  1-sigma error')
print('-----------------------------------')
for i in range(len(popt)):
    print(str(popt[i]) + ' +- ' + str(perr[i]))

# prepare confidence level curves
nstd = 5.  # to draw 5-sigma intervals
popt_up = popt + nstd * perr
popt_dw = popt - nstd * perr

fit = func(x, *popt)
fit_up = func(x, *popt_up)
fit_dw = func(x, *popt_dw)

# plot
fig, ax = subplots(1)
rcParams['xtick.labelsize'] = 18
rcParams['ytick.labelsize'] = 18
rcParams['font.size'] = 20
errorbar(x, y, yerr=1.0, xerr=0, ecolor='k', fmt='none', label='data')  # the noisy data y, with 1-sigma error bars

xlabel('x', fontsize=18)
ylabel('y', fontsize=18)
title('fit with only Y-error', fontsize=18)
plot(x, fit, 'r', lw=2, label='best fit curve')
plot(x, y0, 'k--', lw=2, label='True curve')
ax.fill_between(x, fit_up, fit_dw, alpha=.25, label='5-sigma interval')
legend(loc='lower right', fontsize=18)
show()

Please note that using the measurement errors is optional. If you do not have y-errors, simply omit the sigma argument in the fit call:

# curve fit [without y-error]
popt, pcov = curve_fit(func, x, y)

You still get an estimate for the uncertainties of the fit parameters, although it is less reliable. In the next post, I show an example of a least-squares fit with errors on both axes.
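As a minimal sketch of the difference (reusing func, x, and y from the script above): without sigma, scipy rescales the covariance matrix by the residual variance of the fit, so the reported uncertainties reflect the scatter of the residuals rather than known measurement errors.

# with known 1-sigma measurement errors
popt1, pcov1 = curve_fit(func, x, y, sigma=np.ones_like(y), absolute_sigma=True)

# without measurement errors: pcov is rescaled by the residual variance
popt2, pcov2 = curve_fit(func, x, y)

print(np.sqrt(np.diag(pcov1)))
print(np.sqrt(np.diag(pcov2)))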

## Python 2 and Python 3 side by side: conda

Python 3 is gradually replacing Python 2, and in some of the newest Linux distributions, like Fedora 23, it is installed as the default.

So the question is: if you have a library of Python 2.x programs and you want to start learning Python 3 and updating your code, how can you install all the necessary packages like matplotlib, scipy, numpy, etc. for both versions of Python without messing up the system (since Linux desktops, e.g. GNOME, KDE, etc., use Python themselves)?

Using Anaconda or Miniconda is one of the solutions. I tried Miniconda since I do not want the whole set of Anaconda packages. According to its webpage, "Anaconda is a completely free Python distribution ... It includes over 195 of the most popular Python packages for science, math, engineering, data analysis." Both Anaconda and Miniconda are available for Linux, Mac OS X, and Windows. This way, you install a fresh version of Python in your home directory, so anything you do to it does not affect your operating system.

#### Package installation using conda

Assume you have installed Miniconda (or Anaconda). Now we have the command "conda" available on the command line. We can create two different environments for Python 2 and Python 3 packages:

$ conda create -n mypy2 python=2 numpy

$ conda create -n mypy3 python=3 numpy

and similarly you can install any other package, e.g. ipython, scipy, etc.

Once an environment exists, you can use the normal "install" command, much like pip, targeting the environment by name:

$ conda install -n mypy3 numpy

#### How to use it?

First, make sure the Miniconda bin directory is in your PATH:

export PATH="/home/user_name/miniconda/bin:$PATH"

Then in the command line, you can activate the environment you want to work with:

$ source activate mypy2

and you get output in the terminal like this:

prepending /home/user_name/miniconda/envs/mypy2/bin to PATH

When you are done, you can deactivate it:

$ source deactivate

conda has easy options to install, update, or remove a package or a whole environment. For instance, you can remove a whole environment at once:

$ conda remove -n mypy2 --all

Typing "conda" in the terminal, you get the help text:

$ conda
usage: conda [-h] [-V] [--debug] command ...

conda is a tool for managing and deploying applications, environments and packages.

Options:

positional arguments:
  command
    info       Display information about current conda install.
    help       Displays a list of available conda commands and their help strings.
    list       List linked packages in a conda environment.
    search     Search for packages and display their information. The input is a regular expression. To perform a search with a search string that starts with a -, separate the search from the options with --, like 'conda search -- -h'. A * in the results means that package is installed in the current environment. A . means that package is not installed but is cached in the pkgs directory.
    create     Create a new conda environment from a list of specified packages.
    install    Install a list of packages into a specified conda environment.
    update     Update conda packages to the current version.
    remove     Remove a list of packages from a specified conda environment.
    uninstall  Alias for conda remove. See conda remove --help.
    run        Launches an application installed with conda. To include command line options in a command, separate the command from the other options with --, like 'conda run -- ipython --matplotlib'.
    config     Modify configuration values in .condarc. This is modeled after the git config command. ...
    init       Initialize conda into a regular environment (when conda was installed as a Python package, e.g. using pip).
    clean      Remove unused packages and caches. ...

## Python pip

pip is a very helpful command to install Python packages, either with or without the root password, so you can install fresh Python packages in your home directory. To install pip itself, do one of the following, depending on whether you use RedHat/CentOS/Fedora or Debian/Ubuntu:

Debian: # apt-get install python-pip

Fedora: # yum install python-pip    (on Fedora 23, yum was replaced by dnf)

Then you can easily use pip to manage your Python packages:

$ pip install numpy --user      # --user indicates that you install it in your home directory

$ pip uninstall numpy

$ pip show numpy    # shows the installed version and directory of a package

Name: numpy
Version: 1.8.2
Location: /usr/lib/python2.7/dist-packages
Requires:

$ pip list      # prints a list of installed packages with their versions
$ pip list --outdated      # checks each package against the latest available version and marks the outdated ones

The following command reviews everything and reports problems (it is provided by the separate pip-review package):

# pip-review

#### What can be installed with pip?

There are several thousand packages that can be installed via pip. I do NOT recommend clicking on this link (the page is very big). The easiest way to check whether a particular package exists in pip is the following command:

$ pip search numpy

## EMCEE: yet another MCMC fitting in Python

So far, I have introduced PyMC, which performs Bayesian fitting (and a lot more) in Python. Now, I would like to mention the MIT-licensed emcee package. It is very easy to install and can be readily used for simple regression fitting, which is my everyday practice.

Just check the link, and I hope it is clear enough.

## Accuracy of curve fitting using Bayesian method

In an earlier post, I explained how a Bayesian method can replace the standard chi-square technique. Since a Bayesian method is solely based on statistics, having very few data points has a significant consequence for the fit quality. To evaluate how reproducible the fit parameters are (for a parabolic function), I performed the following test: I created 10 data points and assumed given values for {a, b, c}. The resulting y-values formed a parabola. I added normal noise to both the x and y axes, then let PyMC run the Bayesian fit 100 times, which gives 100 solutions for each parameter. I repeated the identical experiment, but this time with 100 data points rather than 10. The following plot shows the scatter of the solutions for parameter b.

Blue and red circles correspond to solutions for samples with 10 and 100 data points, respectively. 4.5 is the true b value (thick horizontal line).

It is clear that with more input information, we can better constrain the parameters. In this particular example, the solutions with ten data points (blue circles) returned on average b = 4.39 ± 0.78, while the ones with a hundred data points (red circles) returned a more accurate solution of b = 4.41 ± 0.25.
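For illustration, here is a minimal, self-contained sketch of such a repetition test. It uses an ordinary least-squares polynomial fit (np.polyfit) as a stand-in for the PyMC model; the parameter values follow the post, but the code itself is my own reconstruction:

import numpy as np

true_p = np.array([-0.07, 4.5, 2.0])   # a, b, c; b = 4.5 is the value to recover

def run_trial(n_points, n_repeats=100):
    # fit the noisy parabola n_repeats times and collect the b estimates
    b_estimates = []
    for _ in range(n_repeats):
        x = np.linspace(-10, 10, n_points)
        y = np.polyval(true_p, x)
        x_noisy = x + np.random.normal(0., 1., n_points)
        y_noisy = y + np.random.normal(0., 1., n_points)
        coeffs = np.polyfit(x_noisy, y_noisy, 2)   # returns [a, b, c]
        b_estimates.append(coeffs[1])
    return np.mean(b_estimates), np.std(b_estimates)

print(run_trial(10))    # few points: large scatter in b
print(run_trial(100))   # more points: much tighter constraint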

## Python: array broadcasting

It is common practice to use loops in programming. But if you have a lot of big loops, you will find Python rather slow. A very efficient alternative is the so-called array broadcasting. For an introduction, see this page, for example.

The basic concept is to use column and row vectors to compute all intermediate values at the same time, resulting in a 2D array with all results. This way, you can get rid of almost all ordinary Python loops over arrays. The actual loops are executed by highly optimized C routines instead, with only a small overhead for calling them from inside the Python interpreter. So this should be almost as fast as plain C or Fortran code.

Example: I would like to evaluate the Planck function for N test data sets, each of which requires four evaluations of the function (for four wavelengths). In an elementary program, one can do this one by one, N times; this costs about 1.831 seconds of runtime. If I instead use the array broadcasting technique, it takes 0.013 seconds, some 141 times faster! The main trick is shown in the following example:

Ordinary loop (called N times):

for i in arange(0, num):
    param = fplk(frqs[i], temperature)

Vectorized version using array broadcasting (one single call, no loop at all):

frqs_row = frqs[numpy.newaxis]           # shape (1, M): a row vector
temp = temperature[:, numpy.newaxis]     # shape (N, 1): a column vector
output = fplk(frqs_row, temp)            # broadcasts to an (N, M) result

Note that the fplk function is identical in both cases!
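To make this concrete, here is a small self-contained example; the Planck function below (in its frequency form) is a hypothetical stand-in for the author's fplk routine, and the test frequencies and temperatures are my own choices:

import numpy as np

def fplk(nu, T):
    # Planck spectral radiance B_nu(T); purely element-wise operations,
    # so it broadcasts automatically over array-shaped nu and T
    h, c, k = 6.626e-34, 2.998e8, 1.381e-23
    return (2 * h * nu**3 / c**2) / (np.exp(h * nu / (k * T)) - 1.0)

frqs = np.array([4.3e14, 5.5e14, 6.7e14, 7.5e14])   # four test frequencies [Hz]
temperature = np.linspace(3000., 8000., 1000)       # N = 1000 temperatures [K]

# row vector (1, 4) against column vector (1000, 1) -> one (1000, 4) result
output = fplk(frqs[np.newaxis, :], temperature[:, np.newaxis])
print(output.shape)   # (1000, 4)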