Feeds:
Posts

## EMCEE: yet another MCMC fitting in Python

so far, I have introduced PYMC, which performs Bayesian fitting (and a lot more) in Python. Now, I would like to note the EMCEE package developed at MIT. It is very easy to install and can be readily used for simple regression fitting, which is my everyday practice.

juts check the link and I hope it is clear enouph.

## Accuracy of curve fitting using Bayesian method

In an earlier post, I have explained how a Bayesian based method can replace standard chi-square technique. Since a Bayesian method is solely based on statistics, having a very few data points then has a significant consequence on the fit quality. To evaluate how reproduceble are the fit parameters (for a parabolic function), I performed the following test: I created 10 data points and assumed given values for {a, b, c}. The resulting y-values then formed a parabola. I added normal noise to both x and y axis. I let the PyMC run the Bayesian fit for 100 times. That means we have 100 solutions for each parameter. Then I repeated the identical experiment but this time, I assumed 100 data points rather than 10. The following plot show the scatter of data points for parameter b. Blue and red circles correspond to solutions for samples with 10 and 100 data points, respectively. 4.5 is the true b value (thick horizontal line).

It is clear that with more input information, we can better constrain the parameters. In this particular example, the solutions with ten data points (blue circles) returned on average b=4.39±0.78 while the ones with hundred data points (red circles) return a more accurate solution of b=4.41±0.25.

## Pymc: Bayesian fit for Python

Using a Bayesian fit is totally different from a  least-squared fit. Doing it is also more complicated.  In a chi-squared fit, we minimize a merit function. In a Bayesian fit, we have a set of priors, and a set of observations. To fit a model to those observations, we calculate a likelihood function. Unlike a least-squared problem, here basically each variable has a distribution. and we find the peak of distribution as the solution for unknown parameters. It is way better to read a review or something like this nice paper:

Well, now that we know what is it, how can we do simplest things with that: i.e., fitting a straight line to some data points? Pymc made it easy. It is a python package which contains three different solvers for Bayesian statistics including a Markov chain Monte Carlo (MCMC) estimator.

You can not only use it to do simple fitting stuff like this, but also do more complicated things.  However, I try to show some simple examples of its usage and comparison to a traditional fit in a separate post. Basically I compare fitting a parabola using chi-square and Bayesian method.

save the following in a file (I call it test.py):

#———————————————————–
import numpy, pymc

# create some test data
x = numpy.arange(100) * 0.3
f = 0.1 * x**2 – 2.6 * x – 1.5
numpy.random.seed(76523654)
noise = numpy.random.normal(size=100) * .1     # create some Gaussian noise
f = f + noise                                # add noise to the data

z = numpy.polyfit(x, f, 2)   # the traditional chi-square fit
print ‘The chi-square result: ‘,  z

#priors
sig = pymc.Uniform(‘sig’, 0.0, 100.0, value=1.)

a = pymc.Uniform(‘a’, -10.0, 10.0, value= 0.0)
b = pymc.Uniform(‘b’, -10.0, 10.0, value= 0.0)
c = pymc.Uniform(‘c’, -10.0, 10.0, value= 0.0)

#model
@pymc.deterministic(plot=False)
return a*x**2 + b*x + c

#likelihood
y = pymc.Normal(‘y’, mu=mod_quadratic, tau=1.0/sig**2, value=f, observed=True)
#———————————————————–

Now, go to command line and run the following (or alternatively put them in a file):

import pymc, test              # load the model file
R = pymc.MCMC(test)    #  build the model
R.sample(10000)              # populate and run it
print ‘a   ‘, R.a.value        # print outputs
print ‘b    ‘, R.b.value
print ‘c    ‘, R.c.value
The output looks like this:

The chi-square result: [ 0.0999244  -2.59879643 -1.49278601]
Sampling: 100%  Iterations: 10000
a    0.0981867109293
b     -2.54514077442
c     -1.77016283024

The correct results are a=0.1, b=2.6, and c=1.5. Both methods gave satisfactory results. Indeed, if you increase number of iterations from 10000 to one million, the answer will be like this:

Sampling: 100%  Iterations: 1000000
a    0.0997945467529
b     -2.59731835923
c     -1.47308938314

As you see, the results are more accurate compared to the former case.

If you like to have more information, you can use the following command:

print ‘a   ‘, R.a.stats()

a    {‘95% HPD interval’: array([ 0.09956635,  0.10027448]), ‘n’: 1000000, ‘quantiles’: {2.5: 0.099545071298350371, 25: 0.099805050609863097, 50: 0.099919386231275664, 75: 0.10003496316945379, 97.5: 0.10025986774435611}, ‘standard deviation’: 0.0050467522199834905, ‘mc error’: 0.0003721035009025152, ‘mean’: 0.099548164235412406}

Recall that in  a Bayesian fit, each variable has a distribution. The following plot shows the data points as well as the two Bayesian fits. To compare it with a least-square fit, I repeated the experiment with a sample data which has more noise. The resulting fits are compared in the following panel. There are much more you can learn from the examples of Pymc. I hope in a future post, I can explain other types of fit, like a weighted-least-square fit or a bisector fit.