Statistics Hands-on session for Lecture 1 - exercises solved

Students are required to hand in the exercises and questions below in the form of a SWAN notebook or pdf file (with code snippets) by Wednesday, April 25. Please send the material to Anna Sfyrla (anna.sfyrla@cern.ch) and T.J. Khoo (Teng.Jian.Khoo@cern.ch).

Likelihood fit to data

You have a dataset from which you have constructed the di-photon invariant mass. You know that these data contain (after analysis selections) di-photon QCD events (which we will call 'background') and di-photon pairs from a known particle with a mass of 50 GeV (which we will call 'signal'). You can find the dataset constructed with some luminosity L, and the dataset constructed with 10x the luminosity (referred to as the "high-stat sample" in what follows), in the directory files.

In the ROOT files, you can find four TTrees. In this first exercise we will use one of them: Sample1a, which corresponds to background plus signal. The TTree Sample1b also corresponds to background plus signal; you will be asked to evaluate the differences as an exercise.

Data analysis with ROOT

For a gentle introduction, let's make sure we understand what we are doing when we fit a function just using basic ROOT. First, we retrieve the data from the file and plot it.

In [ ]:
# Do some preliminary setup that we don't need to repeat
import ROOT
ROOT.enableJSVis() # This makes the plot canvases interactive
c = ROOT.TCanvas('myrootplot','My ROOT plot',800,600)
In [ ]:
inputfile = ROOT.TFile('files/ttreedata.root')
inputtree = inputfile.Get('Sample0')

# Draw the variable x from the tree into a 100-bin histogram called "h_x"
inputtree.Draw('x>>h_x(100,0,200)','','ep')
hist = ROOT.gDirectory.Get('h_x') # Retrieve the histogram from the current directory
c.Draw()

This looks like an exponentially falling distribution, so we can define a corresponding function to fit it using ROOT, and draw this on the plot. Try adjusting the parameters until you get something that reasonably approximates the data.

In [ ]:
f_expo = ROOT.TF1('expo','[0]*exp([1]*x)',0,200)
f_expo.SetParameter(0,70)
f_expo.SetParameter(1,-0.01)
hist.Draw()
f_expo.Draw('same')
c.Draw()

We really want a more scientific i.e. quantitative method to determine the goodness of the fit. One of the best known tests is the $\chi^2$ (or least squares) test, defined as: $$ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{\sigma(E_i)^2}, $$ where $O_i$ signifies one observation, $E_i$ signifies one expectation value, and $\sigma(E_i)$ is the uncertainty on $E_i$. In our case, we have a histogram, so each expectation value is the value of the function at the centre of the corresponding bin, and the appropriate uncertainty is $\sqrt{E_i}$.

For the purposes of comparison, ROOT's default fit uses a modified $\chi^2$ in which the observed uncertainties are used instead, so let's define: $$ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{\sigma(O_i)^2}. $$

Define a function that takes a TF1 and a TH1 as input, and returns the $\chi^2$ value.

Question: How does the $\chi^2$ change as you pick "better" and "worse" parameters?

In [ ]:
def getChi2(function,hist):
    chi2 = 0
    fullbins = 0
    for ibin in range(1,hist.GetNbinsX()+1):
        observation = hist.GetBinContent(ibin)
        binCentre = hist.GetXaxis().GetBinCenter(ibin)
        expectation = function.Eval(binCentre)
        if observation<1e-9 or expectation<1e-9: continue
        fullbins += 1
        delta = observation - expectation
        chi2 += delta*delta / observation
    if fullbins == 0: chi2 = 1e9 # penalise the case where every bin was skipped (prediction ~0 everywhere)
    return chi2
In [ ]:
f_expo.SetParameter(0,60)
f_expo.SetParameter(1,-0.02)
c.Draw()

print getChi2(f_expo,hist)

Of course, you already knew that smaller is better for the $\chi^2$, as it's a sum of the squared residuals, weighted by the (expected) variance of each observation. So we can determine the best fit parameters by minimising the $\chi^2$ -- let's do this by brute force, and just scan over some reasonable parameter values.

Exercise 1:
Plot the value of $\chi^2$ as you vary the normalisation factor (parameter 0) and decay constant (parameter 1) in the exponential function. What values minimise the $\chi^2$? How long does it take to carry out this minimisation for a reasonable level of precision?


===> Below is the solution to exercise 1

In [ ]:
####### Exercise 1 

# Let's fill a 2D histogram with the chi2 values for a range of hypotheses
Nbins = 100
binrangex = (1,101)
binrangey = (-0.05,0.)
chi2_vs_pars = ROOT.TH2D('chi2_vs_pars','Chi2 for choices of fn pars',
                         Nbins,binrangex[0],binrangex[1],
                         Nbins,binrangey[0],binrangey[1])
chi2_vs_pars.SetXTitle('Normalisation factor')
chi2_vs_pars.SetYTitle('Decay constant')
# Use numpy to get a float range
chi2min = 1e20
pars_chi2min = (0.,0.)
import time
starttime = time.time()
print 'Begin chi2 minimisation scan'
from numpy import arange
for f_norm in arange(binrangex[0],binrangex[1],(binrangex[1]-binrangex[0])/Nbins):
    for f_decay in arange(binrangey[0],binrangey[1],(binrangey[1]-binrangey[0])/Nbins):
        f_expo.SetParameter(0,f_norm)
        f_expo.SetParameter(1,f_decay)
        #print f_norm, f_decay, getChi2(f_expo,hist)
        chi2 = getChi2(f_expo,hist)
        chi2_vs_pars.Fill(f_norm,f_decay,chi2)
        if chi2<chi2min:
            pars_chi2min = (f_norm,f_decay)
            chi2min = chi2
endtime = time.time()
print 'Finished chi2 minimisation scan after {0} seconds.'.format(endtime-starttime)
chi2_vs_pars.Draw('colz')
# Do some formatting
chi2_vs_pars.SetContour(100)
chi2_vs_pars.GetZaxis().SetRangeUser(chi2min,100*chi2min)
#ROOT.gStyle.SetPalette(ROOT.kTemperatureMap)
# Mark the best fit point
bestfit = ROOT.TGraph(1)
bestfit.SetPoint(0,pars_chi2min[0],pars_chi2min[1])
bestfit.SetMarkerStyle(ROOT.kMultiply)
bestfit.SetMarkerColor(ROOT.kRed)
bestfit.Draw('psame')
#
c.SetLogz()
c.Draw()

print 'Minimum chi2 found: {0:.2f}'.format(chi2min)
print 'Chi2 min --> norm = {0:.2f}, decay const = {1:.3f}'.format(pars_chi2min[0],pars_chi2min[1])

Fitting functions by minimising the goodness of fit is something that is baked into much of our analysis software, so let's try using the built-in ROOT functions to attempt the same thing. We can just use TH1::Fit().

In [ ]:
# Reset to sensible function parameters before fitting
f_expo.SetParameter(0,70)
f_expo.SetParameter(1,-0.02)
hist.Fit(f_expo) # providing a second argument 'P' will make ROOT use the expected uncertainty in the chi2 instead
print 'Chi2 after ROOT fit:', getChi2(f_expo,hist)
print 'ROOT Fit --> norm = {0:.2f}, decay const = {1:.3f}'.format(f_expo.GetParameter(0),f_expo.GetParameter(1))

Now see what happens if the minimisation is done not by a $\chi^2$ minimisation (the default option) but by a likelihood fit:

In [ ]:
hist.Fit(f_expo, 'L') 
print 'Chi2 after ROOT fit:', getChi2(f_expo,hist)
print 'ROOT Fit --> norm = {0:.2f}, decay const = {1:.3f}'.format(f_expo.GetParameter(0),f_expo.GetParameter(1))

Question 1:
How did you do? How long did ROOT take to minimise the $\chi^2$ in comparison to the brute-force scan, and how do the $\chi^2$ values compare? Estimate the probability that corresponds to these $\chi^2$ values; how does it depend on the number of bins considered in this exercise?
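To convert a $\chi^2$ value into a probability, ROOT provides TMath::Prob, which returns the p-value for a given $\chi^2$ and number of degrees of freedom. A minimal sketch (chi2val and ndf are placeholder values; the number of degrees of freedom is the number of fitted bins minus the number of free parameters):

In [ ]:
# Hypothetical example: p-value for chi2 = 110 with 100 bins and 2 free parameters
chi2val = 110.
ndf = 100 - 2
print 'p-value for chi2 = {0}, ndf = {1}: {2:.3g}'.format(chi2val,ndf,ROOT.TMath.Prob(chi2val,ndf))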

Data in RooFit and a simple fit

Now, let's pick up RooFit, which is the specialised fitting/probability/statistics module in ROOT.

We will first see how to define RooFit workspaces. Having already opened the ROOT file, we will create a RooWorkspace, define the observable x in it, and build a RooDataSet from the tree we loaded above.

In [ ]:
# Always start by importing ROOT & RooFit
import ROOT.RooFit
In [ ]:
w = ROOT.RooWorkspace()
w.factory('x[0.,200.]') # Create a RooRealVar in range [0,200]
x = w.var('x')          # Retrieve the RooRealVar as a python object
# Have to pass the variable as a RooArgSet (a set of RooFit variables) because
# in principle we might be reading a multi-dimensional dataset
inputdata = ROOT.RooDataSet('data','data',inputtree,ROOT.RooArgSet(x))

We want to plot the distributions we deal with, but the syntax is a bit different in RooFit compared to vanilla ROOT. We generate a "frame" that will display plots with axes in the variable "x", and then plot the data on the frame. Note: if you want to modify details of the plot style, this is typically done by passing arguments to plotOn.

In [ ]:
# Plot a la RooFit
xframe = x.frame() # Generates a new frame for plotting
inputdata.plotOn(xframe)
xframe.Draw()
c.Draw()
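As noted above, details of the plot style are controlled by passing named arguments to plotOn. A minimal sketch using standard RooFit options (the binning and marker colour chosen here are arbitrary):

In [ ]:
# Example of styling options passed to plotOn (values are arbitrary)
xframe_styled = x.frame()
inputdata.plotOn(xframe_styled,ROOT.RooFit.Binning(50),ROOT.RooFit.MarkerColor(ROOT.kBlue))
xframe_styled.Draw()
c.Draw()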

Extending this, let's now add a fit function and plot this on top of the data.

We can continue to use the RooFit workspace factory to define the function and any necessary variables, reproducing the one that we used in plain ROOT above. However, RooFit's exponential class does not let us set the normalisation: any pdf generated from it is automatically normalised to unit integral over the range that we have set.

So we need two steps. First, we generate the exponential distribution in the observable $x$ with a decay constant $\tau$. Then we create an "extended PDF", which carries a normalisation parameter, by building a sum over just this one pdf multiplied by a yield variable. Finally, we retrieve it as a python object via the RooWorkspace::pdf() function.

In [ ]:
w.factory('Exponential::expo(x,tau[-0.1,-0.2,0])')
w.factory('SUM::bkg(bkgnorm[1.0,0,1000]*expo)')
bkgpdf = w.pdf('bkg')

Now, we want to fit the function and plot it. Fitting is "backwards" from ROOT, in that we call fitTo on the function, giving it the data, rather than calling Fit on a histogram.

Plotting, however, is no more difficult!

In [ ]:
xframe = x.frame() # Generates a new frame for plotting
inputdata.plotOn(xframe)

# Now fit the background pdf to the sample data, saving the fit result
fitresult = bkgpdf.fitTo(inputdata,ROOT.RooFit.Save(),ROOT.RooFit.PrintLevel(-1))
fitresult.Print() # Print fit result
# Add the fitted function to the plot
bkgpdf.plotOn(xframe)
xframe.Draw()
c.Draw()

Now we can compare the parameters extracted by RooFit to those by ROOT. What does this tell you about how RooFit is handling the fit? Does it look like a $\chi^2$ minimisation?

We can also print out the $\chi^2$ value extracted by RooFit, but this is a property of the frame and not of the function or of the fit result...

In [ ]:
print 'Fit chi2: {0:.3g}'.format(xframe.chiSquare())
print 'tau : {0:.3g} +- {1:.3g}'.format(w.var('tau').getVal(),w.var('tau').getError())

Recall that we defined a likelihood as a function that describes how well a prediction agrees with the data as a function of the model parameters. Using RooFit, it is easy to retrieve the negative log-likelihood (NLL), which is exactly what it sounds like, starting from the pdf and some data.

We will be using the NLL in later material, so let's already see how it looks:

In [ ]:
bkgnll = bkgpdf.createNLL(inputdata)
tau = w.var('tau')
tauframe = tau.frame() # Generates a new frame for plotting
bkgnll.plotOn(tauframe)
tauframe.Draw()
c.Draw()

Exercise 2:
Using a similar methodology as above, fit Sample1a with the sum of an exponential and a gaussian. Find the fitted parameters and the $\chi^2$ of the fit. How does the result change if Sample1b is used instead?

Hint: Don't forget to account for the signal & background normalisations in the fit.

Bonus Exercise:
Can you now fit Sample2? Repeat this for the high-stat sample.

===> Below is the solution to exercise 2

In [ ]:
####### Exercise 2 ; solved for low stat sample 1a

inputtree1a = inputfile.Get('Sample1a')
inputtree1a.Draw('x')
inputdata = ROOT.RooDataSet('data','data',inputtree1a,ROOT.RooArgSet(x))
bsframe = x.frame() # Generates a new frame for plotting
inputdata.plotOn(bsframe)


w.factory('Gaussian::sig(x,mu[50,20,150],sigma[5,0.,20.])')
w.factory('SUM::splusb(Nsig[50,0.,1000]*sig,Nbkg[1000,0.,100000]*bkg)')
sbpdf = w.pdf('splusb')
fitresult = sbpdf.fitTo(inputdata,ROOT.RooFit.Save(),ROOT.RooFit.PrintLevel(-1))
fitresult.Print() # Print fit result
sbpdf.plotOn(bsframe)
# We can separately plot the components of the model
sbpdf.plotOn(bsframe,ROOT.RooFit.Components('bkg'),ROOT.RooFit.LineStyle(ROOT.kDashed))
sbpdf.plotOn(bsframe,ROOT.RooFit.Components('sig'),ROOT.RooFit.LineColor(ROOT.kRed))
bsframe.Draw()
c.Draw()

Toy-event generation

We often need to generate toy datasets that follow distributions of real data; these pseudodata can be used to test the quality of a fitting procedure, for example.

We can generate pseudodata with a model we extract (e.g. the signal plus background model in Exercise 2) and see whether we can reproduce the same result with adequate statistical precision.

Let's do that for the example of the background fit above.

In [ ]:
c2 = ROOT.TCanvas("myCanvasName","The Canvas Title")

hists = {}
hists['tau'] = ROOT.TH1D('h_tau','tau expo',100,-0.05,0.)
hists['tau_error'] = ROOT.TH1D('h_tau_error','tau expo error',100,0,0.002)
hists['tau_pull'] = ROOT.TH1D('h_tau_pull','tau expo pull',60,-3.,3.)

We will want to generate $N_\mathrm{toys}$ toy datasets, each having the same size as our input sample. From each pseudodata sample, we will re-fit the parameters, which tells us how much the data could vary statistically under the assumption of our model.

In [ ]:
Ntoys = 1000
Nevents = inputtree.GetEntries()

# Re-create the background-only dataset, since inputdata was overwritten by the exercise 2 solution
inputdata = ROOT.RooDataSet('data','data',inputtree,ROOT.RooArgSet(x))
# Redo the fit, as the NLL computation moves tau around.
fitresult = bkgpdf.fitTo(inputdata,ROOT.RooFit.Save(),ROOT.RooFit.PrintLevel(-1))
tau_initial = w.var('tau').getVal()
print 'Initial Value for tau:',tau_initial
In [ ]:
for counter in xrange(Ntoys):
    w.var('tau').setVal(tau_initial)
    toydata = bkgpdf.generate(ROOT.RooArgSet(x),Nevents)
    bkgpdf.fitTo(toydata,ROOT.RooFit.PrintLevel(-1))
    hists['tau'].Fill(w.var('tau').getVal())
    hists['tau_error'].Fill(w.var('tau').getError())
    # The pull of the fit -- should be centred at 0 with unit width
    # if the fit is unbiased and the uncertainty is correctly estimated
    pull = (w.var('tau').getVal() - tau_initial)/w.var('tau').getError()
    hists['tau_pull'].Fill(pull)
    if counter%100 == 0:
        print 'Going through toy number...',counter

Now we plot the distributions of the parameters fitted in each pseudo-experiment. Ideally, the "pull", i.e. the deviation of the fitted parameter $\tau$ from its reference value divided by the fitted uncertainty, should follow a standard normal distribution (mean 0, width 1).

In [ ]:
c2.Divide(3,1)
c2.cd(1)
hists['tau'].Draw()
c2.cd(2)
hists['tau_error'].Draw()
c2.cd(3)
hists['tau_pull'].Draw()
c2.Draw()
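To quantify how close the pull distribution is to a standard normal, we can fit it with a gaussian and check that the fitted mean is compatible with 0 and the width with 1. A minimal check using ROOT's built-in 'gaus' function:

In [ ]:
# Fit the pull distribution with a gaussian and print the fitted mean and width
hists['tau_pull'].Fit('gaus')
pullfunc = hists['tau_pull'].GetFunction('gaus')
print 'Pull mean : {0:.3f} +- {1:.3f}'.format(pullfunc.GetParameter(1),pullfunc.GetParError(1))
print 'Pull width: {0:.3f} +- {1:.3f}'.format(pullfunc.GetParameter(2),pullfunc.GetParError(2))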

Question 2:
What happens if the initial value for tau is not reset at every toy generation?

Question 3:
What happens if we generate more toys than Ntoys? What happens if we generate more events than Nevents?

===> Answer to Question 2: the fitted tau drifts away from its starting point; test it yourselves!

===> Answer to Question 3: the more toys the better. Nevents should be Poisson-distributed around the number of events observed in data, to give a realistic representation of the statistical fluctuations of your dataset.
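A minimal sketch of how the toy loop above could be modified to fluctuate the sample size Poisson-wise around Nevents, using ROOT's global gRandom (only the generation line changes; remember to reset the histograms before refilling them):

In [ ]:
# Sketch: draw each toy's sample size from a Poisson with mean Nevents
for counter in xrange(Ntoys):
    w.var('tau').setVal(tau_initial)
    nev_toy = ROOT.gRandom.Poisson(Nevents) # fluctuated sample size
    toydata = bkgpdf.generate(ROOT.RooArgSet(x),nev_toy)
    bkgpdf.fitTo(toydata,ROOT.RooFit.PrintLevel(-1))
    hists['tau_pull'].Fill((w.var('tau').getVal() - tau_initial)/w.var('tau').getError())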

Exercise 3:
Examine with toys the signal plus background fit tested in Exercise 2.

Exercise 4:
How does the situation change in all the above if you look at the high-stat sample?

===> This is exactly the same as the above, except using sbpdf rather than bkgpdf.
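For concreteness, a minimal sketch of the toy study for the signal-plus-background model, tracking the pull of the fitted signal yield Nsig (the histogram h_nsig_pull and dataset inputdata1a are defined here just for this sketch; in a full study you would reset all floating parameters before each toy fit):

In [ ]:
# Toy study for the signal+background model: pull of the fitted signal yield
inputdata1a = ROOT.RooDataSet('data1a','data1a',inputtree1a,ROOT.RooArgSet(x))
sbpdf.fitTo(inputdata1a,ROOT.RooFit.PrintLevel(-1))
nsig_initial = w.var('Nsig').getVal()
nevents1a = inputtree1a.GetEntries()
h_nsig_pull = ROOT.TH1D('h_nsig_pull','Nsig pull',60,-3.,3.)
for itoy in xrange(Ntoys):
    w.var('Nsig').setVal(nsig_initial)
    toydata = sbpdf.generate(ROOT.RooArgSet(x),nevents1a)
    sbpdf.fitTo(toydata,ROOT.RooFit.PrintLevel(-1))
    h_nsig_pull.Fill((w.var('Nsig').getVal() - nsig_initial)/w.var('Nsig').getError())
h_nsig_pull.Draw()
c.Draw()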

Central Limit Theorem

Recall that in the lectures we covered the CLT, stating that the sample mean for multiple measurements of a random variable tends to a normal distribution as the sample size grows. This is true provided that the variable's probability distribution satisfies various criteria, including having a well defined and finite mean and variance.

We will now investigate how fast this convergence occurs -- in other words, how many measurements are required before the result is suitably gaussian. To do so, we need to generate (pseudo)random numbers using the ROOT random number generator, specifically ROOT::TRandom3.

Question: Why do you think we specify TRandom3? What might be distinguishing properties and drawbacks of different RNGs?

In [ ]:
# Generate a uniformly-distributed dataset using TRandom3
randgen = ROOT.TRandom3()

# Define histograms that will hold averages over a number of values
nbins = 300
xmin = -3.
xmax = 3.
hist1 = ROOT.TH1F('hist1','Hist of 1 value',nbins,xmin,xmax)
hist2 = ROOT.TH1F('hist2','Hist averaging 2 values',nbins,xmin,xmax)
hist5 = ROOT.TH1F('hist5','Hist averaging 5 values',nbins,xmin,xmax)
hist10 = ROOT.TH1F('hist10','Hist averaging 10 values',nbins,xmin,xmax)
hist20 = ROOT.TH1F('hist20','Hist averaging 20 values',nbins,xmin,xmax)
hist100 = ROOT.TH1F('hist100','Hist averaging 100 values',nbins,xmin,xmax)


# Just for illustration
colours = {
    1: ROOT.kBlack,
    2: ROOT.kBlue,
    5: ROOT.kGreen+2,
    10:ROOT.kViolet,
    20:ROOT.kRed,
    100:ROOT.kGray+2
}

# Collect the histograms into a dict for looping and easy retrieval
hists = {
    1:hist1,
    2:hist2,
    5:hist5,
    10:hist10,
    20:hist20,
    100:hist100
}

In a loop, generate random numbers with the TRandom3 generator, keeping a running list of the most recent values, over which we take averages and fill the histograms defined above. For easier comparison later, we also rescale each average by sqrt(n) -- make sure you understand why, from the CLT definition.

To do this in a pythonic way and avoid needless use of computing resources, we do not save all of the numbers we generate, but only the shortest list needed to compute all of the averages that we will use. Note the use of xrange, numpy, pop, and list slicing, all of which you may find useful in other python exercises.

Be warned: this may take a while to run if you pick a large Nvalues.

In [ ]:
Nvalues = 1000000
last100 = []
print 'Filling histograms looping over {0} values'.format(Nvalues)
from numpy import mean
from math import sqrt,pi
for i in xrange(Nvalues):
    f = randgen.Uniform(-1,1)
    last100.append(f)
    hist1.Fill(f)
    if len(last100)>100:
        last100.pop(0)
    for ntosum in sorted(hists.keys()):
        if len(last100)>=ntosum:
            hists[ntosum].Fill(sqrt(ntosum)*mean(last100[-1*ntosum:]))

print 'Done histogramming'
print '------------------'

Now we have our histograms ready to draw, but before we do this, we first define the gaussian distribution that we expect to reproduce with our measurement.

According to the CLT, an average over $n$ measurements should have standard deviation $\sigma/\sqrt{n}$, where $\sigma$ is the standard deviation of the variable distribution. So we need to compute the standard deviation of the uniform distribution between -1 and 1. Convince yourself that this is $1/\sqrt{3}$, from the formula $$ \sigma^2\left[f(x)\right] = \int x^2 \, f(x)\, dx - \mu^2 $$
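Explicitly, the uniform distribution on $[-1,1]$ has $f(x) = 1/2$ and $\mu = 0$, so $$ \sigma^2 = \int_{-1}^{1} x^2 \cdot \frac{1}{2}\, dx = \left[\frac{x^3}{6}\right]_{-1}^{1} = \frac{1}{3}, $$ i.e. $\sigma = 1/\sqrt{3}$.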

The gaussian distribution is normalised to unity if defined as follows: $$ G(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( \frac{-(x-\mu)^2}{2\sigma^2} \right) $$ so we can fill in the normalisation factor, scaling by the bin width in order to match the normalisation of the histograms when we normalise them to unity.

In [ ]:
var_uniform = 1./3 # 1/12 for a uniform distribution of width 1
sigma_uniform = sqrt(var_uniform)
gaus = ROOT.TF1('gaus','[0]*exp(-x^2/(2.*[1]))',xmin,xmax)
# scale by bin width
scalef = (xmax-xmin)/nbins
gausnorm = scalef / sqrt(2*pi*var_uniform)
gaus.SetParameter(0,gausnorm)
gaus.SetParameter(1,var_uniform)

Now we draw the gaussian and overlay the histograms.

In [ ]:
c = ROOT.TCanvas()

gaus.SetLineColor(ROOT.kOrange)
gaus.Draw()

# Now draw the histograms on top normalised to unity

for ntosum in sorted(hists.keys()):
    hists[ntosum].SetLineColor(colours[ntosum])
    hists[ntosum].DrawNormalized('histsame')

gaus.Draw('same')
    
c.Draw()

How good is our approximation? How many measurements do we need to average over before the average looks reasonably gaussian-distributed?

One thing to consider: What is the maximum absolute value that could enter the histogram for sample size $n$? What does this tell you about how many measurements you need to reproduce a 5$\sigma$ effect at a reliable rate?

We can get some numerical comparisons by just printing off the fraction of probability (events) that fall outside the $N\sigma$ range.

In [ ]:
for nsigma in range(1,6):
    print "Gauss fraction outside {0} sigma ({1:.2f}): {2:.5g}".format(nsigma,nsigma*sigma_uniform,ROOT.TMath.Erfc(nsigma/sqrt(2)))
    for ntosum in sorted(hists.keys()):
        thehist = hists[ntosum]
        firstbin = thehist.FindBin(-1*nsigma*sigma_uniform)
        lastbin = thehist.FindBin(nsigma*sigma_uniform)

        frac = 1. - thehist.Integral(firstbin,lastbin)/thehist.Integral()
        print "Histogram fraction for average of {0} values with |x| > {1} sigma: {2:.5g}".format(ntosum,nsigma,frac)
    print ""

Exercise 5:
What happens if, instead of a uniform distribution, you draw the random numbers from (1) an exponential distribution and (2) a Landau distribution? Discuss the differences in the context of the limitations of the CLT.

Hint: Remember to subtract the mean of the distributions when you define your averages, as this will make it easier to plot the distributions for comparison.

===> One can see from this exercise that random values drawn from a Landau distribution do not average to a gaussian, because of the long tail of the Landau distribution (its mean and variance are not finite, so the CLT does not apply). This is not the case for the exponential: it takes more values per average to converge, but a gaussian is what you eventually get.
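A minimal sketch of how the generation line in the loop above could be swapped to test this (the parameter values are arbitrary; for the exponential we subtract its mean as the hint suggests, while for the Landau, whose mean is undefined, the location parameter can be used as the reference instead):

In [ ]:
# Sketch: replace randgen.Uniform(-1,1) in the loop above with one of these
f = randgen.Exp(1.) - 1.          # (1) exponential with mean 1, shifted to mean 0
# f = randgen.Landau(0.,1.)       # (2) Landau with location 0, scale 1 (mean undefined)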