Statistics Hands-on session for Lecture 2

Students are required to hand in the exercises and questions below in the form of a SWAN notebook or pdf file (with code snippets) by Wednesday, May 2. Please send the material to Anna Sfyrla (anna.sfyrla@cern.ch) and T.J. Khoo. (Teng.Jian.Khoo@cern.ch).

Hypothesis testing without systematic uncertainties

Last time, we familiarised ourselves with the procedures for fitting and estimating parameter values. We will now apply these techniques to the goal of hypothesis testing, that is comparing two hypotheses to determine which is preferred by the data. For now, we neglect systematic uncertainties, as these will be covered in the last lecture.

The statistical test determines whether the data are compatible with the "null hypothesis"; a statistically significant deviation is usually the interesting outcome. Such a deviation is established if the data show much better consistency with the "alternative hypothesis".

We will carry out two hypothesis tests:

  • a simple "cut-and-count" analysis, where we make a selection in the data and simply model the number of events expected in this selection; and
  • an explicit fit of a Gaussian signal on top of an exponential background, extending the fits you carried out in the previous exercises.

Note that this notebook won't run all together this time; there are parts of the code for you to fill in (see "Q:").

As usual, we start by importing the ROOT and RooFit modules and extracting the sample data, but this time we want to use Sample1a.

In [ ]:
import ROOT
from ROOT import RooFit
ROOT.enableJSVis()
inputfile = ROOT.TFile('files/ttreedata.root')
inputtree = inputfile.Get('Sample1a')
Nevents = inputtree.GetEntries()
c = ROOT.TCanvas('plot','plot',800,600)

Then, as before, we define the variable x and generate a dataset from the tree.

In [ ]:
w = ROOT.RooWorkspace()
w.factory('x[0.,200.]')
x = w.var('x')
inputdata = ROOT.RooDataSet('data_1a','data_1a',inputtree,ROOT.RooArgSet(x))

Hypothesis testing with a cut & count analysis

Before manipulating any of the data this time, let's focus on the model of signal + background that we want to use. For a cut & count analysis, we have the following basic parameters:

  • Number of background events passing selection
  • Number of signal events passing selection

The data will tell us Nobs which, if our model is correct, is Poisson-distributed with mean Nexp = (Nsig+Nbkg). We can therefore construct the relevant pdf, building the sum Nexp as the Poisson mean.
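As a sanity check, the Poisson pmf that this model encodes can be evaluated with the Python standard library alone (the counts below are hypothetical placeholders, not values from the data):

```python
import math

def poisson_pmf(n, mu):
    """P(N = n) for a Poisson distribution with mean mu."""
    # lgamma keeps the computation numerically stable for large n
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))

# Hypothetical counts, purely for illustration
Nsig, Nbkg = 30.0, 100.0
Nexp = Nsig + Nbkg
print('P(Nobs = {0:.0f}) = {1:.4f}'.format(Nexp, poisson_pmf(130, Nexp)))
```

RooFit's Poisson pdf evaluates exactly this expression; writing it out once makes clear that Nsig and Nbkg enter only through the mean.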

In [ ]:
w.factory('sum::Nexp(Nsig[0,-100.,100],Nbkg[100,0,500])')
w.factory('Poisson::pois(Nobs[100,0,500],Nexp)')

Now we define the selection (i.e. the range of x in which we will count events).

Q: Define reasonable SR cut values based on the fits you performed in the homework exercise.

In [ ]:
##### Set the values of SRmin and SRmax here
SRmin = 1
SRmax = 2

To extract the background estimate without biasing ourselves based on the possibility of observing a signal, we can fit the background distribution to the sidebands. Here we generate the RooDataSet explicitly excluding the SR whose boundaries you chose.

In [ ]:
sidebanddata = ROOT.RooDataSet('sideband_1a','sideband_1a',inputtree,ROOT.RooArgSet(x),'x<{0} || x>{1}'.format(SRmin,SRmax))

xframe = x.frame()
sidebanddata.plotOn(xframe)
xframe.Draw()
c.Draw()

To carry out the background-only fit, we need to build the extended PDF, as we did in the first hands-on session.

Q: Go ahead and define extended pdfs not just for the background-only model, but also for the background + gaussian signal model that you produced in the homework. We will use the signal plus background pdf later. Some conventions to follow:

  • Call the background pdf bkg, the extended background pdf bkg_ext, with decay constant tau
  • Call the signal+background pdf splusb
  • Use the same signal normalisation variable Nsig as we did for the cut-and-count model (this is reasonable if our signal region selects most of the events in the signal peak region)
  • Use bkgnorm for the total number of background events in the entire range of x (remember that Nbkg is specifically the number of background events in the SR).
In [ ]:
##### Use the workspace factory to generate the relevant pdfs

Now we retrieve these from the workspace as python variables

In [ ]:
bkgpdf = w.pdf('bkg_ext')
sbpdf = w.pdf('splusb')
bkgnorm = w.var('bkgnorm')

As before, we can simply fit the plot, but now need to specify the allowed fit ranges. We can attach "ranges" to the variable x, so that we can exclude portions of the spectrum, e.g. to avoid unblinding the SR.

Beware: fitting more than once can corrupt the normalisation, causing it to increase every time the curve is plotted.

In [ ]:
# Limit the fit range, so that the empty SR bins don't influence the fit
x.setRange('sideband_low',0.,SRmin)
x.setRange('sideband_high',SRmax,200.)

# Now we explicitly run an extended fit
fitresult = bkgpdf.fitTo(sidebanddata,RooFit.PrintLevel(-1),
                         RooFit.Range('sideband_low,sideband_high'),
                         RooFit.Extended())

print('Extracted decay constant tau={0:.3g}, total background normalisation bkgnorm={1:.2f}'.format(w.var('tau').getVal(),bkgnorm.getVal()))
In [ ]:
# Add the data and fitted function to the plot
xframe = x.frame()
sidebanddata.plotOn(xframe)
# Plot function with dashed line in full fit range, and solid in fitted range (default)
bkgpdf.plotOn(xframe,RooFit.Range('Full'),RooFit.LineStyle(ROOT.kDashed))
bkgpdf.plotOn(xframe)
xframe.Draw()
c.Draw()

Having completed the sideband fit we can now estimate the normalisation in the signal region. To do so, we need to create an "integral" object over the region of interest, i.e. the SR -- so we need a range for that as well.

The RooFit syntax for generating integrals also requires us to specify over what range we want to normalise the integral (i.e. it will be 1 over this range). So we extract the fraction of the distribution that falls in the SR, and then are able to scale this to the total background normalisation that we got from the fit.
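The same fraction can be checked analytically for an exponential shape. The sketch below assumes the background pdf is proportional to $e^{-x/\tau}$ over the fitted range; if your factory expression uses a different sign convention for tau, adjust accordingly:

```python
import math

def expo_fraction(sr_min, sr_max, tau, x_min=0.0, x_max=200.0):
    """Fraction of a truncated exponential exp(-x/tau) falling in [sr_min, sr_max],
    normalised over [x_min, x_max] -- the same quantity createIntegral returns."""
    def integral(a, b):
        # Integral of exp(-x/tau) from a to b, up to a common factor of tau
        return math.exp(-a / tau) - math.exp(-b / tau)
    return integral(sr_min, sr_max) / integral(x_min, x_max)

# Hypothetical values: tau and the SR boundaries here are placeholders
print('{0:.3g}'.format(expo_fraction(20.0, 60.0, 50.0)))
```

Multiplying this fraction by the fitted total normalisation reproduces the SR background estimate by hand.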

In [ ]:
x.setRange('signalregion',SRmin,SRmax)
bkgint_SR = bkgpdf.createIntegral(ROOT.RooArgSet(x),ROOT.RooArgSet(x),'signalregion')
Nbkg = w.var('Nbkg')
Nbkg_cnc = bkgint_SR.getVal() * bkgnorm.getVal()
Nbkg.setVal(Nbkg_cnc)

print('Fraction of background pdf in the SR: {0:.3g}, total normalisation: {1:.2f}'.format(bkgint_SR.getVal(),bkgnorm.getVal()))
print('*** Number of fitted background events in SR: {0:.2f}'.format(Nbkg.getVal()))

For the cut-and-count analysis, we of course need to set the number of signal events. To test whether we have discovered a new particle, we use the following hypotheses:

  • Null hypothesis: "The SR contains only background events."
  • Alternative hypothesis: "Contributions other than background are present in the SR"

Now, set Nsig according to the null hypothesis.

Q: What value should Nsig take?

In [ ]:
Nsig = w.var('Nsig')

Knowing how much background we expect in the signal region, we can now compare that to what we see in data (Nobs), by extracting the number of events in our input tree satisfying the SR cuts.

In the signal region, we can now compare the number of total observed events and the number of total background events we had estimated with the fit. Following this comparison, we can estimate the p-value.

In [ ]:
inputtree.Draw('>>elist','x>{0} && x<{1}'.format(SRmin,SRmax))
elist = ROOT.gDirectory.Get('elist')
NSR = elist.GetN() # Note: this is a plain integer, which defines the lower bound of the p-value integration
print('*** Number of total (observed) events in SR: {0}'.format(NSR))
Nobs = w.var('Nobs')
Nobs.setVal(NSR)

# We can do the following if Nobs > Nbkg
Nsig_cnc = Nobs.getVal() - Nbkg.getVal()
print('A first estimate of the number of signal events: {0:.2f}'.format(Nsig_cnc))

Hopefully you got an encouraging result (i.e. a reasonably large Nsig value). But of course we wish to test this quantitatively with our Poisson background model, so let's conduct our hypothesis test.

In its simplest form, we can state the test in terms of the following question: what is the probability that the observed value Nobs originates purely from a statistical fluctuation of the background (the $p$-value)? This defines Nobs-Nbkg as the "test statistic".

To get a binary answer from our test, we have to establish the significance level $\alpha$ at which we will claim a statistically significant result. In HEP, the standard thresholds are conventionally quoted in Gaussian standard deviations: 3$\sigma$ for evidence and 5$\sigma$ for observation.
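These thresholds translate into one-sided tail probabilities via the Gaussian error function; a quick stdlib check:

```python
import math

def one_sided_p(z):
    """One-sided Gaussian tail probability for a significance of z sigma."""
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

print('3 sigma -> p = {0:.3g} (evidence)'.format(one_sided_p(3.0)))     # ~1.35e-3
print('5 sigma -> p = {0:.3g} (observation)'.format(one_sided_p(5.0)))  # ~2.87e-7
```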

To compute the (one-sided) $p$-value, we integrate the Poisson distribution corresponding to our null hypothesis (Nsig=0) from Nobs to infinity.
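This tail sum can be cross-checked without RooFit by accumulating the Poisson pmf term by term (the counts below are hypothetical, not your fit results):

```python
import math

def poisson_tail(mu, n_obs, n_max=2000):
    """p0 = P(N >= n_obs) for a Poisson distribution with mean mu."""
    log_p = -mu  # log of the pmf at n = 0
    total, n = 0.0, 0
    while n < n_max:
        if n >= n_obs:
            total += math.exp(log_p)
        n += 1
        log_p += math.log(mu) - math.log(n)  # recursion: p(n) = p(n-1) * mu / n
    return total

# Hypothetical background expectation and observed count
print('p0 = {0:.3g}'.format(poisson_tail(100.0, 130)))
```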

In [ ]:
# Get the value of the Poisson distribution for Nbkg, just for plotting purposes
pois = w.pdf('pois')
Nobs.setVal(Nbkg.getVal())
pois_Nbkg = pois.getVal(ROOT.RooArgSet(Nobs))

Nobs.setVal(NSR)
# Set a range to integrate over
# For discovery, we test the hypothesis that the SR is only populated by background.
# The discovery p-value p0 is the probability that the background has fluctuated up to the observed value.
Nobs.setRange('greaterThanObs',NSR,ROOT.RooNumber.infinity())
p0 = pois.createIntegral(ROOT.RooArgSet(Nobs),'greaterThanObs')
z = ROOT.RooStats.PValueToSignificance(p0.getVal())
# Analytically, this is z = sqrt(2)*erfInverse(1-2*p0)
print('Discovery p-value: {0:.3g}, significance = {1:.3f} sigma'.format(p0.getVal(),z))

# Explicitly set the number of bins in the frame
# Otherwise the plot values can go wrong, because the histogram gets rebinned
obsframe = Nobs.frame(500)
Nobs.setRange('toshade',NSR,500)
pois.plotOn(obsframe)
# We draw the pdf in the integrated range, shaded to show the integral.
# We need to renormalise to the integral, otherwise the pdf normalisation
# will be adjusted to unity
pois.plotOn(obsframe,RooFit.FillColor(ROOT.kRed),
            RooFit.DrawOption('BC'), RooFit.Range('toshade'),
            RooFit.Normalization(p0.getVal()))
# Redraw the full pdf
pois.plotOn(obsframe)
obsframe.Draw()
NSRarrow = ROOT.TArrow(Nbkg.getVal(),pois_Nbkg*0.2,Nbkg.getVal(),0)
NSRarrow.Draw()

c.Draw()

Question 1:
Calculate the two-sided p-value for the observation, and compare the value to the one-sided $p$-value. Explain qualitatively how the two differ, and what we learn from each one.

In the above example, we didn't actually bother comparing the null hypothesis with an explicit alternative hypothesis. Given that we are just trying to reject the null hypothesis, what alternative hypothesis could we compare to?

One reasonable choice is -- whatever the data prefers. In other words, we can compare the likelihood under the assumption of no background to that under the assumption that Nsig is equal to Nobs-Nbkg. Specifically, we can look at the distribution of the ratio of the two likelihoods -- this is the "likelihood ratio test". So we use the likelihood ratio $\Lambda(N_\mathrm{obs})/\Lambda(N_\mathrm{bkg})$ as the test statistic, rather than simply $N_\mathrm{obs} - N_\mathrm{bkg}$.

It turns out to be easier in RooFit to extract the negative log likelihood (NLL) than the likelihood function itself. Of course, we get back to the likelihood ratio by computing the difference in the logs ($\Delta$NLL). We can plot the NLL for our data as follows:
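For a single Poisson count the NLL can also be written down directly, which makes the $\Delta$NLL construction transparent. A stdlib sketch with hypothetical counts (constant terms in the NLL are dropped, as they cancel in the difference):

```python
import math

def nll(n_obs, n_bkg, n_sig):
    """Poisson NLL for one counting experiment, constant terms dropped."""
    mu = n_bkg + n_sig
    return mu - n_obs * math.log(mu)

# Hypothetical counts, purely to illustrate the construction
n_obs, n_bkg = 130.0, 100.0
n_sig_hat = n_obs - n_bkg  # MLE of Nsig in a counting experiment
q0 = 2.0 * (nll(n_obs, n_bkg, 0.0) - nll(n_obs, n_bkg, n_sig_hat))
z = math.sqrt(q0)          # asymptotic significance, since q0 ~ chi2 with 1 dof
print('q0 = {0:.2f}, z = {1:.2f} sigma'.format(q0, z))
```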

In [ ]:
Nobs_bkg = ROOT.RooDataSet('Nobs_bkg', 'Nobs_bkg', ROOT.RooArgSet(Nobs))
Nobs_bkg.add(ROOT.RooArgSet(Nobs))
nll_cnc = pois.createNLL(Nobs_bkg)
Nsigframe = Nsig.frame()

# We set the plotted min to 0, as we're only interested in differences wrt the min NLL
nll_cnc.plotOn(Nsigframe,RooFit.ShiftToZero())
Nsigframe.SetMaximum(10)
Nsigframe.SetMinimum(-1)
Nsigframe.Draw()
c.Draw()

We have the NLL, now let's extract the $\Delta$NLL, and determine the $p$-value for this. As we have determined that the signal is only positive, we want the test to be one-sided, so that we do not disfavour the null hypothesis if we observe fewer events in the signal region than our background prediction.

For how to do this, see https://arxiv.org/abs/1007.1727. Referring to section 2.3, we define our test statistic as $q_0 = -2 \ln \Lambda(0)$ when the fitted signal strength is positive, and 0 otherwise, where $\Lambda(0)$ is the likelihood ratio evaluated at zero signal. The pdf of $q_0$ is then approximately chi-square distributed with 1 dof (in the limit of large statistics).

In [ ]:
Nsig.setVal(0)
nll_0sig = nll_cnc.getVal()
Nsig.setVal(Nsig_cnc)
nll_obs = nll_cnc.getVal()

# We divide by two because under the null hypothesis half of the trials
# fluctuate downwards (giving q0 = 0), so only half of the chi2 tail
# probability contributes to the p-value.
p0_nosyst = ROOT.TMath.Prob( 2*(nll_0sig-nll_obs), 1)/2
print('P-value based on DeltaNLL = {0:.3g}'.format(p0_nosyst))

Question 2:
Rather than using the chi-square approximation, what other procedure could we use to determine the exact p-value with the NLL calculation? How does this result compare to the simple Poisson integration?

Hypothesis testing with a fitted background & signal model

Earlier, you defined a more specific signal model, which we can also use for hypothesis testing. How well do we expect this to behave compared to the cut & count analysis?

Let's now use the detailed signal+background model and fit this to the data.

In [ ]:
sbpdf = w.pdf('splusb')
fitresult = sbpdf.fitTo(inputdata,ROOT.RooFit.Save(),ROOT.RooFit.PrintLevel(-1))
fitresult.Print()
Nsig_fit = w.var('Nsig').getVal()

We will now construct the NLL plot and compare it to the one we retrieved from the cut & count procedure.

In [ ]:
nll_fit = sbpdf.createNLL(inputdata)
Nsigframe = Nsig.frame()
nll_cnc.plotOn(Nsigframe,RooFit.ShiftToZero(),RooFit.LineColor(ROOT.kRed))
nll_fit.plotOn(Nsigframe,RooFit.ShiftToZero())
Nsigframe.SetMaximum(10)
Nsigframe.SetMinimum(-1)
Nsigframe.Draw()

zeroline = ROOT.TLine(0,0,100,0)
zeroline.Draw()
c.Draw()

Exercise 1:
Using the NLL, calculate the p-value as extracted from the likelihood fit and compare it to the p-value from the cut & count experiment.

Bonus exercise:
What is the highest significance (smallest p-value) you are able to achieve with a cut-and-count analysis? Is it possible to get a higher significance than for the fit?

Exercise 2:
Consider now the sample1b in the input file. Repeat the procedure above and find the p-values for (1) cut & count and (2) a likelihood fit.

Confidence limits and parameter estimation

What if, instead of trying to establish how significant a signal is in the context of the background-only hypothesis, we have an expected signal, but wish to understand if this is larger than predicted due to the effects of some new physics?

For this we will determine confidence limits, i.e. the range of parameter values that would be compatible with the data in a fraction $\alpha$ of trials, where $\alpha$ is the chosen "confidence level". By convention, we typically "exclude" models whose predicted parameters lie outside the 95% confidence level range.

In this case, where we assume new physics must have a positive contribution, our null hypothesis now becomes a signal of a given size, while the alternative hypothesis is a smaller signal strength. Often you will find that the limits are set on a signal strength parameter $\mu$, scaled relative to a nominal cross-section value (which may be the SM or some other specific model).

We can work mostly with the models that we have already built. First, we illustrate two ways to approach the problem using just the naïve Poisson integral:

  1. Determine the value of Nsig that is exceeded by only 5% of trials, assuming the observed Nsig to be the true value. (Especially in the case where we have systematics, often we call this a "model-independent upper limit".)
  2. Scan the range of Nsig and find the value for which only 5% of trials would give the observed Nsig. (This is a model-dependent limit.)

Q: Do we expect these to give the same answer? Does this depend on the size of the expected signal?
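Both scans boil down to repeated Poisson tail sums. Here is a stdlib sketch of the second (model-dependent) method with hypothetical counts -- we exclude any Nsig for which fewer than 5% of trials would give a count as low as the one observed:

```python
import math

def poisson_cdf(mu, n_obs):
    """P(N <= n_obs) for a Poisson distribution with mean mu."""
    log_p, total = -mu, 0.0
    for n in range(int(n_obs) + 1):
        if n > 0:
            log_p += math.log(mu) - math.log(n)  # p(n) = p(n-1) * mu / n
        total += math.exp(log_p)
    return total

def upper_limit_95(n_obs, n_bkg, step=0.01):
    """Smallest Nsig for which P(N <= n_obs | Nbkg + Nsig) drops below 5%."""
    n_sig = 0.0
    while poisson_cdf(n_bkg + n_sig, n_obs) >= 0.05:
        n_sig += step
    return n_sig

# Hypothetical counts: observing exactly the background expectation,
# so the limit reflects only the Poisson spread of the background
print('95% CL upper limit on Nsig: {0:.2f}'.format(upper_limit_95(100, 100.0)))
```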

In [ ]:
# Finding Nsig such that p(Nobs>Nbkg+Nsig|Nexp=Nobs)<5%
Nobs.setVal(NSR)
Nsig.setVal(Nsig_cnc)
Nbkg.setVal(Nbkg_cnc)

# Use numpy to generate a float range
# Obviously the upper limit on Nsig has to be larger than the observed Nsig...
Nsig_95 = Nsig_cnc
from numpy import arange
for Nsig_hypo in arange(int(Nsig_cnc),100.,0.01):
    Nobs.setRange('greaterThanHypo',Nbkg_cnc+Nsig_hypo,ROOT.RooNumber.infinity())
    p1 = pois.createIntegral(ROOT.RooArgSet(Nobs),'greaterThanHypo')
    # Stop scanning once the tail probability drops below 5%
    if p1.getVal()<0.05:
        Nsig_95 = Nsig_hypo
        break

obsframe = Nobs.frame(500)
Nobs.setRange('toshade',Nbkg_cnc+Nsig_95,500)
pois.plotOn(obsframe)
# We draw the pdf in the integrated range, shaded to show the integral.
# We need to renormalise to the integral, otherwise the pdf normalisation
# will be adjusted to unity
pois.plotOn(obsframe,RooFit.FillColor(ROOT.kRed),
            RooFit.DrawOption('BC'), RooFit.Range('toshade'),
            RooFit.Normalization(0.05))
# Redraw the full pdf
pois.plotOn(obsframe)
obsframe.Draw()

c.Draw()

print('95% of trials assuming Nsig equal to the observed value {0:.2f} have Nsig<{1:.2f}'.format(Nsig_cnc,Nsig_95))
In [ ]:
# Finding Nsig such that p(Nobs>Nbkg+Nsig|Nexp=Nbkg+Nsig)<5%

Nobs.setVal(NSR)
Nbkg.setVal(Nbkg_cnc)

# Use numpy to generate a float range
# Obviously the upper limit on Nsig has to be larger than the observed Nsig...
Nsig_95 = Nsig_cnc
from numpy import arange
for Nsig_hypo in arange(int(Nsig_cnc),100.,0.01):
    Nsig.setVal(Nsig_hypo)
    p1 = pois.createIntegral(ROOT.RooArgSet(Nobs),'greaterThanObs')
    # Stop scanning once fewer than 5% of trials would give Nobs below the observed value
    if p1.getVal()>0.95:
        Nsig_95 = Nsig_hypo
        break

obsframe = Nobs.frame(500)
Nobs.setRange('toshade',0,Nobs.getVal())
pois.plotOn(obsframe)
# We draw the pdf in the integrated range, shaded to show the integral.
# We need to renormalise to the integral, otherwise the pdf normalisation
# will be adjusted to unity
pois.plotOn(obsframe,RooFit.FillColor(ROOT.kRed),
            RooFit.DrawOption('BC'), RooFit.Range('toshade'),
            RooFit.Normalization(0.05))
# Redraw the full pdf
pois.plotOn(obsframe)
obsframe.Draw()

c.Draw()

print('For Nsig={0:.2f}, fewer than 5% of trials would give an observed count below {1}'.format(Nsig_95,NSR))

Exercise 3:
We also have a powerful tool in the NLL (both for the cut & count and for the fit) -- determine the upper limit on the number of signal events using the NLL with the asymptotic approximation.


Exercise 4:
Instead of placing limits, perhaps we want to estimate the preferred range of a parameter (its estimated value is the MLE, but we want e.g. 1 sigma and 2 sigma uncertainties on it). Also using the NLL, determine the 68% and 95% confidence level range for the value of Nsig. Are these ranges uniquely defined, and if not, how do you define the "preferred range"?

Exercise 5:
Now, estimate the 68% and 95% CL ranges on $\mu$ using the NLL for the gaussian fit.

Multiple signal regions

Exercise 6:
Perform a hypothesis test combining the data from Sample1a and Sample1b, following the steps outlined below.

Step 1:

  • Extract the Sample1b tree and generate a corresponding dataset, plotting this data (also defined in terms of the real variable x).

Step 2:

  • Define a PDF that describes the event counts in both SRs simultaneously. For this you will of course need parameters that are independent of the model for the first SR (1a). If you wish, you can reuse the 1a model from earlier.
    • Plot your combined PDF as a function of the number of events observed in SR1a (this will generate a projection of the PDF onto the variable). Does this look as you expect?

Step 3:

  • Perform the background sideband fit on the Sample1b dataset to extract the background prediction for this SR, then with this plot the combined PDF vs the number of events observed in SR1b. Determine the number of signal events in SR1b, based on this background estimate.

Step 4:

Step 5:

  • Now we perform the hypothesis test.

    • We need to redefine our null hypothesis in this case -- stating that Nsig=Nsig1b=0. Exclusion limits are actually a little trickier, as they require a consistent signal definition across the two regions.

    • But for discovery, it's still quite easy for us to do the hypothesis test so long as we use the NLL. Simply integrating over the two Poisson distributions is more difficult, because one has to choose the region to integrate over, and it does not make sense to pick e.g. rectangular cuts.

    • Generate the NLL for the combined PDF, then draw it as a function of:

      1. the number of signal events in SR1a
      2. the number of signal events in SR1b
      3. in 2D as a function of both of the above
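As a guide (not a solution to the exercise): for two independent counting experiments the combined NLL is simply the sum of the per-region NLLs, and with two free signal parameters $q_0$ is asymptotically chi-square with 2 dof (ignoring the boundary-at-zero subtlety discussed above). A stdlib sketch with hypothetical counts:

```python
import math

def nll_poisson(n_obs, mu):
    """Poisson NLL for one count, constant terms dropped."""
    return mu - n_obs * math.log(mu)

# Hypothetical per-region observed and background counts
obs = {'SR1a': 130.0, 'SR1b': 95.0}
bkg = {'SR1a': 100.0, 'SR1b': 80.0}

def combined_nll(nsig_a, nsig_b):
    """Independent SRs: the combined NLL is the sum of the two."""
    return (nll_poisson(obs['SR1a'], bkg['SR1a'] + nsig_a)
            + nll_poisson(obs['SR1b'], bkg['SR1b'] + nsig_b))

# q0: NLL at zero signal minus NLL at the per-region MLEs, times two
q0 = 2.0 * (combined_nll(0.0, 0.0)
            - combined_nll(obs['SR1a'] - bkg['SR1a'], obs['SR1b'] - bkg['SR1b']))
# The chi2 survival function with 2 dof is exactly exp(-q0/2)
p0 = math.exp(-q0 / 2.0)
print('q0 = {0:.2f}, p0 = {1:.3g}'.format(q0, p0))
```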

Step 6:

  • Determine the p-value and gaussian significance of the observed signal region counts using the NLL, based on the asymptotic estimate we used above.

Step 7:

  • Determine the p-value and gaussian significance a second time, now using toys instead of the asymptotic estimate. Think carefully about the appropriate initial values to use. How many toys do you need for a reliable p-value calculation? Does this result agree well with the asymptotic approach?

  • Note: for performance reasons, it may be good to delete the toy dataset and the associated NLL in each iteration of the loop.