Solution to exercise 6

Exercise 6:
Perform a hypothesis test combining the data from Sample1a and Sample1b, following the steps outlined below.

Now combine samples 1a and 1b, separately using (1) the cut & count and (2) a likelihood fit. Perform the combination (a) by adding up the data sets as two independent data sets and (b) by considering each sample separately and combining the final statistical results. Compare the four results and discuss.

SOLUTIONS

Step 1: Extract the Sample1b tree and generate a corresponding dataset, plotting this data (also defined in terms of the real variable x).

In [ ]:
# First, get the tree and generate the dataset for Sample1b
inputtree_1b = inputfile.Get('Sample1b')
inputdata_1b = ROOT.RooDataSet('data_1b','data_1b',inputtree_1b,ROOT.RooArgSet(x))

# Plot this, for satisfaction.
xframe = x.frame()
inputdata_1b.plotOn(xframe)
xframe.Draw()
c.Draw()

Step 2: Define a PDF that describes the event counts in both SRs simultaneously. For this you will of course need parameters for SR1b that are independent of those in the model for the first SR (1a). If you wish, you can reuse the 1a model from earlier.

Plot your combined PDF as a function of the number of events observed in SR1a (this will generate a projection of the PDF onto the variable). Does this look as you expect?
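As a sanity check on what the projection should look like, here is a pure-Python sketch (no ROOT needed) of the product PDF for two independent Poisson counts; the expectations 4.0 and 6.0 are illustrative values, not the exercise's fitted numbers. Projecting the joint PDF onto one observable sums out the other, which for independent terms just gives back the single-SR Poisson:

```python
import math

def poisson(n, mu):
    # Poisson probability mass function.
    return mu**n * math.exp(-mu) / math.factorial(n)

def joint(n_a, mu_a, n_b, mu_b):
    # Product PDF for two independent signal regions.
    return poisson(n_a, mu_a) * poisson(n_b, mu_b)

# Summing out the second observable recovers the single-SR Poisson,
# which is exactly what the RooFit projection should show:
proj_a = sum(joint(5, 4.0, k, 6.0) for k in range(60))
print(abs(proj_a - poisson(5, 4.0)) < 1e-9)
```

So the projection onto Nobs should simply look like the 1a Poisson model on its own.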

In [ ]:
# Generate the poisson model for a second SR, then build the product
# PDF describing the event counts in both SRs simultaneously
w.factory('sum::Nexp1b(Nsig1b[0.,0.,100],Nbkg1b[100,0,500])')
w.factory('Poisson::pois1b(Nobs1b[100,0,500],Nexp1b)')
w.factory('PROD::cnc_1ab(pois,pois1b)')
In [ ]:
# Plot the projection of the combined pdf on Nobs
Nobsframe = Nobs.frame(500)
pois_comb = w.pdf('cnc_1ab')
pois_comb.plotOn(Nobsframe)
Nobsframe.Draw()
c.Draw()

Step 3: Perform the background sideband fit on the Sample1b dataset to extract the background prediction for this SR, then with this plot the combined PDF vs the number of events observed in SR1b. Determine the number of signal events in SR1b, based on this background estimate.
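Before the full fit, the idea behind the sideband extrapolation can be sketched in a few lines. This uses an assumed flat background shape and made-up boundaries and counts (the notebook instead integrates the fitted bkgpdf over the SR and scales by the fitted normalisation):

```python
# Illustrative sideband scaling with an assumed flat background shape.
# All numbers here are made up for the sketch, not the exercise's values.
SRmin_ex, SRmax_ex = 2.0, 8.0   # assumed signal-region boundaries
xmin_ex, xmax_ex = 0.0, 10.0    # assumed full x range
n_sideband = 40.0               # assumed sideband event count

# For a flat shape, the SR background is the sideband count scaled by
# the ratio of the SR width to the total sideband width:
width_sb = (SRmin_ex - xmin_ex) + (xmax_ex - SRmax_ex)
n_bkg_SR = n_sideband * (SRmax_ex - SRmin_ex) / width_sb
print(n_bkg_SR)
```

The fit below does the same extrapolation, but with the fitted background shape instead of a flat assumption.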

In [ ]:
sidebanddata1b = ROOT.RooDataSet('sideband_1b','sideband_1b',inputtree_1b,ROOT.RooArgSet(x),'x<{0} || x>{1}'.format(SRmin,SRmax))
fitresult1b = bkgpdf.fitTo(sidebanddata1b,RooFit.PrintLevel(-1),
                         RooFit.Range('sideband_low,sideband_high'),
                         RooFit.Extended())

bkgint_SR1b = bkgpdf.createIntegral(ROOT.RooArgSet(x),ROOT.RooArgSet(x),'signalregion')
Nbkg1b = w.var('Nbkg1b')
Nbkg1b.setVal(bkgint_SR1b.getVal() * bkgnorm.getVal())
print 'SR1a: bkg expectation = {0:.2f}'.format(Nbkg.getVal())
print 'SR1b: bkg expectation = {0:.2f}'.format(Nbkg1b.getVal())
In [ ]:
inputtree_1b.Draw('>>elist','x>{0} && x<{1}'.format(SRmin,SRmax))
elist = ROOT.gDirectory.Get('elist')
NSR1b = elist.GetN() # Note: this is a plain integer, which we'll use when integrating for the p-value
print '*** Number of total (observed) events in SR1a: {0}, SR1b: {1}'.format(NSR,NSR1b)
Nobs1b = w.var('Nobs1b')
Nobs1b.setVal(NSR1b)
In [ ]:
Nsig1b = w.var('Nsig1b')
Nsig1b.setVal(0)
Nobs1b = w.var('Nobs1b')
Nobs1bframe = Nobs1b.frame(500)
pois_comb.plotOn(Nobs1bframe)
Nobs1bframe.Draw()
c.Draw()
# Now plot the projection of the combined pdf on Nobs1b
In [ ]:
# The simple cut & count signal estimate (valid when Nobs > Nbkg)
Nbkg1b = w.var('Nbkg1b')
Nsig_cnc1b = Nobs1b.getVal() - Nbkg1b.getVal()
print 'Signal events: 1a = {0:.2f}, 1b = {1:.2f}'.format(Nsig_cnc,Nsig_cnc1b)

Step 4: Plot the combined PDF in two dimensions, you can use the createHistogram function of the pdf (https://root.cern.ch/doc/master/classRooAbsReal.html#a552a08367c964e689515f2b5c92c8bbe). This will return a 2D histogram that you can draw as usual in ROOT (to display this, use Draw("colz") to make a contour plot with colours).

In [ ]:
# For completeness, plot the 2D histogram of the PDF.
hist_2dpdf = pois_comb.createHistogram("Nobs,Nobs1b")
hist_2dpdf.Draw("colz")

c.Draw()

Step 5: Now we perform the hypothesis test.

We need to redefine our null hypothesis in this case -- stating that Nsig = Nsig1b = 0. For exclusion limits the combination is actually a little trickier, as it requires a consistent signal definition across the two regions.

For discovery, however, the hypothesis test remains straightforward as long as we use the NLL. Simply integrating over the two Poisson distributions is more difficult, because one has to choose the region of the (Nobs, Nobs1b) plane to integrate over, and it does not make sense to pick e.g. rectangular cuts.
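For a product of independent Poisson terms the NLL has a simple closed form (dropping the constant log(n!) pieces, which cancel in differences), so the combined DeltaNLL can be sketched in pure Python. The counts below are illustrative only, not the exercise's numbers:

```python
import math

def nll_poisson(nobs, nexp):
    # Poisson NLL up to the constant log(nobs!) term,
    # which cancels when taking differences.
    return nexp - nobs * math.log(nexp)

def dnll_two_regions(nobs_a, nbkg_a, nobs_b, nbkg_b):
    # Background-only hypothesis (Nsig = 0 in both SRs) minus the
    # best-fit hypothesis (Nexp = Nobs in each SR). For independent
    # regions the per-region NLLs simply add.
    nll_0sig = nll_poisson(nobs_a, nbkg_a) + nll_poisson(nobs_b, nbkg_b)
    nll_best = nll_poisson(nobs_a, float(nobs_a)) + nll_poisson(nobs_b, float(nobs_b))
    return nll_0sig - nll_best

print('{0:.3f}'.format(dnll_two_regions(120, 100.0, 115, 100.0)))
```

The RooFit NLL built below does the same thing, just evaluated through the workspace objects.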

Generate the NLL for the combined PDF, then draw it as a function of:

  1. the number of signal events in SR1a
  2. the number of signal events in SR1b
  3. in 2D as a function of both of the above
In [ ]:
Nobs_bkg1ab = ROOT.RooDataSet('Nobs_bkg1ab', 'Nobs_bkg1ab', ROOT.RooArgSet(Nobs,Nobs1b))
Nobs_bkg1ab.add(ROOT.RooArgSet(Nobs,Nobs1b))
nll_cnccomb = pois_comb.createNLL(Nobs_bkg1ab)

Nsigframe = Nsig.frame()
nll_cnccomb.plotOn(Nsigframe)
Nsigframe.Draw()
c.Draw()

Nsig1bframe = Nsig1b.frame()
nll_cnccomb.plotOn(Nsig1bframe)
Nsig1bframe.Draw()
c.Draw()
In [ ]:
hist_2dnll = nll_cnccomb.createHistogram("Nsig,Nsig1b",100,100)
hist_2dnll.Draw("colz")
c.Draw()

Step 6: Determine the p-value and gaussian significance of the observed signal region counts using the NLL, based on the asymptotic estimate we used above.
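The conversion itself needs no ROOT; here is a sketch with an assumed DeltaNLL value (not the notebook's result). Following the notebook's recipe, q0 = 2*DeltaNLL is compared to a half chi-square with one degree of freedom, so TMath.Prob(q0, 1)/2 equals 0.5*erfc(sqrt(q0/2)):

```python
import math

# Assumed DeltaNLL for illustration only:
dnll_example = 3.0

q0 = 2.0 * dnll_example
p0 = 0.5 * math.erfc(math.sqrt(q0 / 2.0))  # one-sided chi^2(1) tail / 2
z = math.sqrt(q0)                          # Gaussian significance, asymptotic limit
print('p0 = {0:.3g}, Z = {1:.2f}'.format(p0, z))
```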

In [ ]:
Nsig.setVal(Nsig_cnc)
Nsig1b.setVal(Nsig_cnc1b)
nll_cnccomb_obs = nll_cnccomb.getVal(ROOT.RooArgSet(Nsig,Nsig1b))
In [ ]:
Nsig.setVal(0)
Nsig1b.setVal(0)
nll_cnccomb_0sig = nll_cnccomb.getVal(ROOT.RooArgSet(Nsig,Nsig1b))
In [ ]:
p0_comb_nosyst = ROOT.TMath.Prob( 2*(nll_cnccomb_0sig-nll_cnccomb_obs), 1)/2
print "P-value based on DeltaNLL asymptotics = {0:.3g}".format(p0_comb_nosyst)
z = ROOT.RooStats.PValueToSignificance(p0_comb_nosyst)
print "Significance based on DeltaNLL asymptotics = {0:.3g}".format(z)
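For combination strategy (b) from the exercise statement, each SR's test statistic can be evaluated separately and combined at the end. Because the regions are independent, the per-region DeltaNLL values simply add, so for this counting model the result reproduces the joint-fit combination. The counts below are illustrative, not the notebook's fit results:

```python
import math

def q0_one_region(nobs, nbkg):
    # Discovery test statistic for a single Poisson count,
    # with best-fit Nsig = max(Nobs - Nbkg, 0).
    if nobs <= nbkg:
        return 0.0
    return 2.0 * (nobs * math.log(nobs / nbkg) - (nobs - nbkg))

# Combine the final per-region statistics by adding the q0 values:
q0_comb = q0_one_region(120, 100.0) + q0_one_region(115, 100.0)
z_comb = math.sqrt(q0_comb)
print('Combined Z = {0:.2f}'.format(z_comb))
```

Comparing this with the joint-fit DeltaNLL above is exactly the (a)-versus-(b) comparison the exercise asks for.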

Step 7: Determine the p-value and gaussian significance a second time, now using toys instead of the asymptotic estimate. Think carefully about the appropriate initial values to use. How many toys do you need for a reliable p-value calculation? Does this result agree well with the asymptotic approach?

  • Note: for performance reasons, it may be good to delete the toy dataset and the associated NLL in each iteration of the loop.
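On the question of how many toys are needed: the toy-based p-value p_hat = k/N is a binomial estimate with statistical error sqrt(p*(1-p)/N). A quick back-of-the-envelope (the target p-value and relative error below are assumed, not prescribed by the exercise) shows why O(1e5) toys are needed to measure a per-mille p-value to ~10%:

```python
# Number of toys needed so that the binomial error on p_hat = k/N,
# sqrt(p*(1-p)/N), is a given relative fraction of p.
def toys_needed(p, rel_err):
    return (1.0 - p) / (p * rel_err**2)

# Assumed target: p ~ 1e-3 measured to ~10% relative precision.
print(int(round(toys_needed(1e-3, 0.1))))
```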
In [ ]:
initialvals = {
    "Nsig": 0,
    "Nsig1b": 0,
    "Nbkg": Nbkg.getVal(),
    "Nbkg1b": Nbkg1b.getVal(),
}
print "Initial values: "
for key, val in sorted(initialvals.iteritems()):
    print "  {0:6s}: {1:.1f}".format( key, val )

dnll_toydist = ROOT.TH1D('dnll_toydist','Distribution of observed NLL values',100,0,10)

print Nobs.getVal(), Nobs1b.getVal()
In [ ]:
Ntoys = 100000
Ntoys_0sig = 0

dnll_nom = nll_cnccomb_0sig - nll_cnccomb_obs

# Loop over the number of toys we want to do
import sys
print "Begin throwing toys"
dnll_toydist.Reset()
for counter in xrange(Ntoys):
    for key, val in initialvals.iteritems():
        w.var(key).setVal(val)
    toydata = pois_comb.generate(ROOT.RooArgSet(Nobs,Nobs1b),1)
    Nobs.setVal(toydata.get(0).getRealValue('Nobs'))
    Nobs1b.setVal(toydata.get(0).getRealValue('Nobs1b'))
    toy_Nsig_1a = Nobs.getVal() - initialvals['Nbkg']
    toy_Nsig_1b = Nobs1b.getVal() - initialvals['Nbkg1b']

    nll_toy = pois_comb.createNLL(toydata)
    Nsig.setVal(toy_Nsig_1a)
    Nsig1b.setVal(toy_Nsig_1b)
    nll_toy_obs = nll_toy.getVal(ROOT.RooArgSet(Nsig,Nsig1b))
    Nsig.setVal(0)
    Nsig1b.setVal(0)
    nll_toy_0sig = nll_toy.getVal(ROOT.RooArgSet(Nsig,Nsig1b))
    
    dnll_toy = nll_toy_0sig - nll_toy_obs
    dnll_toydist.Fill(dnll_toy)
    if dnll_toy>dnll_nom:
        Ntoys_0sig += 1
    if counter%(Ntoys/20) == 0:
        print 'Now on toy number... {0}'.format(counter)
        sys.stdout.flush()

    toydata.Delete()
    nll_toy.Delete()
dnll_toydist.Draw()
theline = ROOT.TLine(dnll_nom,0,dnll_nom,dnll_toydist.GetMaximum())
theline.Draw()
c.Draw()
In [ ]:
print 'DeltaNLL(0 signal, obs) = {0:.3g}'.format(dnll_nom)
print 'Fraction of toys with DeltaNLL >= observed (toy-based p-value) = {0:.3g}'.format(float(Ntoys_0sig)/Ntoys)
In [ ]:
# Raw counts, to judge the binomial precision of the toy-based p-value
print Ntoys_0sig, Ntoys