Significance Estimations using Multivariate Analysis of Single Top Monte Carlo

Here we return to early data (below 1 fb^-1) and examine the impact of a multivariate analysis on the ability to observe the single-top signal. EarlyDataStudy examined the ability of a cut-based analysis to do so, and this study extends that analysis. The same data set and the same preselection cuts are used, with one small modification: duplicated events were found in the signal files of the original set, which has a severe impact on a multivariate analysis. These were removed by hand so that no two events in the same signal channel share an event number. The analysis of this set shows an improvement over the cut-based analysis and the possibility of at least evidence for the single-top process in early data.

The updated event yields, weighted to 100 pb^-1, are given in the table below for samples containing 1 or 2 b-tagged jets and a muon or an electron. Two selections are studied, one isolating events with 1 b-tagged jet and the other isolating events with 2 b-tagged jets. As in the earlier cut-based analysis, the early b-tagger IP2D is used. Some of the Pt cuts are also adjusted to slightly lower thresholds; for example, the second leading jet Pt is required to be greater than 25 GeV. These adjustments are the same as in the cut-based analysis.

| Process | muon (1 b-jet) | electron (1 b-jet) | muon (2 b-jet) | electron (2 b-jet) | both (1 b-jet) | both (2 b-jet) |
| t-channel | 408.7 | 312.0 | 140.4 | 123.2 | 720.8 | 263.6 |
| s-channel | 17.2 | 12.2 | 10.5 | 8.24 | 29.4 | 18.8 |
| Wt-channel | 160.5 | 124.3 | 50.2 | 43.3 | 284.8 | 93.6 |
| ttbar to lep+jets | 925.2 | 747.9 | 565.1 | 451.2 | 1673.1 | 1016.3 |
| ttbar to lep+lep | 415.1 | 319.6 | 275.8 | 206.1 | 734.7 | 481.9 |
| Wjets | 1562.2 | 1023.2 | 125.8 | 87.3 | 2585.4 | 213.1 |
| Wbbjets | 49.3 | 32.2 | 27.0 | 18.0 | 81.5 | 45.0 |
| WW | 26.6 | 3.5 | 19.2 | 2.2 | 45.7 | 5.7 |
| WZ | 7.7 | 5.7 | 3.1 | 2.4 | 13.3 | 5.5 |
| S/B | 0.20 | 0.20 | 0.21 | 0.23 | 0.20 | 0.21 |
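
As a cross-check of the last row, the S/B values for the two combined (muon + electron) columns can be recomputed from the yields listed in the table. The short Python sketch below does this; the dictionary layout and the function name are illustrative choices, not part of the original analysis code.

```python
# Combined (muon + electron) yields at 100 pb^-1, copied from the table:
# each tuple is (1 b-jet selection, 2 b-jet selection).
SIGNAL = {
    "t-channel": (720.8, 263.6),
    "s-channel": (29.4, 18.8),
    "Wt-channel": (284.8, 93.6),
}
BACKGROUND = {
    "ttbar to lep+jets": (1673.1, 1016.3),
    "ttbar to lep+lep": (734.7, 481.9),
    "Wjets": (2585.4, 213.1),
    "Wbbjets": (81.5, 45.0),
    "WW": (45.7, 5.7),
    "WZ": (13.3, 5.5),
}


def signal_over_background(column):
    """Return S/B for one selection (0 = 1 b-jet, 1 = 2 b-jet)."""
    signal = sum(yields[column] for yields in SIGNAL.values())
    background = sum(yields[column] for yields in BACKGROUND.values())
    return signal / background


sb_1tag = signal_over_background(0)
sb_2tag = signal_over_background(1)
print(f"S/B (1 b-jet) = {sb_1tag:.2f}, S/B (2 b-jet) = {sb_2tag:.2f}")
```

Because all three single-top channels enter the numerator, the ratio is larger than it would be for a single-channel signal definition.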

The S/B ratio is about 0.2. This is high compared to other single-top analyses because the three signal channels are combined here. Slightly different cut selections and the different b-tagger probably also contribute. The background composition differs from the CSC note: here W+jets makes up a larger share relative to ttbar. This is understood to be a result of the lowered Pt thresholds and the change of b-tagger. The overall effect of the cut changes is to increase the total number of events and slightly increase the S/B ratio. The larger proportion of W+jets may also improve the results of the analysis, since it is likely to be easier to remove than ttbar, but this has not been studied extensively.

Additional Variable Selection

This analysis required the use of several variables beyond those used for preselection cuts. Explanations are given here:
  • Ht refers to the sum of the Pt of the particles indicated (except for MET, for which we used Et)
  • DeltaR refers to sqrt{(Delta Eta)^2 + (Delta Phi)^2} for the particles in question
  • H is the sum of the P (momentum) of the particles
  • lepton refers to the leading lepton
  • centrality is (sum Pt/sum P)
  • Transverse W mass is sqrt{(leptonPt + MET)^2 - (leptonP_x + MEX)^2 - (leptonP_y + MEY)^2}
  • Jet1 refers to the leading jet, jet2 refers to the second leading jet, etc.
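
The definitions above can be written out as a minimal Python sketch. The function names, the (px, py, pz) tuple layout, and the wrapping of Delta Phi into [-pi, pi] are assumptions made for illustration; the original text does not say how Delta Phi was wrapped or how the quantities were coded.

```python
import math


def ht(particles):
    """Sum of transverse momenta; particles are (px, py, pz) tuples."""
    return sum(math.hypot(px, py) for px, py, _ in particles)


def h(particles):
    """Sum of momentum magnitudes."""
    return sum(math.sqrt(px * px + py * py + pz * pz) for px, py, pz in particles)


def centrality(particles):
    """Centrality = sum Pt / sum P."""
    return ht(particles) / h(particles)


def delta_r(eta1, phi1, eta2, phi2):
    """DeltaR = sqrt(dEta^2 + dPhi^2), with dPhi wrapped into [-pi, pi] (assumed)."""
    d_eta = eta1 - eta2
    d_phi = math.remainder(phi1 - phi2, 2.0 * math.pi)
    return math.hypot(d_eta, d_phi)


def transverse_w_mass(lep_px, lep_py, mex, mey):
    """sqrt((leptonPt + MET)^2 - (leptonP_x + MEX)^2 - (leptonP_y + MEY)^2)."""
    lep_pt = math.hypot(lep_px, lep_py)
    met = math.hypot(mex, mey)
    m2 = (lep_pt + met) ** 2 - (lep_px + mex) ** 2 - (lep_py + mey) ** 2
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding errors


# A lepton and MET that are back to back in the transverse plane (toy values).
print(transverse_w_mass(40.0, 0.0, -40.0, 0.0))
```

For a lepton and missing energy that are exactly collinear the transverse mass vanishes, while the back-to-back toy configuration gives the maximum value of twice the lepton Pt.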

The following are the 33 variables used in the multivariate analysis:

  • Ht(jet1, jet2), Ht(all jets), Ht(lepton, MET)
  • Delta Pt(jet1, jet2), Delta Pt(btaggedjet1, untaggedjet1)
  • DeltaR(untaggedjet1, lepton), DeltaR(jet1, lepton), DeltaR(jet1, jet2)
  • P(last jet), H(jet1, jet2), MET (Missing Et)
  • Pt(jet1), Pt(jet2), Pt(btaggedjet1), Pt(untaggedjet1), Pt(lepton)
  • Pt(jet1) + Pt(jet2)
  • Eta(btaggedjet1), Eta(untaggedjet1)
  • Maximum jet eta in event
  • Minimum jet eta in event
  • Eta(jet2), Eta(lepton), Phi(lepton)
  • Number of jets, Number of untagged jets
  • Transverse mass of W
  • Top quark mass reconstructed with the b-tagged jet
  • Top quark mass reconstructed with the leading jet
  • Invariant mass of jet1 + jet2
  • Invariant mass of all jets
  • Centrality(last jet, lepton), Centrality(jet1, jet2)

When choosing the variables for this analysis, the significance on the validation sample was estimated several times while varying the number of variables, to see whether the dimensionality could be reduced without a loss of significance. However, the significance appears to increase roughly linearly with the dimensionality, as can be seen in the figure below. This indicates that more variables may be useful in future studies and that, for this study, all 33 variables should be used to optimize the significance.

Significance vs Dimensionality, 1 b-tag Sample

Multivariate Analysis Programming Packages

Two analysis packages were initially considered, TMVA and SPR. In both cases, the general procedure was the same. The Monte Carlo data were split into three sets, used for the training (trial), validation, and yield phases of the analysis. Because SPR cannot handle negatively weighted events in the training cycle, these events were removed from the training set (although not from the validation or yield sets). Additionally, the events in the training sets were randomized to eliminate any order dependency introduced by merging the trees from files of different event types. The randomized sets were used to train the classifier, and the resulting significances were noted. For the purposes of this paper, the randomized data set that produced the best trained classifiers was used for validation.
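
The sample preparation described above can be sketched in a few lines of Python. This is only an illustration: the round-robin split into equal thirds, the `weight` key, and the function name are assumptions, since the text does not state how the three sets were formed or how events were stored.

```python
import random


def split_and_prepare(events, seed=0):
    """Split events into training, validation and yield sets.

    Assumptions (not from the original analysis): events are dicts with a
    "weight" key, and the split is a simple round-robin into three sets.
    """
    training, validation, yield_set = [], [], []
    for index, event in enumerate(events):
        (training, validation, yield_set)[index % 3].append(event)

    # Negatively weighted events are dropped from the training set only.
    training = [event for event in training if event["weight"] >= 0.0]

    # Shuffle the training set to remove any ordering left over from
    # merging files of different event types.
    random.Random(seed).shuffle(training)
    return training, validation, yield_set


# Toy events: every fifth event carries a negative weight.
toy_events = [{"id": i, "weight": -1.0 if i % 5 == 0 else 1.0} for i in range(30)]
train, valid, yld = split_and_prepare(toy_events, seed=1)
print(len(train), len(valid), len(yld))
```

Changing `seed` gives a different ordering of the same training events, which is one way to produce several randomized training sets from one sample.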

An initial analysis was done to determine which of the classifiers best separated background (ttbar, W+jets, W+bb+jets, WW, WZ) from signal (t-channel, s-channel, Wt). In this phase, no systematics were included and efficiency curves were examined. The goal was to minimize the background efficiency and maximize the signal efficiency. The best classifier for TMVA was found to be a boosted decision tree, and the best classifier for SPR was a random forest. Classifier output plots for both are shown below, in log scale.

Random Forest, SPR, 1 b-tag Sample (hBagger1_b-Tagged_Jet_CO_validation.gif)
Random Forest, SPR, 2 b-tag Sample (hBagger2_b-Tagged_Jets_CO_validation.gif)

Boosted Decision Tree, TMVA, 1 b-tag Sample (hBDT_CO_validated_nice_nolum_log.gif)
Boosted Decision Tree, TMVA, 2 b-tag Sample (hBDT_CO_validated_nice_nolum_log_tag2.gif)

The figures below show the efficiency curves from the best TMVA and SPR classifiers, zoomed in on the interesting region in the lower left. When a cut on the classifier output is chosen to maximize the significance in both cases, the resulting efficiencies show that TMVA has a lower signal efficiency than SPR, so the TMVA classifier does not perform as well. From this point on, the analysis focused on the SPR random forest classifier.

Efficiency, 1 b-tag Sample (EFF_tmva_spr_tag1_yield.gif)
Efficiency, 2 b-tag Sample (EFF_tmva_spr.gif)
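
To make the efficiency comparison concrete, the sketch below scans a cut on a classifier output and records the signal efficiency, the background efficiency, and a figure of merit at each cut value. It uses S/sqrt(B) as a simple stand-in for the significance; the actual study used the methods described in the next section. The function names, the `(output, weight)` event layout, and the toy numbers are all hypothetical.

```python
import math


def efficiency_scan(signal, background, thresholds):
    """Scan cuts on the classifier output.

    signal, background: lists of (classifier_output, weight) pairs.
    Events with output above the cut are kept.
    """
    s_total = sum(weight for _, weight in signal)
    b_total = sum(weight for _, weight in background)
    rows = []
    for cut in thresholds:
        s = sum(weight for output, weight in signal if output > cut)
        b = sum(weight for output, weight in background if output > cut)
        figure_of_merit = s / math.sqrt(b) if b > 0 else float("nan")
        rows.append({
            "cut": cut,
            "sig_eff": s / s_total,
            "bkg_eff": b / b_total,
            "s_over_sqrt_b": figure_of_merit,
        })
    return rows


def best_cut(rows):
    """Return the scan point with the largest defined S/sqrt(B)."""
    valid = [row for row in rows if not math.isnan(row["s_over_sqrt_b"])]
    return max(valid, key=lambda row: row["s_over_sqrt_b"])


# Toy classifier outputs with unit weights (illustrative only).
toy_signal = [(0.9, 1.0), (0.8, 1.0), (0.6, 1.0), (0.3, 1.0)]
toy_background = [(0.7, 1.0), (0.4, 1.0), (0.2, 1.0), (0.1, 1.0)]
scan = efficiency_scan(toy_signal, toy_background, [0.0, 0.25, 0.5, 0.75])
best = best_cut(scan)
print(best["cut"], best["sig_eff"], best["bkg_eff"])
```

Plotting `sig_eff` against `bkg_eff` for two classifiers on the same axes gives the kind of efficiency curve compared in the figures.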

Multivariate Analysis with Systematics

To estimate the systematics, the jet energies in the events were scaled up and down by 10 percent, as in the cut-based analysis, and the b-tagging efficiency was also varied up and down, to determine the average effect of each of these systematics. The average difference from the unaltered background events was found and then expressed as a percentage of the background sum for each systematic. After a cut was made, the remaining background events were multiplied by these percentages to determine the error from the b-tagging and jet energy scale systematics. The luminosity error was taken as 20 percent of the total background events for the cross-section and Vtb calculations (it is not included in the significance). The background cross-section error was estimated at 12 or 14 percent after preselection cuts, given a 10 percent error for ttbar and 20 percent for W+jets. ISR/FSR was taken as 10 percent, and an additional 5 percent was allocated for other errors. Additionally, the square root of the sum of the squares of the weights was calculated to determine the Monte Carlo statistical uncertainty. The error percentages on the background sum after a classifier output cut for SPR's arcx4 classifier are given in the table below.

| Errors | Variation | 1 b-tagged jet percentage | 2 b-tagged jet percentage |
| Data Statistics | | 26 | 36 |
| JES | 10 | 17 | 5 |
| B-tagging | 5 | 9 | 13 |
| MC Statistics | | 17 | 30 |
| Background x-section | 10, 20 | 15 | 13 |
| ISR/FSR | 10 | 10 | 10 |
| Other | 5 | 5 | 5 |
| Total Uncertainty | | 32 | 37 |

The significance was calculated in two ways, a sideband method (cross-check) and a Gaussian-pdf method (primary), both of which involve a frequentist solution to estimate the significance ("Evaluation of three methods for calculating statistical significance when incorporating a systematic uncertainty into a test of the background-only hypothesis for a Poisson process", by R.D. Cousins, J.T. Linnemann, and J. Tucker, 23 Aug. 2008). The systematics discussed above are used in the Gaussian-pdf method, which assumes a Gaussian pdf for the background mean. In both cases, the calculation was done with the code in Appendix E of that paper. The sideband method additionally requires a leftward cut to isolate a background region with essentially no signal, from which the amount of background is determined. In this region, S/B was required to stay below 5 percent, and a significant number of background events was required for comparison.
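
The totals in the uncertainty table can be cross-checked with a short Python sketch. It assumes the individual percentages are added in quadrature and that the data statistics row is left out of the total; the second assumption is inferred from the numbers rather than stated in the text. The helper for the Monte Carlo statistical uncertainty follows the "square root of the sum of the squares of the weights" description, with hypothetical function names.

```python
import math


def quadrature_sum(values):
    """Combine independent uncertainties in quadrature."""
    return math.sqrt(sum(value * value for value in values))


def mc_stat_uncertainty(weights):
    """Monte Carlo statistical uncertainty: sqrt of the sum of squared weights."""
    return math.sqrt(sum(weight * weight for weight in weights))


# Percentages from the table: (1 b-tagged jet, 2 b-tagged jets).
# The data statistics row is deliberately excluded (assumption, see lead-in).
CONTRIBUTIONS = {
    "JES": (17, 5),
    "B-tagging": (9, 13),
    "MC Statistics": (17, 30),
    "Background x-section": (15, 13),
    "ISR/FSR": (10, 10),
    "Other": (5, 5),
}

total_1tag = quadrature_sum(v[0] for v in CONTRIBUTIONS.values())
total_2tag = quadrature_sum(v[1] for v in CONTRIBUTIONS.values())
print(f"total: {total_1tag:.0f}% (1 b-tag), {total_2tag:.0f}% (2 b-tag)")
```

With these inputs the quadrature sums round to the totals quoted in the table.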

Statistics also had to be considered. The significance estimation assumes a Gaussian shape for the background mean, and without enough statistics the results can be misleading. To allow comfortably for a Gaussian shape, at least 18 unweighted background events were required. The chosen SPR classifier has a significance curve that peaks well before this requirement becomes an issue, although the requirement played some role in the overall choice of classifier parameters.

Running the trained random forest classifier over the final sample, the yield sample, gave significances of 3.7 sigma for the 1 b-tagged jet selection and about 3.2 sigma for the 2 b-tagged jet selection, with the sideband cross-check values listed in the table below. The table also shows the expected signal and background obtained while estimating the Gaussian pdf-based significance. If the weights are adjusted to give 1 fb^-1 of luminosity, we expect a significance of 4.1 sigma for the 2 b-tagged sample and 4.4 sigma for the 1 b-tagged sample, even with the limitations of this study. Luminosity plots are shown below.

| Properties | 1 b-Tagged Jet, SPR RF | 2 b-Tagged Jets, SPR RF |
| Gaussian pdf Significance | 3.7 | 3.2 |
| Sideband Significance | 4.3 | 3.2 |
| Expected Signal | 35.0 | 19.7 |
| Expected Background | 15.0 +/- 4.8 | 7.5 +/- 2.8 |

Luminosity, 1 b-tag Sample (hBagger1_b-Tagged_Jet_LUM_sys_validation.gif)
Luminosity, 2 b-tag Sample (hBagger2_b-Tagged_Jets_LUM_sys_validation.gif)

Although no actual data are considered in this analysis, it is still interesting to consider the cross-section and the uncertainties on it based on the simulated events. This calculation included the systematic uncertainties listed earlier. Here, we also include a statistical uncertainty and a separate luminosity uncertainty at 20%. Systematic uncertainties on the cross-section are listed below:

| Errors | Variation | 1 b-tagged jet percentage | 2 b-tagged jet percentage |
| JES | 10 | 14 | 5 |
| B-tagging | 5 | 5 | 18 |
| MC Statistics | | 23 | 34 |
| Background x-section | 10, 20 | 7 | 5 |
| ISR/FSR | 10 | 14 | 14 |
| Other | 5 | 7 | 7 |
| Total Systematic Uncertainty | | 33 | 42 |

For the 1 b-tagged jet selection, the cross-section is 323 + 105(sys) - 105(sys) +/- 65(stat) +/- 65(lum) pb. For the 2 b-tagged jet selection, the cross-section is 323 + 135(sys) - 138(sys) +/- 85(stat) +/- 65(lum) pb. In both cases the uncertainties are quite large, as one might expect in an early data situation. Notice that the systematic uncertainties are larger than the statistical uncertainties. This indicates that the dominant problem for 100 pb^-1 of integrated luminosity is not the number of data events expected but the systematic errors, which we can try to reduce by improving the analysis.

Once the cross-section is obtained, the |Vtb| value from the quark mixing matrix can also be calculated. In this case the signal theoretical cross-section is also included in the systematic uncertainties, with a value of 5%, although it has a negligible impact on the overall numbers. Expressing |Vtb, exp| as

|Vtb, exp| = sqrt{xsec_exp / xsec_sm} |Vtb, sm|

and taking |Vtb, sm| to be 1, since the ratio of the cross-sections in this case is also 1, the uncertainties may be simply propagated to give the |Vtb, exp| value. For the 1 b-tagged jet selection, |Vtb, exp| is 1.0 + 0.16(sys) - 0.16(sys) +/- 0.10(stat) +/- 0.10(lum), and for the 2 b-tagged jet selection, |Vtb, exp| is 1.0 + 0.21(sys) - 0.21(sys) +/- 0.13(stat) +/- 0.10(lum).
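
Because |Vtb, exp| depends on the square root of the cross-section, a first-order propagation halves each relative uncertainty. The Python sketch below applies this to the quoted cross-section uncertainties. The function name is illustrative, and the 5% theoretical cross-section term is left out because the text describes its impact as negligible.

```python
XSEC_PB = 323.0  # central cross-section value quoted in the text, in pb


def vtb_uncertainty(xsec_uncertainty_pb, xsec_pb=XSEC_PB, vtb=1.0):
    """First-order propagation for |Vtb| proportional to sqrt(xsec).

    d|Vtb| / |Vtb| = 0.5 * d(xsec) / xsec
    """
    return vtb * 0.5 * xsec_uncertainty_pb / xsec_pb


# Cross-section uncertainties in pb, as quoted for each selection.
one_tag = {"sys+": 105.0, "sys-": 105.0, "stat": 65.0, "lum": 65.0}
two_tag = {"sys+": 135.0, "sys-": 138.0, "stat": 85.0, "lum": 65.0}

vtb_one_tag = {name: round(vtb_uncertainty(err), 2) for name, err in one_tag.items()}
vtb_two_tag = {name: round(vtb_uncertainty(err), 2) for name, err in two_tag.items()}
print(vtb_one_tag)
print(vtb_two_tag)
```

The rounded results agree with the |Vtb, exp| uncertainties quoted for both selections.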

Both selections show large uncertainties, which is to be expected in early data. This analysis also leaves some room for improvement by future analyses. All of the variables listed in this paper were used, and additional variables may improve the result. There could also be some improvement from combining the two selections, which is beyond the scope of this study.

Additional Studies

Effect of Statistics on Significance

Even with a combined signal analysis, the raw counts may be somewhat low for a 33-variable space. The effect of statistics was therefore studied to see what improvement, if any, increased statistics would give to the results. Several training subsets were made, each containing a given percentage of the total training set. For each percentage, at least eight samples were produced, and of these, the sample with the highest significance was chosen. The figure below on the left shows S/sqrt{B} versus the percentage of the total training set used to train the classifier. The 2 b-tagged set seems to level off somewhat with more training events, but the 1 b-tagged sample is still increasing. Thus, more events in the training sample would improve the significance, particularly in the 1 b-tagged set. Additionally, if the proportions of events are kept the same, the validation set would also grow in number of events and the MC statistical uncertainty should decrease as well.

Significance vs Percentage of Training Sample (Sig_percent_trial_r5_1100_nice.gif)
Effect of Unweighted Background Cut (Sig_cut_percent_trial_all_r5_1100_nice.gif)
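
The subset study can be sketched as follows: for a given fraction of the training set, draw several random subsets, score each one, and keep the best. In this Python sketch the `score` callable stands in for "train the classifier on this subset and return its significance"; that callable, the function name, and the default of eight draws per fraction are assumptions for illustration.

```python
import random


def best_subset(training, fraction, score, n_samples=8, seed=0):
    """Draw n_samples random subsets of the given fraction and keep the best.

    score: caller-supplied function mapping a subset to a figure of merit
    (hypothetical stand-in for training a classifier and evaluating it).
    """
    rng = random.Random(seed)
    size = max(1, round(fraction * len(training)))
    best_value, best_events = None, None
    for _ in range(n_samples):
        subset = rng.sample(training, size)  # sampling without replacement
        value = score(subset)
        if best_value is None or value > best_value:
            best_value, best_events = value, subset
    return best_value, best_events


# Toy usage: integers stand in for events and sum() for the significance.
toy_training = list(range(100))
value, subset = best_subset(toy_training, 0.25, score=sum, n_samples=8, seed=3)
print(len(subset), value)
```

Repeating the call for a range of fractions gives one point per fraction, which is the structure of the significance-versus-percentage curve. Note that picking the best of several draws biases each point upward, which is worth keeping in mind when reading such a curve.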

Variation in Significances of Yield and Validation Samples

In doing this study, the significances from the validation and yield samples were found to be somewhat different. For the benefit of future studies, the calculations were repeated many times to see whether this result was chance or a trend. For the 1 and 2 b-tagged jet samples, eight randomized training sets were produced; the ones that gave the highest validation significances were used in the main study of this paper. All of these sets were then used to train classifiers for both the TMVA boosted decision tree and the SPR random forest. Each classifier was applied to the validation and yield samples, and the significances from the Gaussian-pdf method were noted. These significances are shown in the figures below, on the left for the 1 b-tagged jet selection and on the right for the 2 b-tagged jet selection. In general, the classifiers show a range of significances for both the validation and yield samples. This variation may help to explain the slight fluctuations in the 2 b-tagged sample curve in the previous section that could not be explained by the impact of the unweighted background count cut. A line with slope 1 is also plotted in the figures. For both classifiers and both b-tagging selections, most points lie below this line, indicating that the validation significances are generally higher than the yield significances. The classifier parameters are chosen to optimize the validation significance, so these plots may indicate that the parameters are too finely tuned to the validation sample. In general, selecting a classifier that gives a high validation significance will not necessarily give a high yield significance. It may be more useful for future studies to combine these classifiers rather than choosing one.

Validation vs Yield Significances, 1 b-tagged jet (Sig_val_yield.gif)
Validation vs Yield Significances, 2 b-tagged jets (Sig_val_yield_new.gif)

Additional studies have also been done on the MC statistical uncertainty. Specifically, if more events are generated, which event types would improve the statistical uncertainty the most? By removing the MC statistical uncertainty of a particular event type from the overall uncertainty and comparing, it was found that not having to weight the t-channel MC would reduce the MC statistical uncertainty by 66%. This was the biggest improvement and not a surprise, as only 50,000 t-channel events were generated despite the large cross-section, and many of those were removed because they were duplicated. Not having to weight the backgrounds improved the MC statistical uncertainty by about 30%. In the 1 b-tagged jet case, it was slightly preferable to increase the number of W+jets events rather than ttbar. In the 2 b-tagged jets case, an improvement in ttbar was much preferred over an improvement in W+jets. Overall, however, this analysis would benefit most from the generation of more t-channel events.
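
The comparison described above can be sketched in Python: the overall MC statistical uncertainty is the square root of the summed squared weights over all event types, and the gain from a given type is found by dropping its contribution and comparing. The function names and the per-type numbers below are hypothetical and chosen only to make the arithmetic easy to follow; they are not values from this analysis.

```python
import math


def mc_stat(sum_w2_by_type):
    """Overall MC statistical uncertainty from per-type sums of squared weights."""
    return math.sqrt(sum(sum_w2_by_type.values()))


def reduction_without(sum_w2_by_type, event_type):
    """Fractional reduction of the MC statistical uncertainty when the
    contribution of one event type is removed."""
    total = mc_stat(sum_w2_by_type)
    remaining = {k: v for k, v in sum_w2_by_type.items() if k != event_type}
    return 1.0 - mc_stat(remaining) / total


# Hypothetical sums of squared weights per event type (illustrative only).
toy_sum_w2 = {"t-channel": 16.0, "ttbar": 5.0, "W+jets": 4.0}
for name in toy_sum_w2:
    print(name, f"{reduction_without(toy_sum_w2, name):.0%}")
```

Because the contributions add in quadrature, the event type with the largest summed squared weights dominates, which is why a sample with few generated events and a large cross-section stands out.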

-- JennyHolzbauer - 06 Jan 2009 -- JennyHolzbauer - 01 Apr 2009 -- JennyHolzbauer - 09 Jun 2009
Topic revision: r7 - 16 Oct 2009, TomRockwell