Difference: EarlyDataStudyWithMVA (1 vs. 7)

Revision 7
16 Oct 2009 - Main.TomRockwell
Line: 1 to 1
Changed:
<
<
META TOPICPARENT name="Trash.Tier3WebHome"
>
>
META TOPICPARENT name="Trash.Trash/Tier3WebHome"
 

Significance Estimations using Multivariate Analysis of Single Top Monte Carlo

Here, we return to early data (below 1 fb^-1), looking at the impact of a multivariate analysis on the capability to observe the single-top signal. EarlyDataStudy looked at the ability of a cut-based analysis to do so, and this is an extension of that analysis. The same data set, with the same preselection cuts, is used, with one small modification: duplicated events were found in the signal files of the original set (which has a severe impact on a multivariate analysis), and these were removed by hand so that no two events in the same signal channel share an event number. The analysis on this set showed improvement over the cut-based analysis and the possibility of at least evidence for the single-top process in early data.
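The duplicate removal described above was done by hand. As a minimal sketch of the same rule (no two events in one signal channel share an event number), assuming each event is a dictionary with hypothetical `channel` and `event_number` keys, it could be automated like this:

```python
def drop_duplicate_events(events):
    """Keep only the first event seen for each (channel, event_number) pair."""
    seen = set()
    unique = []
    for event in events:
        key = (event["channel"], event["event_number"])
        if key in seen:
            continue  # same channel and event number: treat as a duplicate
        seen.add(key)
        unique.append(event)
    return unique


# Toy input (hypothetical values, not taken from the study's files).
sample = [
    {"channel": "t-channel", "event_number": 101},
    {"channel": "t-channel", "event_number": 101},  # duplicate, dropped
    {"channel": "s-channel", "event_number": 101},  # other channel, kept
]
deduplicated = drop_duplicate_events(sample)
```

Keying on the channel as well as the event number keeps events from different channels that happen to share a number.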
Revision 6
13 Oct 2009 - Main.ChipBrock
Line: 1 to 1
Changed:
<
<
META TOPICPARENT name="WebHome"
>
>
META TOPICPARENT name="Trash.Tier3WebHome"
 

Significance Estimations using Multivariate Analysis of Single Top Monte Carlo

Here, we return to early data (below 1 fb^-1), looking at the impact of a multivariate analysis on the capability to observe the single-top signal. EarlyDataStudy looked at the ability of a cut-based analysis to do so, and this is an extension of that analysis. The same data set, with the same preselection cuts, is used, with one small modification: duplicated events were found in the signal files of the original set (which has a severe impact on a multivariate analysis), and these were removed by hand so that no two events in the same signal channel share an event number. The analysis on this set showed improvement over the cut-based analysis and the possibility of at least evidence for the single-top process in early data.
Revision 5
09 Jun 2009 - Main.JennyHolzbauer
Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Significance Estimations using Multivariate Analysis of Single Top Monte Carlo

Line: 7 to 7
  The updated event yields, weighted to 100 pb^-1, can be found in the table below for samples containing 1 b-tagged jet or 2 b-tagged jets, and muons or electrons. In this analysis, two selections are studied, one isolating events with 1 b-tagged jet and the other isolating events with 2 b-tagged jets. As in the earlier, cut-based analysis, the early b-tagger IP2D is used. Some of the Pt cuts are also adjusted to slightly lower thresholds; for example, the second leading jet Pt is required to be greater than 25 GeV. These adjustments are the same as in the cut-based analysis.
processes muon (1 b-jet) electron (1 b-jet) muon (2 b-jet) electron (2 b-jet) both (1 b-jet) both (2 b-jet)
Changed:
<
<
t-channel 290.1 389.5 116.5 132.7 679.6 249.4
s-channel 11.6 16.3 7.8 10.0 27.9 17.8
wt- channel 92.1 118.3 31.4 36.6 210.4 68.1
>
>
t-channel 408.7 312.0 140.4 123.2 720.8 263.6
s-channel 17.2 12.2 10.5 8.24 29.4 18.8
wt- channel 160.5 124.3 50.2 43.3 284.8 93.6
 
Changed:
<
<
ttbar to lep+jets 729.5 899.5 205.7 270.3 1629.0 476.0
ttbar to lep+lep 316.9 411.0 444.7 547.4 727.9 992.1
Wjets 927.5 1422.8 79.1 114.0 2350.3 193.1
Wbbjets 29.2 46.0 16.8 25.7 75.2 42.5
WW 17.9 23.1 2.0 3.3 41.0 5.3
WZ 5.4 7.3 2.2 2.9 12.7 5.1
>
>
ttbar to lep+jets 925.2 747.9 565.1 451.2 1673.1 1016.3
ttbar to lep+lep 415.1 319.6 275.8 206.1 734.7 481.9
Wjets 1562.2 1023.2 125.8 87.3 2585.4 213.1
Wbbjets 49.3 32.2 27.0 18.0 81.5 45.0
WW 26.6 3.5 19.2 2.2 45.7 5.7
WZ 7.7 5.7 3.1 2.4 13.3 5.5
 
Changed:
<
<
S/B 0.19 0.19 0.21 0.19 0.19 0.20
>
>
S/B 0.20 0.20 0.21 0.23 0.20 0.21
 
Changed:
<
<
The S/B ratio is about 0.2. This is high compared to other single-top analyses because in this analysis, the signal is combined. There is probably also some impact from slightly different cut selection and the different b-tagger.
>
>
The S/B ratio is about 0.2. This is high compared to other single-top analyses because the signal channels are combined in this analysis. There is probably also some impact from the slightly different cut selection and the different b-tagger. The proportions of the backgrounds are different from the CSC note (here, there is more W+Jets relative to ttbar). This is understood to be a result of the lowering of the Pt thresholds and the changes in the b-tagger. The overall effect of the cut changes is to increase the total number of events and slightly increase the S/B ratio. The increased proportion of W+Jets may also improve the results of the analysis, as it is likely to be easier to remove than ttbar, but this has not been extensively studied.
 

Additional Variable Selection

This analysis required the use of several variables beyond those used for preselection cuts. Explanations are given here:
  • Ht refers to the sum of the Pt of the particles indicated (except for MET, for which we used Et)
Changed:
<
<
  • DeltaR refers to sqrt{(Delta Eta)2 + (Delta Pt)2 } for the particles in question
>
>
  • DeltaR refers to sqrt{(Delta Eta)^2 + (Delta Phi)^2} for the particles in question
 
  • H is the sum of the P (momentum) of the particles
  • lepton refers to the leading lepton
  • centrality is (sum Pt/sum P)
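The definitions above can be written out as a short sketch. It assumes each particle is given as a (Pt, Eta) pair with Eta the pseudorapidity, so that |P| = Pt cosh(Eta); the function names are illustrative and not taken from the analysis code.

```python
import math


def delta_phi(phi1, phi2):
    """Azimuthal separation wrapped into (-pi, pi]."""
    dphi = (phi1 - phi2) % (2.0 * math.pi)  # now in [0, 2*pi)
    if dphi > math.pi:
        dphi -= 2.0 * math.pi
    return dphi


def delta_r(eta1, phi1, eta2, phi2):
    """DeltaR = sqrt((Delta Eta)^2 + (Delta Phi)^2)."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))


def momentum(pt, eta):
    """Momentum magnitude from Pt and pseudorapidity: |P| = Pt * cosh(Eta)."""
    return pt * math.cosh(eta)


def ht(particles):
    """Ht: scalar sum of Pt over the given (pt, eta) pairs."""
    return sum(pt for pt, _ in particles)


def h(particles):
    """H: scalar sum of |P| over the given (pt, eta) pairs."""
    return sum(momentum(pt, eta) for pt, eta in particles)


def centrality(particles):
    """Centrality: sum of Pt divided by sum of |P|."""
    return ht(particles) / h(particles)
```

For example, `centrality([(50.0, 0.0), (30.0, 0.0)])` is 1 for purely central objects and falls towards 0 as the objects become more forward. Wrapping Delta Phi matters for objects on either side of the phi = ±pi boundary.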
Line: 37 to 37
 
  • Ht(jet1, jet2), Ht(all jets), Ht(lepton, MET)
  • Delta Pt(jet1, jet2), Delta Pt(btaggedjet1, untaggedjet1)
  • DeltaR(untaggedjet1, lepton), DeltaR(jet1, lepton), DeltaR(jet1, jet2)
Changed:
<
<
  • H(all jets), H(jet1, jet2), MET (Missing Et)
>
>
  • P(last jet), H(jet1, jet2), MET (Missing Et)
 
  • Pt(jet1), Pt(jet2), Pt(btaggedjet1), Pt(untaggedjet1), Pt(lepton)
  • Pt(jet1) + Pt(jet2)
  • Eta(btaggedjet1), Eta(untaggedjet1)
Line: 50 to 50
 
  • Mass of the leading jet top quark
  • Invariant mass of jet1 + jet2
  • Invariant mass of all jets
Changed:
<
<
  • Centrality(all jets, lepton),Centrality(jet1, jet2)
>
>
  • Centrality(last jet, lepton), Centrality(jet1, jet2)
 

When considering the variables used in this analysis, the significance was estimated based on the validation sample several times while varying the number of variables used, to see if the dimensionality could be reduced without a loss of significance. However, there appears to be a linear correlation between significance and dimensionality, as can be seen in the figure below. This indicates that more variables may be useful in future studies and that, for this study, all 33 variables should be used to optimize the significance.
Line: 67 to 67
 
Figures: Boosted Decision Tree, TMVA, 1 b-tag Sample (hBDT_CO_validated_nice_nolum_log.gif); Boosted Decision Tree, TMVA, 2 b-tag Sample (hBDT_CO_validated_nice_nolum_log_tag2.gif)
Changed:
<
<
The figures below show the efficiency curves from the best TMVA and SPR classifiers, zoomed in on the interesting region in the lower left. When a cut on the classifier output is taken to maximize the significance in both cases, the resulting efficiencies, as shown in the table below, show TMVA to have a lower signal efficiency than SPR, which results in a classifier that does not perform as well. At this point, the SPR random forest classifier was focused on.

Program Signal Efficiency Background Efficiency
TMVA bdt, 1 b-tagged jet 0.0243 0.0031
SPR arcx4, 1 b-tagged jet 0.0272 0.0018
TMVA bdt, 2 b-tagged jets 0.0726 0.0218
SPR arcx4, 2 b-tagged jets 0.0754 0.0107
>
>
The figures below show the efficiency curves from the best TMVA and SPR classifiers, zoomed in on the region of interest in the lower left. When the cut on the classifier output is chosen to maximize the significance in each case, the resulting efficiencies show TMVA to have a lower signal efficiency than SPR, so the TMVA classifier does not perform as well. From this point on, the study focused on the SPR random forest classifier.
 

Figures: Efficiency, 1 b-tag Sample (EFF_tmva_spr_tag1_yield.gif); Efficiency, 2 b-tag Sample (EFF_tmva_spr.gif)
Line: 82 to 75
 

Multivariate Analysis with Systematics

To estimate the systematics, the jet energies in the events were scaled up and down by 10 percent, as in the cut-based analysis, and the b-tagging efficiency was also varied up and down, to determine the average effect of each of these systematics. The average difference from the unaltered background events was found and converted into a percentage of the background sum for each systematic. After a cut was made, the remaining background events were multiplied by these percentages to determine the error from the b-tagging and jet energy scale systematics. The luminosity error was taken as 20 percent of the total background events for the cross-section and Vtb calculations (it is not included in the significance). The background cross-section error was estimated at 12 or 14 percent after preselection cuts, given a 10 percent error for ttbar and 20 percent for W+Jets. ISR/FSR was taken at 10 percent, and an additional 5 percent was allocated for other errors. Additionally, the square root of the sum of the squares of the weights was calculated to determine the Monte Carlo statistical uncertainty. The error percentages on the background sum after a classifier output cut for SPR's arcx4 classifier are given in the table below.
Changed:
<
<
Errors 1 b-tagged jet percentage 2 b-tagged jet percentage
JES 18 7
B-tagging 9 13
MC Statistics 22 19
Background x-section 13 12
ISR/FSR 10 10
Other 5 5
>
>
| Errors | Variation | 1 b-tagged jet percentage | 2 b-tagged jet percentage |
| Data Statistics | | 26 | 36 |
| JES | 10 | 17 | 5 |
| B-tagging | 5 | 9 | 13 |
| MC Statistics | | 17 | 30 |
| Background x-section | 10, 20 | 15 | 13 |
| ISR/FSR | 10 | 10 | 10 |
| Other | 5 | 5 | 5 |
| Total Uncertainty | | 32 | 37 |
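As a rough cross-check of the table, the sketch below computes the Monte Carlo statistical uncertainty as the square root of the sum of squared weights, and combines the listed percentages in quadrature. It is an assumption, not stated in the text, that the total is a quadrature sum of every source except data statistics; the quoted totals of 32 and 37 percent are numerically consistent with that reading.

```python
import math


def mc_stat_uncertainty(weights):
    """Square root of the sum of squared event weights."""
    return math.sqrt(sum(w * w for w in weights))


def quadrature_sum(percentages):
    """Combine independent percentage uncertainties in quadrature."""
    return math.sqrt(sum(p * p for p in percentages))


# Percentages copied from the table above (data statistics left out by assumption).
systematics_1tag = {"JES": 17, "B-tagging": 9, "MC Statistics": 17,
                    "Background x-section": 15, "ISR/FSR": 10, "Other": 5}
systematics_2tag = {"JES": 5, "B-tagging": 13, "MC Statistics": 30,
                    "Background x-section": 13, "ISR/FSR": 10, "Other": 5}

total_1tag = quadrature_sum(systematics_1tag.values())  # about 31.8
total_2tag = quadrature_sum(systematics_2tag.values())  # about 37.3
```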
  The significance was calculated in two ways, a sideband method (cross-check) and a Gaussian-pdf method (primary), both of which involve a frequentist calculation of the significance (“Evaluation of three methods for calculating statistical significance when incorporating a systematic uncertainty into a test of the background-only hypothesis for a Poisson process”, by R.D. Cousins, J.T. Linnemann, and J. Tucker, 23 Aug. 2008). The systematics discussed above are used in the Gaussian-pdf method, which assumes a Gaussian pdf for the background mean. The sideband method requires a leftward cut on the classifier output to isolate a background region with little signal. For this cut, it was required that S/B in the sideband remain below 5 percent and that enough background events remain for comparison. In both cases, the calculation was done with the code in Appendix E of that paper.
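To illustrate the idea behind the Gaussian-pdf method, here is a minimal sketch that averages the Poisson tail probability over a Gaussian for the background mean (truncated at zero) and converts the resulting p-value to a number of sigma. It is not the Appendix E code used in this study, the integer observed count and the numerical integration settings are assumptions, and it is not expected to reproduce the significances quoted on this page.

```python
import math
from statistics import NormalDist


def poisson_tail(n_obs, b):
    """P(n >= n_obs) for a Poisson mean b; intended for moderate counts."""
    if n_obs <= 0:
        return 1.0
    if b <= 0.0:
        return 0.0
    # Start from P(n = n_obs) and add higher terms until they are negligible.
    term = math.exp(-b + n_obs * math.log(b) - math.lgamma(n_obs + 1))
    total = 0.0
    k = n_obs
    while True:
        total += term
        k += 1
        term *= b / k
        if term <= 1e-16 * total and k > b:
            break
    return min(total, 1.0)


def z_gaussian_background(n_obs, b0, sigma_b, steps=1000):
    """Significance with the background mean smeared by a Gaussian(b0, sigma_b)."""
    lo = max(0.0, b0 - 6.0 * sigma_b)
    hi = b0 + 6.0 * sigma_b
    width = (hi - lo) / steps
    gauss = NormalDist(b0, sigma_b)
    num = 0.0
    den = 0.0
    for i in range(steps):
        b = lo + (i + 0.5) * width  # midpoint rule
        weight = gauss.pdf(b)
        num += weight * poisson_tail(n_obs, b)
        den += weight
    p_value = num / den
    return -NormalDist().inv_cdf(p_value)


# Illustrative inputs only: 50 observed events on a background of 15 +/- 4.8.
z_with_systematics = z_gaussian_background(50, 15.0, 4.8)
z_almost_no_systematics = z_gaussian_background(50, 15.0, 0.01)
```

In this sketch, a larger background uncertainty always lowers the significance for the same counts, which is the qualitative behaviour the systematics table feeds into.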

Statistics also had to be considered. The significance estimation requires a Gaussian shape to the background mean and without enough statistics the results can be misleading. To allow comfortably for a Gaussian shape, a requirement of at least 18 unweighted background events was implemented. The chosen classifier in SPR has a significance curve that peaks well before this cut becomes an issue, although this cut played some role in the overall choice of classifier parameters.
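The interplay between the significance optimization and the requirement of at least 18 unweighted background events can be sketched as a scan over classifier-output cuts. The sketch uses S/sqrt(B) as a simple stand-in for the significance calculation used in the study, and the event lists are toy values invented for illustration.

```python
import math


def best_cut(signal, background, thresholds, min_raw_background=18):
    """Return (cut, S/sqrt(B), raw background count) for the best allowed cut.

    signal and background are lists of (classifier_output, weight) pairs.
    Cuts leaving fewer than min_raw_background unweighted background
    events are skipped.
    """
    best = None
    for cut in thresholds:
        s = sum(w for out, w in signal if out > cut)
        passing = [w for out, w in background if out > cut]
        if len(passing) < min_raw_background:
            continue  # too few raw background events for a Gaussian shape
        b = sum(passing)
        if b <= 0.0:
            continue
        figure = s / math.sqrt(b)
        if best is None or figure > best[1]:
            best = (cut, figure, len(passing))
    return best


# Toy samples (hypothetical): background spread uniformly, signal at high output.
toy_background = [(i / 40.0, 0.5) for i in range(40)]
toy_signal = [(0.75 + i / 80.0, 1.0) for i in range(20)]
cuts = [0.11, 0.31, 0.51, 0.71]

constrained = best_cut(toy_signal, toy_background, cuts)
unconstrained = best_cut(toy_signal, toy_background, cuts, min_raw_background=0)
```

In this toy case the unconstrained optimum sits at the tightest cut, while the 18-event requirement pushes the chosen cut to a looser value with a lower figure of merit.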

Changed:
<
<
Running the trained random forest classifier over the final sample, the yield sample, revealed significance values that were 3.8 sigma for the 1 b-tagged selection and about 2.6 sigma for the 2 b-tagged jet selection, with significances from the sideband estimation being only slightly lower, as can be seen in the table below. The table also shows the expected signal and background obtained while estimating the Gaussian pdf-based significance. If the weights are adjusted to give 1fb-1 luminosity, then we expect 3 sigma significance for the 2 b-tagged sample and 4.7 sigma for the 1 b-tagged sample, even with the limitations of this study. Luminosity plots are shown below.
>
>
Running the trained random forest classifier over the final (yield) sample gave significance values of 3.7 sigma for the 1 b-tagged jet selection and about 3.2 sigma for the 2 b-tagged jet selection, with the significances from the sideband cross-check listed alongside them in the table below. The table also shows the expected signal and background obtained while estimating the Gaussian pdf-based significance. If the weights are adjusted to give 1 fb^-1 of luminosity, then we expect 4.1 sigma significance for the 2 b-tagged sample and 4.4 sigma for the 1 b-tagged sample, even with the limitations of this study. Luminosity plots are shown below.
 

Properties 1 b-Tagged Jet, SPR RF 2 b-Tagged Jets, SPR RF
Changed:
<
<
Gaussian pdf Significance 3.8 2.7
Sideband Significance 3.6 2.3
Expected Signal 25.1 25.3
Expected Background 8.5 +/- 3.0 18.4 +/- 5.4
>
>
Gaussian pdf Significance 3.7 3.2
Sideband Significance 4.3 3.2
Expected Signal 35.0 19.7
Expected Background 15.0 +/- 4.8 7.5 +/- 2.8
 

Figures: Luminosity, 1 b-tag Sample (hBagger1_b-Tagged_Jet_LUM_sys_validation.gif); Luminosity, 2 b-tag Sample (hBagger2_b-Tagged_Jets_LUM_sys_validation.gif)

Although no actual data are considered in this analysis, it is still interesting to consider the cross-section and the uncertainties on it based on the simulated events. This calculation included the systematic uncertainties listed earlier. Here, we also include a statistical uncertainty and a separate luminosity uncertainty at 20%. Systematic uncertainties on the cross-section are listed below:
Changed:
<
<
Errors 1 b-tagged jet percentage 2 b-tagged jet percentage
>
>
Errors Variation 1 b-tagged jet percentage 2 b-tagged jet percentage
 
Changed:
<
<
JES 15 14
B-tagging 5 22
MC Statistics 27 33
Background x-section 4 9
ISR/FSR 14 17
Other 7 9
Total 35 48
For the 1 b-tagged jet selection, the cross-section is 323 + 111(sys) - 116(sys) +/- 75(stat) +/- 65(lum) pb. For the 2 b-tagged jet selection, the cross-section is 323 + 146(sys) - 161(sys) +/- 84(stat) +/- 65(lum) pb. It is clear in both cases that the uncertainties are quite large, which one might expect in an early data situation. Notice that the systematic uncertainties are larger than the statistical uncertainties. This indicates that the dominant problem, at this point, for 100pb-1 of integrated luminosity is not the number of data events we expect, but the systematic errors, which we can try to reduce by improving the analysis.
>
>
JES 10 14 5
B-tagging 5 5 18
MC Statistics   23 34
Background x-section 10, 20 7 5
ISR/FSR 10 14 14
Other 5 7 7
Total Systematic Uncertainty   33 42
For the 1 b-tagged jet selection, the cross-section is 323 + 105(sys) - 105(sys) +/- 65(stat) +/- 65(lum) pb. For the 2 b-tagged jet selection, the cross-section is 323 + 135(sys) - 138(sys) +/- 85(stat) +/- 65(lum) pb. It is clear in both cases that the uncertainties are quite large, which one might expect in an early data situation. Notice that the systematic uncertainties are larger than the statistical uncertainties. This indicates that the dominant problem, at this point, for 100pb-1 of integrated luminosity is not the number of data events we expect, but the systematic errors, which we can try to reduce by improving the analysis.
 

Once the cross-section is obtained, the |Vtb| value from the quark mixing matrix can also be calculated. In this case the signal theoretical cross-section is also included in the systematic uncertainties, with a value of 5%, although it has a negligible impact on the overall numbers. Expressing |Vtb, exp| as

|Vtb, exp| = sqrt{xsec_exp / xsec_sm} |Vtb, sm|
Changed:
<
<
and taking |Vtb, sm| to be 1, since the ratios of the cross-sections in this case is also 1, the uncertainties may be simply propagated to give the |Vtb, exp| value. For the 1 b-tagged jet selection, |Vtb, exp| is 1.0 + 0.17(sys) - 0.18(sys) +/- 0.12(stat) +/- 0.10(lum), and for the 2 b-tagged jet selection, |Vtb, exp| is 1.0 + 0.23(sys) - 0.25(sys) +/- 0.13(stat) +/- 0.10(lum).
>
>
and taking |Vtb, sm| to be 1, with the ratio of the cross-sections also equal to 1 in this case, the uncertainties may be simply propagated to give the |Vtb, exp| value. For the 1 b-tagged jet selection, |Vtb, exp| is 1.0 + 0.16(sys) - 0.16(sys) +/- 0.10(stat) +/- 0.10(lum), and for the 2 b-tagged jet selection, |Vtb, exp| is 1.0 + 0.21(sys) - 0.21(sys) +/- 0.13(stat) +/- 0.10(lum).
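The propagation can be sketched to first order: because |Vtb, exp| depends on the square root of the cross-section ratio, its relative uncertainty is half the relative uncertainty on the cross-section. The function names below are illustrative, and the sketch leaves out the 5% theoretical cross-section uncertainty mentioned above, which is described as having a negligible effect.

```python
import math


def vtb_from_cross_section(xsec_exp, xsec_sm, vtb_sm=1.0):
    """|Vtb, exp| = sqrt(xsec_exp / xsec_sm) * |Vtb, sm|."""
    return math.sqrt(xsec_exp / xsec_sm) * vtb_sm


def vtb_uncertainty(vtb, xsec_exp, xsec_uncertainty):
    """First-order propagation: dV / V = 0.5 * d(xsec) / xsec."""
    return 0.5 * vtb * xsec_uncertainty / xsec_exp


# 1 b-tagged jet selection: 323 pb with 105 (sys), 65 (stat), 65 (lum) pb.
vtb = vtb_from_cross_section(323.0, 323.0)
xsec_errors_1tag = {"sys": 105.0, "stat": 65.0, "lum": 65.0}
vtb_errors_1tag = {name: vtb_uncertainty(vtb, 323.0, err)
                   for name, err in xsec_errors_1tag.items()}
```

With these inputs the three components come out near 0.16, 0.10 and 0.10, matching the values quoted for the 1 b-tagged jet selection.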
 

Both selections show large uncertainties, which is to be expected in early data. It should also be noted that there is room for improvement in future analyses. All of the variables listed on this page were used, and additional variables may improve the result. Additionally, there could be some improvement from combining the two selections, which is beyond the scope of this study.

Additional Studies

Effect of Statistics on Significance

Changed:
<
<
Even with a combined signal analysis, the raw counts may be somewhat low for a 33 variable space. Thus, statistics were considered to see what sort of improvement, if any, increased statistics would give to the results. Several training subsets were made, each with a percentage of the total training set. Each of these subsets contained at least eight samples and of these, the sample with the highest significance was chosen. The figure below on the left shows the significance versus the percentage of the total training set used for training the classifier. The set with a 2 b-tagged jet selection has a significance that is generally flat, so more statistics may not dramatically improve the current result. The 1 b-tagged selection, however, has an oddly shaped line, which is most likely not just related to the size of the training sample.
>
>
Even with a combined signal analysis, the raw counts may be somewhat low for a 33 variable space. Thus, statistics were considered to see what sort of improvement, if any, increased statistics would give to the results. Several training subsets were made, each with a percentage of the total training set. Each of these subsets contained at least eight samples and, of these, the sample with the highest significance was chosen. The figure below on the left shows S/sqrt(B) versus the percentage of the total training set used for training the classifier. The 2 b-tagged set seems to level off somewhat with more training events, but the 1 b-tagged sample is still increasing. Thus, more events in the training sample would improve the significance, particularly in the 1 b-tagged set. Additionally, if the proportions of events are kept the same, the validation set would also grow in number of events, and the MC statistical uncertainty should decrease as well.
 

Figures: Significance vs Percentage of Training Sample (Sig_percent_trial_r5_1100_nice.gif); Effect of Unweighted Background Cut (Sig_cut_percent_trial_all_r5_1100_nice.gif)
Deleted:
<
<
There is another parameter that is not considered in the figure, the cut on the number of unweighted background events, a cut that is directly related to the statistics of the sample. How this cut is made can be seen in the figure below for the 1 b-tagged jet sample using the SPR random forest classifier. Here, both the significance and the sum of the background events, integrated from the right, is plotted against classifier output. A cut at 0.51 has the maximum significance and 21 unweighted background events remaining. However, if the significance were to peak after the cut requiring 18 unweighted background events, the significance quoted may not be the highest possible. If there were more statistics, the unweighted background cut would move to the right and perhaps the significance curve would peak before the cut, at a higher value.

The figure above on the right shows the percentage of cases for each trial size where the significance peak is to the left of the cut on the unweighted background events and the maximum potential significance may not be accessible to this study due to low statistics. The odd shape in the first plot is likely related to the increasing impact of the cut on the number of unweighted background events as the percentage of the training sample used increases. This, incidentally, has little effect on the 2 b-tagged selection, which has none of its training sample for any subset except the smallest affected by this cut, as can be seen the figure. Increased statistics would help to reduce this effect in the 1 b-tagged jet sample, as the cut is directly related to statistics, and could possibly improve the resulting significances in this sample.

Figures: Significance and Unweighted Background Cut (RF_tag1_SBR_SB_sys_validation.gif); Zoomed (RF_tag1_SBR_SB_sys_validation_zoom.gif)
 

Variation in Significances of Yield and Validation Samples

In doing this study, it was found that the significances from the validation and yield samples were somewhat different, and it was desired to repeat these calculations many times and see if this sort of result was chance or a trend, for the benefit of future studies. For the 1 and 2 b-tagged jet samples, eight randomized training sets were produced, where the ones that gave the highest validation significances were used in the main study of this paper. All of these sets were then used to make trained classifiers for both the TMVA boosted decision tree and the SPR random forest. Each of these classifiers was applied to the validation and yield samples, and the significances, using the Gaussian pdf method, were noted. These significances are shown in figures below on the left for the 1 b-tagged jet selection and on the right for the 2 b-tagged jet selection. In general the classifiers seem to have a range of significances for both the validation and yield samples. This variation in significance values may help to explain slight fluctuations in the 2 b-tagged sample curve in the previous section that could not be explained by the impact of the unweighted background count cut. Additionally, there is a line plotted in the figures below with slope 1. For both classifiers and b-tagging selections, most points are below this line, indicating that the validation significances are generally higher than the yield significances. The selection of the classifier parameters is determined to optimize the validation significance, so these plots may indicate that these parameters are too finely tuned to the validation sample. In general, it seems the selection of a classifier that gives a high validation significance will not necessarily give a high yield significance. It may be more useful for future studies to combine these classifiers, rather than choosing one classifier.

Figures: Validation vs Yield Significances, 1 b-tagged jet (Sig_val_yield.gif); Validation vs Yield Significances, 2 b-tagged jets (Sig_val_yield_new.gif)
Added:
>
>
Additional studies have also been done looking at the MC statistical uncertainty issue. Specifically, the question was: if more events are generated, which event types would improve the statistical uncertainty the most? By removing the MC statistical uncertainty for a particular event type from the overall uncertainty and comparing, it was found that not having to weight the t-channel MC would reduce the MC statistical uncertainty by 66%. This was the biggest improvement, and not a surprise, as only 50,000 t-channel events were generated, despite its large cross-section, and of those, many were removed because they were duplicated. Not having to weight the backgrounds improved the MC statistical uncertainty by about 30%. In the 1 b-tagged jet case, it was slightly preferable to increase the number of W+jets events rather than ttbar. For the 2 b-tagged jets case, improvement in the ttbar was much preferred over improvements in W+Jets. Overall, however, this analysis would see the most benefit from the generation of more t-channel events.
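One way to read this comparison is sketched below, under the assumption that each event type contributes the sum of its squared weights to the total variance and that "removing" a type means dropping its contribution. The sample names mirror the text, but the weights are toy values and do not reproduce the 66% or 30% figures.

```python
import math


def sum_w2(weights):
    """Sum of squared weights: the variance contribution of one sample."""
    return sum(w * w for w in weights)


def reduction_without(samples, name):
    """Fractional drop in the total MC statistical uncertainty when the
    sample called `name` contributes no uncertainty."""
    total_var = sum(sum_w2(w) for w in samples.values())
    reduced_var = total_var - sum_w2(samples[name])
    return 1.0 - math.sqrt(reduced_var / total_var)


# Toy weights (hypothetical): few heavily weighted t-channel events,
# more lightly weighted background events.
toy_samples = {
    "t-channel": [3.0] * 4,
    "Wjets": [1.0] * 9,
    "ttbar": [0.5] * 16,
}
reductions = {name: reduction_without(toy_samples, name) for name in toy_samples}
```

A sample with few events and large weights dominates the sum of squared weights, which is the effect the text attributes to the t-channel sample.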
  -- JennyHolzbauer - 06 Jan 2009 -- JennyHolzbauer - 01 Apr 2009
Added:
>
>
-- JennyHolzbauer - 09 Jun 2009
 
Revision 4
17 Apr 2009 - Main.JennyHolzbauer
Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Significance Estimations using Multivariate Analysis of Single Top Monte Carlo

Line: 52 to 52
 
  • Invariant mass of all jets
  • Centrality(all jets, lepton),Centrality(jet1, jet2)
Added:
>
>
When considering the variables used in this analysis, the significance was estimated based on the validation sample several times while varying the number of variables used, to see if the dimensionality could be reduced without a loss of significance. However, there appears to be a linear correlation between significance and dimensionality, as can be seen in the figure below. This indicates that more variables may be useful in future studies and that, for this study, all 33 variables should be used to optimize the significance.

Significance vs Dimensionality, 1 b-tag Sample
Sig_number_trial_33.gif
 

Multivariate Analysis Programming Packages

Two analysis packages were initially considered, TMVA and SPR. In both cases, the general procedure was the same. The Monte Carlo data were split into three sets, to be used for the trial, validation, and yield phases of analysis. Because the SPR program is unable to handle negatively weighted events in the training cycle, these events were removed from the training set (although not from the validation or yield sets). Additionally, the events in the training sets were randomized to eliminate any order-dependency from merging the trees from files of different event types. The randomized sets were used to train the classifier, and the resulting significances were noted. For the purposes of this paper, the randomized data set that produced the best trained classifiers was used for validation purposes.
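The sample preparation described above can be sketched as follows. This is only an illustration, not the code used in the study: the event layout (a list of dicts with a "weight" key), the equal three-way split, and the function name `prepare_samples` are assumptions made here.

```python
import random

def prepare_samples(events, seed=12345):
    """Shuffle events and split them into training, validation, and yield sets.

    `events` is a list of dicts with a "weight" key (hypothetical layout).
    The equal three-way split is an assumption, not taken from the study.
    """
    rng = random.Random(seed)
    shuffled = list(events)
    # Randomize the order so that merging files of different event types
    # does not leave any order dependency in the training set.
    rng.shuffle(shuffled)

    n_train = len(shuffled) // 3
    n_valid = len(shuffled) // 3

    # Negatively weighted events are dropped from the training set only.
    train = [e for e in shuffled[:n_train] if e["weight"] >= 0.0]
    valid = shuffled[n_train:n_train + n_valid]
    yield_set = shuffled[n_train + n_valid:]
    return train, valid, yield_set
```

Changing `seed` gives a different randomization of the sets, so several seeds can be tried and the resulting classifiers compared.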
Line: 76 to 81
 

Multivariate Analysis with Systematics

Changed:
<
<
To estimate the systematics, the jet energies in the events were scaled up and down by 10 percent, as they were in the cut-based analysis, and the b-tagging efficiency was also altered up and down to determine the average effect of each of these systematic. The average difference from the unaltered background events was found, and this was then turned into a percentage of the background for each systematic. After a cut was made, the remaining background were multiplied by these percentages to determine the error from b-tagging and jet energy scale systematics. Additionally, the square root of the sum of the squares of the weights was calculated to determine the Monte Carlo statistical uncertainty. Additionally, the luminosity error was taken at 20 percent of the total background events. Background cross-section error was estimated at 14 percent after preselection cuts, given a 10 percent error for ttbar and 20 percent for W+Jets. ISR/FSR was taken at 10 percent and an additional 5 percent was allocated for other errors. The error percentages after a classifier output cut for SPR's arcx4 classifier are given in the table below.
>
>
To estimate the systematics, the jet energies in the events were scaled up and down by 10 percent, as they were in the cut-based analysis, and the b-tagging efficiency was also varied up and down to determine the average effect of each of these systematics. The average difference from the unaltered background events was found, and this was then expressed as a percentage of the background sum for each systematic. After a cut was made, the remaining background events were multiplied by these percentages to determine the error from the b-tagging and jet energy scale systematics. The luminosity error was taken as 20 percent of the total background events for the cross-section and Vtb calculations (it is not included in the significance). The background cross-section error was estimated at 12 or 14 percent after preselection cuts, given a 10 percent error for ttbar and 20 percent for W+Jets. ISR/FSR was taken at 10 percent and an additional 5 percent was allocated for other errors. Finally, the square root of the sum of the squares of the weights was calculated to determine the Monte Carlo statistical uncertainty. The error percentages on the background sum after a classifier output cut for SPR's arcx4 classifier are given in the table below.
 
Errors 1 b-tagged jet percentage 2 b-tagged jet percentage
JES 18 7
B-tagging 9 13
MC Statistics 22 19
Changed:
<
<
Luminosity 20 20
Background x-section 14 14
>
>
Background x-section 13 12
 
ISR/FSR 10 10
Other 5 5
The significance was calculated in two ways, a sideband method (cross-check) and a Gaussian-pdf method (primary), both of which use a frequentist approach to estimate the significance (“Evaluation of three methods for calculating statistical significance when incorporating a systematic uncertainty into a test of the background-only hypothesis for a Poisson process”, by R.D. Cousins, J.T. Linnemann, and J. Tucker, 23 Aug. 2008). The systematics discussed above are used in the Gaussian-pdf method calculations, which assume a Gaussian pdf for the background mean. The sideband method requires a leftward cut to isolate a background region in which the amount of background can be determined without signal, and will be discussed later. For this cut, it was required that S/B remain below 5 percent and that a significant number of background events remain for comparison. In both cases, the calculation was done with the code in Appendix E of that paper.

Statistics also had to be considered. The significance estimation requires a Gaussian shape to the background mean and without enough statistics the results can be misleading. To allow comfortably for a Gaussian shape, a requirement of at least 18 unweighted background events was implemented. The chosen classifier in SPR has a significance curve that peaks well before this cut becomes an issue, although this cut played some role in the overall choice of classifier parameters.
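The interplay between the classifier output cut and the minimum-statistics requirement can be sketched as a simple cut scan. This is a hedged illustration: the figure of merit s/sqrt(b) is used here only as a stand-in for the significance calculation from the paper cited above, and the function name and the (output, weight) event layout are hypothetical.

```python
import math

def best_cut(signal, background, cuts, min_bkg_events=18):
    """Scan classifier-output cuts and return the best (cut, fom, s, b).

    `signal` and `background` are lists of (classifier_output, weight).
    Cuts that leave fewer than `min_bkg_events` unweighted background
    events are skipped. The figure of merit s/sqrt(b) is a stand-in,
    not the Gaussian-pdf significance used in the study.
    """
    best = None
    for cut in cuts:
        s = sum(w for out, w in signal if out > cut)
        bkg_weights = [w for out, w in background if out > cut]
        b = sum(bkg_weights)
        # Require enough unweighted background events to remain.
        if len(bkg_weights) < min_bkg_events or b <= 0.0:
            continue
        fom = s / math.sqrt(b)
        if best is None or fom > best[1]:
            best = (cut, fom, s, b)
    return best
```

The function returns `None` when no cut satisfies the requirement, which makes it easy to see when the statistics limit, rather than the figure of merit, determines the choice.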
Deleted:
<
<
When considering the variables used in this analysis, the significance was estimated based on the validation sample several times while varying the number of variables used, to see if the dimensionality could be reduced without a loss of significance. However, there appears to be a linear correlation between significance and dimensionality, as can be seen in the figure below. This indicates that more variables may be useful in future studies and that, for this study, all 33 variables should be used to optimize the significance.
 
Deleted:
<
<
Significance vs Dimensionality, 1 b-tag Sample
Sig_number_trial_33.gif
 

Running the trained random forest classifier over the final sample, the yield sample, revealed significance values that were 3.8 sigma for the 1 b-tagged selection and about 2.6 sigma for the 2 b-tagged jet selection, with significances from the sideband estimation being only slightly lower, as can be seen in the table below. The table also shows the expected signal and background obtained while estimating the Gaussian pdf-based significance. If the weights are adjusted to give 1fb-1 luminosity, then we expect 3 sigma significance for the 2 b-tagged sample and 4.7 sigma for the 1 b-tagged sample, even with the limitations of this study. Luminosity plots are shown below.

Properties 1 b-Tagged Jet, SPR RF 2 b-Tagged Jets, SPR RF
Changed:
<
<
Gaussian pdf Significance 3.8 2.6
>
>
Gaussian pdf Significance 3.8 2.7
 
Sideband Significance 3.6 2.3
Expected Signal 25.1 25.3
Changed:
<
<
Expected Background 8.5 +/- 3.0 18.4 +/- 5.5
>
>
Expected Background 8.5 +/- 3.0 18.4 +/- 5.4
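
The quoted background uncertainties are close to what one gets by adding the background error percentages from the systematics table in quadrature. The check below assumes the sources are uncorrelated and that luminosity is left out, as stated for the significance; the text does not spell out the combination, so this is an assumption, and the helper name is hypothetical.

```python
import math

def total_percent(percentages):
    """Combine percentage uncertainties in quadrature (assumes no correlations)."""
    return math.sqrt(sum(p * p for p in percentages))

# Percentages on the background sum, taken from the systematics table.
one_tag = {"JES": 18, "B-tagging": 9, "MC Statistics": 22,
           "Background x-section": 13, "ISR/FSR": 10, "Other": 5}
two_tag = {"JES": 7, "B-tagging": 13, "MC Statistics": 19,
           "Background x-section": 12, "ISR/FSR": 10, "Other": 5}

for label, table, expected_bkg in (("1 b-tag", one_tag, 8.5),
                                   ("2 b-tag", two_tag, 18.4)):
    total = total_percent(table.values())
    error = expected_bkg * total / 100.0
    print(f"{label}: total {total:.1f}%, background {expected_bkg} +/- {error:.1f}")
```

This gives about 34.4 percent (roughly +/- 2.9 events) for the 1 b-tagged selection and about 29.1 percent (+/- 5.4 events) for the 2 b-tagged selection. The small difference from the quoted 3.0 could come from rounding of the table entries.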
 

Luminosity, 1 b-tag Sample Luminosity, 2 b-tag Sample
hBagger1_b-Tagged_Jet_LUM_sys_validation.gif hBagger2_b-Tagged_Jets_LUM_sys_validation.gif
Changed:
<
<
Although no actual data are considered in this analysis, it is still interesting to consider the cross-section and the uncertainties on it based on the simulated events. This calculation included the systematic uncertainties listed earlier. Here, we also include a statistical uncertainty and a separate luminosity uncertainty at 20%. For the 1 b-tagged jet selection, the cross-section is 323 + 112(sys) - 116(sys) +/- 75(stat) +/- 65(lum) pb. For the 2 b-tagged jet selection, the cross-section is 323 + 147(sys) - 162(sys) +/- 84(stat) +/- 65(lum) pb. It is clear in both cases that the uncertainties are quite large, which one might expect in an early data situation.
>
>
Although no actual data are considered in this analysis, it is still interesting to consider the cross-section and the uncertainties on it based on the simulated events. This calculation included the systematic uncertainties listed earlier. Here, we also include a statistical uncertainty and a separate luminosity uncertainty at 20%. Systematic uncertainties on the cross-section are listed below:
Errors 1 b-tagged jet percentage 2 b-tagged jet percentage
JES 15 14
B-tagging 5 22
MC Statistics 27 33
Background x-section 4 9
ISR/FSR 14 17
Other 7 9
Total 35 48
For the 1 b-tagged jet selection, the cross-section is 323 + 111(sys) - 116(sys) +/- 75(stat) +/- 65(lum) pb. For the 2 b-tagged jet selection, the cross-section is 323 + 146(sys) - 161(sys) +/- 84(stat) +/- 65(lum) pb. It is clear in both cases that the uncertainties are quite large, which one might expect in an early data situation. Notice that the systematic uncertainties are larger than the statistical uncertainties. This indicates that the dominant problem, at this point, for 100pb-1 of integrated luminosity is not the number of data events we expect, but the systematic errors, which we can try to reduce by improving the analysis.
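
As a rough consistency check (worked here from the numbers quoted above, not taken from the original analysis), the absolute uncertainties can be turned into relative ones by dividing by the 323 pb central value:

```latex
\begin{align*}
\text{1 b-tag:}\quad
  \frac{111}{323} &\approx 34\%, &
  \frac{116}{323} &\approx 36\% \;(\text{sys}), &
  \frac{75}{323} &\approx 23\% \;(\text{stat}), &
  \frac{65}{323} &\approx 20\% \;(\text{lum}) \\
\text{2 b-tag:}\quad
  \frac{146}{323} &\approx 45\%, &
  \frac{161}{323} &\approx 50\% \;(\text{sys}), &
  \frac{84}{323} &\approx 26\% \;(\text{stat}), &
  \frac{65}{323} &\approx 20\% \;(\text{lum})
\end{align*}
```

The systematic values bracket the "Total" entries of 35 and 48 percent in the table, the luminosity term reproduces the stated 20 percent, and in both selections the statistical term is clearly the smaller of the two.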
 

Once the cross-section is obtained, the |Vtb| value from the quark mixing matrix can also be calculated. In this case the signal theoretical cross-section is also included in the systematic uncertainties, with a value of 5%, although it has a negligible impact on the overall numbers. Expressing |Vtb, exp| as
Line: 141 to 151
  -- JennyHolzbauer - 06 Jan 2009 -- JennyHolzbauer - 01 Apr 2009
Changed:
<
<
META FILEATTACHMENT attachment="hBagger1_b-Tagged_Jet_CO_validation.gif" attr="" comment="RFbtag1CO" date="1238599227" name="hBagger1_b-Tagged_Jet_CO_validation.gif" path="hBagger1 b-Tagged Jet_CO_validation.gif" size="12025" stream="hBagger1 b-Tagged Jet_CO_validation.gif" tmpFilename="/usr/tmp/CGItemp65454" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="hBagger1_b-Tagged_Jet_LUM_sys_validation.gif" attr="" comment="RFbtag1LUM" date="1238599270" name="hBagger1_b-Tagged_Jet_LUM_sys_validation.gif" path="hBagger1 b-Tagged Jet_LUM_sys_validation.gif" size="6738" stream="hBagger1 b-Tagged Jet_LUM_sys_validation.gif" tmpFilename="/usr/tmp/CGItemp65296" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="hBagger2_b-Tagged_Jets_CO_validation.gif" attr="" comment="RFbtag2CO" date="1238599312" name="hBagger2_b-Tagged_Jets_CO_validation.gif" path="hBagger2 b-Tagged Jets_CO_validation.gif" size="12276" stream="hBagger2 b-Tagged Jets_CO_validation.gif" tmpFilename="/usr/tmp/CGItemp65453" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="hBagger2_b-Tagged_Jets_LUM_sys_validation.gif" attr="" comment="RFbtag2LUM" date="1238599342" name="hBagger2_b-Tagged_Jets_LUM_sys_validation.gif" path="hBagger2 b-Tagged Jets_LUM_sys_validation.gif" size="6549" stream="hBagger2 b-Tagged Jets_LUM_sys_validation.gif" tmpFilename="/usr/tmp/CGItemp65384" user="JennyHolzbauer" version="1"
>
>
META FILEATTACHMENT attachment="hBagger1_b-Tagged_Jet_CO_validation.gif" attr="" comment="RFbtag1CO" date="1240000105" name="hBagger1_b-Tagged_Jet_CO_validation.gif" path="hBagger1 b-Tagged Jet_CO_validation.gif" size="12017" stream="hBagger1 b-Tagged Jet_CO_validation.gif" tmpFilename="/usr/tmp/CGItemp16017" user="JennyHolzbauer" version="2"
META FILEATTACHMENT attachment="hBagger1_b-Tagged_Jet_LUM_sys_validation.gif" attr="" comment="RFbtag1LUM" date="1240000132" name="hBagger1_b-Tagged_Jet_LUM_sys_validation.gif" path="hBagger1 b-Tagged Jet_LUM_sys_validation.gif" size="6332" stream="hBagger1 b-Tagged Jet_LUM_sys_validation.gif" tmpFilename="/usr/tmp/CGItemp16490" user="JennyHolzbauer" version="2"
META FILEATTACHMENT attachment="hBagger2_b-Tagged_Jets_CO_validation.gif" attr="" comment="RFbtag2CO" date="1240000257" name="hBagger2_b-Tagged_Jets_CO_validation.gif" path="hBagger2 b-Tagged Jets_CO_validation.gif" size="12296" stream="hBagger2 b-Tagged Jets_CO_validation.gif" tmpFilename="/usr/tmp/CGItemp16125" user="JennyHolzbauer" version="2"
META FILEATTACHMENT attachment="hBagger2_b-Tagged_Jets_LUM_sys_validation.gif" attr="" comment="RFbtag2LUM" date="1240000277" name="hBagger2_b-Tagged_Jets_LUM_sys_validation.gif" path="hBagger2 b-Tagged Jets_LUM_sys_validation.gif" size="6560" stream="hBagger2 b-Tagged Jets_LUM_sys_validation.gif" tmpFilename="/usr/tmp/CGItemp16612" user="JennyHolzbauer" version="2"
 
META FILEATTACHMENT attachment="EFF_tmva_spr.gif" attr="" comment="EFF_tmva_spr_btag2" date="1238599406" name="EFF_tmva_spr.gif" path="EFF_tmva_spr.gif" size="11194" stream="EFF_tmva_spr.gif" tmpFilename="/usr/tmp/CGItemp65451" user="JennyHolzbauer" version="1"
Changed:
<
<
META FILEATTACHMENT attachment="Sig_val_yield_new.gif" attr="" comment="Sig_val_yield_btag2" date="1238599450" name="Sig_val_yield_new.gif" path="Sig_val_yield_new.gif" size="13810" stream="Sig_val_yield_new.gif" tmpFilename="/usr/tmp/CGItemp65419" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="EFF_tmva_spr_tag1_yield.gif" attr="" comment="EFF_tmva_spr_btag1" date="1238599486" name="EFF_tmva_spr_tag1_yield.gif" path="EFF_tmva_spr_tag1_yield.gif" size="10408" stream="EFF_tmva_spr_tag1_yield.gif" tmpFilename="/usr/tmp/CGItemp65420" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="Sig_val_yield.gif" attr="" comment="Sig_val_yield_btag1" date="1238599532" name="Sig_val_yield.gif" path="Sig_val_yield.gif" size="17309" stream="Sig_val_yield.gif" tmpFilename="/usr/tmp/CGItemp65391" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="Sig_percent_trial_r5_1100_nice.gif" attr="" comment="Sig_percent_trial" date="1238599574" name="Sig_percent_trial_r5_1100_nice.gif" path="Sig_percent_trial_r5_1100_nice.gif" size="8459" stream="Sig_percent_trial_r5_1100_nice.gif" tmpFilename="/usr/tmp/CGItemp65392" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="Sig_cut_percent_trial_all_r5_1100_nice.gif" attr="" comment="Sig_cut_percent_trial" date="1238599606" name="Sig_cut_percent_trial_all_r5_1100_nice.gif" path="Sig_cut_percent_trial_all_r5_1100_nice.gif" size="9244" stream="Sig_cut_percent_trial_all_r5_1100_nice.gif" tmpFilename="/usr/tmp/CGItemp65424" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="hBDT_CO_validated_nice_nolum_log.gif" attr="" comment="BDTbtag1CO" date="1238599668" name="hBDT_CO_validated_nice_nolum_log.gif" path="hBDT_CO_validated_nice_nolum_log.gif" size="10357" stream="hBDT_CO_validated_nice_nolum_log.gif" tmpFilename="/usr/tmp/CGItemp65483" user="JennyHolzbauer" version="2"
META FILEATTACHMENT attachment="hBDT_CO_validated_nice_nolum_log_tag2.gif" attr="" comment="BDTbtag2CO" date="1238600037" name="hBDT_CO_validated_nice_nolum_log_tag2.gif" path="hBDT_CO_validated_nice_nolum_log_tag2.gif" size="12055" stream="hBDT_CO_validated_nice_nolum_log_tag2.gif" tmpFilename="/usr/tmp/CGItemp65448" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="Sig_number_trial_33.gif" attr="" comment="Sig_number_trial_33" date="1238600492" name="Sig_number_trial_33.gif" path="Sig_number_trial_33.gif" size="7436" stream="Sig_number_trial_33.gif" tmpFilename="/usr/tmp/CGItemp65264" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="RF_tag1_SBR_SB_sys_validation.gif" attr="" comment="RFbtag1SBR" date="1238601959" name="RF_tag1_SBR_SB_sys_validation.gif" path="RF_tag1_SBR_SB_sys_validation.gif" size="9136" stream="RF_tag1_SBR_SB_sys_validation.gif" tmpFilename="/usr/tmp/CGItemp65306" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="RF_tag1_SBR_SB_sys_validation_zoom.gif" attr="" comment="RFbtag1SBRZoom" date="1238601988" name="RF_tag1_SBR_SB_sys_validation_zoom.gif" path="RF_tag1_SBR_SB_sys_validation_zoom.gif" size="9201" stream="RF_tag1_SBR_SB_sys_validation_zoom.gif" tmpFilename="/usr/tmp/CGItemp65315" user="JennyHolzbauer" version="1"
>
>
META FILEATTACHMENT attachment="Sig_val_yield_new.gif" attr="" comment="Sig_val_yield_btag2" date="1240000231" name="Sig_val_yield_new.gif" path="sig_val_yield_new.gif" size="14147" stream="sig_val_yield_new.gif" tmpFilename="/usr/tmp/CGItemp14386" user="JennyHolzbauer" version="2"
META FILEATTACHMENT attachment="EFF_tmva_spr_tag1_yield.gif" attr="" comment="EFF_tmva_spr_btag1" date="1240000880" name="EFF_tmva_spr_tag1_yield.gif" path="Eff_tmva_spr_tag1.gif" size="10987" stream="Eff_tmva_spr_tag1.gif" tmpFilename="/usr/tmp/CGItemp14946" user="JennyHolzbauer" version="3"
META FILEATTACHMENT attachment="Sig_val_yield.gif" attr="" comment="Sig_val_yield_btag1" date="1240000074" name="Sig_val_yield.gif" path="Sig_val_yield_new.gif" size="17679" stream="Sig_val_yield_new.gif" tmpFilename="/usr/tmp/CGItemp14063" user="JennyHolzbauer" version="2"
META FILEATTACHMENT attachment="Sig_percent_trial_r5_1100_nice.gif" attr="" comment="Sig_percent_trial" date="1239999892" name="Sig_percent_trial_r5_1100_nice.gif" path="sig_percent_trial.gif" size="13749" stream="sig_percent_trial.gif" tmpFilename="/usr/tmp/CGItemp15499" user="JennyHolzbauer" version="2"
META FILEATTACHMENT attachment="Sig_cut_percent_trial_all_r5_1100_nice.gif" attr="" comment="Sig_cut_percent_trial" date="1239999855" name="Sig_cut_percent_trial_all_r5_1100_nice.gif" path="sig_percent_trial_stat.gif" size="14674" stream="sig_percent_trial_stat.gif" tmpFilename="/usr/tmp/CGItemp16379" user="JennyHolzbauer" version="2"
META FILEATTACHMENT attachment="hBDT_CO_validated_nice_nolum_log.gif" attr="" comment="BDTbtag1CO" date="1240000339" name="hBDT_CO_validated_nice_nolum_log.gif" path="hBDT_CO_validated_nice_nolum_log_new.gif" size="11065" stream="hBDT_CO_validated_nice_nolum_log_new.gif" tmpFilename="/usr/tmp/CGItemp15838" user="JennyHolzbauer" version="3"
META FILEATTACHMENT attachment="hBDT_CO_validated_nice_nolum_log_tag2.gif" attr="" comment="BDTbtag2CO" date="1240000369" name="hBDT_CO_validated_nice_nolum_log_tag2.gif" path="hBDT_CO_validated_nice_nolum_lin_new.gif" size="10209" stream="hBDT_CO_validated_nice_nolum_lin_new.gif" tmpFilename="/usr/tmp/CGItemp16304" user="JennyHolzbauer" version="2"
META FILEATTACHMENT attachment="Sig_number_trial_33.gif" attr="" comment="Sig_number_trial_33" date="1239999973" name="Sig_number_trial_33.gif" path="sig_train_new.gif" size="12457" stream="sig_train_new.gif" tmpFilename="/usr/tmp/CGItemp14513" user="JennyHolzbauer" version="2"
META FILEATTACHMENT attachment="RF_tag1_SBR_SB_sys_validation.gif" attr="" comment="RFbtag1SBR" date="1240000018" name="RF_tag1_SBR_SB_sys_validation.gif" path="tag1_SBR.gif" size="9191" stream="tag1_SBR.gif" tmpFilename="/usr/tmp/CGItemp15291" user="JennyHolzbauer" version="2"
META FILEATTACHMENT attachment="RF_tag1_SBR_SB_sys_validation_zoom.gif" attr="" comment="RFbtag1SBRZoom" date="1240000040" name="RF_tag1_SBR_SB_sys_validation_zoom.gif" path="tag1_SBR_zoom.gif" size="9016" stream="tag1_SBR_zoom.gif" tmpFilename="/usr/tmp/CGItemp15987" user="JennyHolzbauer" version="2"
Revision 3
01 Apr 2009 - Main.JennyHolzbauer
Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Significance Estimations using Multivariate Analysis of Single Top Monte Carlo

Here, we return to early data (below 1fb-1), looking at the impact of a multivariate analysis on the capability to observe the single-top signal. EarlyDataStudy looked at the ability of a cut-based analysis to do so, and this is an extension of that analysis. The same data set, with the same preselection cuts is used, with a small modification. Duplicated events were found in the signal files of the original set (which has a severe impact on a multivariate analysis), and these were removed by hand such that no two events in the same signal channel will share an event number. The analysis on this set showed improvement from the cut based analysis and the possibility of at least evidence for the single-top process in early data.
Changed:
<
<
The preselection cuts are the same as those used by other analyses in the single top group, although because this is a low luminosity study, the second jet pT cut was adjusted to 25 GeV and the early data b-tagger, TRFIP2D, was used. The TRF was compared to a random number to determine if an individual jet would be tagged or not, rather than using it as a weight. The pT cuts for the leptons were also lower, set at 20 GeV for muons and 25 GeV for electrons. Additionally, because the W+Jets samples (not including Wbbjets) were FastSim files, the trigger cuts (EM25i, MU20i, or EM60) were applied by weighting events based on the trigger turn-on curves. After applying these cuts, the events were separated into groups according to the number of b-tagged jets, 1 or 2, effectively introducing a maximum tagged jet cut. The event yields, weighted to 100pb-1, can be found in the table below for samples containing 1 b-tagged jet or 2 b-tagged jets, and muons or electrons.
>
>
The updated event yields, weighted to 100pb-1, can be found in the table below for samples containing 1 b-tagged jet or 2 b-tagged jets, and muons or electrons. In this analysis, two selections are studied, one isolating events with 1 b-tagged jet and the other isolating events with 2 b-tagged jets. As in the earlier cut-based analysis, the early b-tagger IP2D is used. Some of the Pt cuts are also adjusted to have slightly lower thresholds, such as the second leading jet Pt, which is required to be greater than 25 GeV. These adjustments are the same as in the cut-based analysis.
 
processes muon (1 b-jet) electron (1 b-jet) muon (2 b-jet) electron (2 b-jet) both (1 b-jet) both (2 b-jet)
t-channel 290.1 389.5 116.5 132.7 679.6 249.4
Line: 20 to 20
 
S/B 0.19 0.19 0.21 0.19 0.19 0.20
Changed:
<
<
The S/B ratio is about 0.2. This is high compared to other single-top analyses because in this analysis, the signal is combined.
>
>
The S/B ratio is about 0.2. This is high compared to other single-top analyses because this analysis combines the signal channels (t-channel, s-channel, wt) into a single signal. There is probably also some impact from the slightly different cut selection and the different b-tagger.
 
Changed:
<
<

Additional Variable Selection (NEEDS MODIFICATION, NOT ACCURATE)

>
>

Additional Variable Selection

  This analysis required the use of several variables beyond those used for preselection cuts. Explanations are given here:
  • Ht refers to the sum of the Pt of the particles indicated (except for MET, for which we used Et)
  • DeltaR refers to sqrt{(Delta Eta)^2 + (Delta Phi)^2} for the particles in question
Line: 30 to 30
 
  • lepton refers to the leading lepton
  • centrality is (sum Pt/sum P)
  • Transverse W mass is sqrt{(leptonPt + MET)^2 - (leptonP_x + MEX)^2 - (leptonP_y + MEY)^2}
Added:
>
>
  • Jet1 refers to the leading jet, jet2 refers to the second leading jet, etc.
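
The variable definitions in the list above can be written out as a short sketch. The function names and argument layout are illustrative assumptions, not the analysis code. DeltaR is computed with the usual convention of the difference in eta and the difference in azimuthal angle phi, with the angle difference wrapped into [-pi, pi].

```python
import math

def delta_phi(phi1, phi2):
    """Azimuthal angle difference wrapped into [-pi, pi]."""
    d = phi1 - phi2
    while d > math.pi:
        d -= 2.0 * math.pi
    while d < -math.pi:
        d += 2.0 * math.pi
    return d

def delta_r(eta1, phi1, eta2, phi2):
    """DeltaR = sqrt((Delta Eta)^2 + (Delta Phi)^2)."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))

def ht(pts):
    """Ht: scalar sum of the Pt (or Et, for MET) of the listed objects."""
    return sum(pts)

def centrality(momenta):
    """Centrality: (sum Pt) / (sum |P|) for objects given as (px, py, pz)."""
    sum_pt = sum(math.hypot(px, py) for px, py, pz in momenta)
    sum_p = sum(math.sqrt(px * px + py * py + pz * pz) for px, py, pz in momenta)
    return sum_pt / sum_p

def transverse_w_mass(lep_px, lep_py, met_x, met_y):
    """Transverse W mass from the lepton and missing transverse energy."""
    lep_pt = math.hypot(lep_px, lep_py)
    met = math.hypot(met_x, met_y)
    m2 = (lep_pt + met) ** 2 - (lep_px + met_x) ** 2 - (lep_py + met_y) ** 2
    # Guard against small negative values from floating-point rounding.
    return math.sqrt(max(m2, 0.0))
```

For a lepton and MET that are back to back with 40 GeV each, `transverse_w_mass(40.0, 0.0, -40.0, 0.0)` gives 80 GeV, while for collinear lepton and MET it gives zero.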
 
Changed:
<
<
The following are the variables used in the multivariate analysis:
>
>
The following are the 33 variables used in the multivariate analysis:
 

  • Ht(jet1, jet2), Ht(all jets), Ht(lepton, MET)
Deleted:
<
<
  • Ht(jet1, jet2, lepton, MET), Ht(all jets, lepton, MET)
 
  • Delta Pt(jet1, jet2), Delta Pt(btaggedjet1, untaggedjet1)
Changed:
<
<
  • DeltaR(btaggedjet1, lepton), DeltaR(untaggedjet1, lepton), DeltaR(jet1, lepton)
  • DeltaR(btaggedjet1, btaggedjet2), DeltaR(jet1, jet2)
  • H(all jets), H(jet1, jet2)
  • Pt(jet1), Pt(btaggedjet1), Pt(untaggedjet1), Pt(untaggedjet2), Pt(lepton)
>
>
  • DeltaR(untaggedjet1, lepton), DeltaR(jet1, lepton), DeltaR(jet1, jet2)
  • H(all jets), H(jet1, jet2), MET (Missing Et)
  • Pt(jet1), Pt(jet2), Pt(btaggedjet1), Pt(untaggedjet1), Pt(lepton)
 
  • Pt(jet1) + Pt(jet2)
  • Eta(btaggedjet1), Eta(untaggedjet1)
  • Maximum jet eta in event
  • Minimum jet eta in event
Changed:
<
<
  • Mass of jet1 + mass of jet2
  • Missing Et, also written MET
>
>
  • Eta(jet2), Eta(lepton), Phi(lepton)
 
  • Number of jets, Number of untagged jets
  • Transverse mass of W
Changed:
<
<
  • Mass of the b-tagged top
>
>
  • Mass of the b-tagged jet top quark
  • Mass of the leading jet top quark
  • Invariant mass of jet1 + jet2
  • Invariant mass of all jets
 
  • Centrality(all jets, lepton),Centrality(jet1, jet2)

Multivariate Analysis Programming Packages

Two analysis packages were initially considered, TMVA and SPR. In both cases, the general procedure was the same. The Monte Carlo data were split into three sets, to be used for the trial, validation, and yield phases of analysis. Because the SPR program is unable to handle negatively weighted events in the training cycle, these events were removed from the training set (although not from the validation or yield sets). Additionally, the events in the training sets were randomized to eliminate any order-dependency from merging the trees from files of different event types. The randomized sets were used to train the classifier, and the resulting significances were noted. For the purposes of this paper, the randomized data set that produced the best trained classifiers was used for validation purposes.
Changed:
<
<
An initial analysis was done to determine which of the classifiers best separated background (ttbar, w+jets, w+bbjets, ww, wz) and signal (t-channel, s-channel, wt). In this phase, no systematics were included and efficiency curves were examined. The goal was to minimize the background efficiency and maximize the signal efficiency. It was found that the best classifier for TMVA was a boosted decision tree and the best classifier for SPR was a random forest using arcing(REF!!!), which will be referred to as an arcx4 classifier. Figure~\ref{earlydata_TMVASPR} shows the best classifiers in SPR and TMVA together, and it indicates that the SPR classifier performs better.
>
>
An initial analysis was done to determine which of the classifiers best separated background (ttbar, w+jets, w+bbjets, ww, wz) and signal (t-channel, s-channel, wt). In this phase, no systematics were included and efficiency curves were examined. The goal was to minimize the background efficiency and maximize the signal efficiency. It was found that the best classifier for TMVA was a boosted decision tree and the best classifier for SPR was a random forest. Classifier output plots can be seen below, in log scale, for both.
Random Forest, SPR, 1 b-tag Sample Random Forest, SPR, 2 b-tag Sample
hBagger1_b-Tagged_Jet_CO_validation.gif hBagger2_b-Tagged_Jets_CO_validation.gif
 
Changed:
<
<
FigureXXX shows the efficiency curves from the best TMVA and SPR classifiers. Although they appear close, the interesting region in the lower left. When a cut on the classifier output is taken to maximize the significance in both cases, the resulting efficiencies, as shown in tableXXX, show TMVA to have a lower signal efficiency than SPR, which results in a classifier that does not perform as well.
>
>
Boosted Decision Tree, TMVA, 1 b-tag Sample Boosted Decision Tree, TMVA, 2 b-tag Sample
hBDT_CO_validated_nice_nolum_log.gif hBDT_CO_validated_nice_nolum_log_tag2.gif
 
Changed:
<
<
Program Classifier Cut Signal Efficiency Background Efficiency
SPR arcx4 0.53 0.136 0.003
TMVA bdt 0.07 0.076 0.003
>
>
The figures below show the efficiency curves from the best TMVA and SPR classifiers, zoomed in on the interesting region in the lower left. When a cut on the classifier output is chosen to maximize the significance in both cases, the resulting efficiencies, shown in the table below, indicate that TMVA has a lower signal efficiency and a higher background efficiency than SPR, so its classifier does not perform as well. From this point on, the analysis focused on the SPR random forest classifier.
 
Changed:
<
<
At this point, the SPR classifier was focused on. The arcx4 classifier output curve has bumps after signal is separated from background, and these events were examined separately to see if they seemed reasonable. No spikes at unphysical values, for instance, were found. It was noted that the signal bumps contained two different signals, t-channel and wt. However, there is no apparent reason to reject this classifier on the basis of this plot shape. Figure~\ref{earlydata_Bump} shows the events from the signal bump area (corresponding to a classifier output cut of 0.71) for the arcx4 classifier for a few common variables, based on the validation set.
>
>
Program Signal Efficiency Background Efficiency
TMVA bdt, 1 b-tagged jet 0.0243 0.0031
SPR arcx4, 1 b-tagged jet 0.0272 0.0018
TMVA bdt, 2 b-tagged jets 0.0726 0.0218
SPR arcx4, 2 b-tagged jets 0.0754 0.0107
 
Added:
>
>
Efficiency, 1 b-tag Sample Efficiency, 2 b-tag Sample
EFF_tmva_spr_tag1_yield.gif EFF_tmva_spr.gif
 

Multivariate Analysis with Systematics

Changed:
<
<
To estimate the systematics, the jet energies in the events were scaled up and down by 10 percent, as they were in the cut-based analysis, and the b-tagging efficiency was also altered up and down to determine the average effect of each of these systematic. The average difference from the unaltered background events was found, and this was then turned into a percentage of the background for each systematic. After a cut was made, the remaining background were multiplied by these percentages to determine the error from b-tagging and jet energy scale systematics. Additionally, the square root of the sum of the squares of the weights was calculated to determine the Monte Carlo statistical uncertainty. Additionally, the luminosity error was taken at 20 percent of the total background events. Background cross-section error was estimated at 14 percent after preselection cuts, given a 10 percent error for ttbar and 20 percent for W+Jets. ISR/FSR was taken at 10 percent and an additional 5 percent was allocated for other errors. The error percentages after a classifier output cut for SPR's arcx4 classifier are given in tableXXXX. After this cut at 0.53, the overall background yield is 3.9 +/- 1.6, and includes 42 unweighted background events.
>
>
To estimate the systematics, the jet energies in the events were scaled up and down by 10 percent, as they were in the cut-based analysis, and the b-tagging efficiency was also varied up and down to determine the average effect of each of these systematics. The average difference from the unaltered background events was found, and this was then expressed as a percentage of the background for each systematic. After a cut was made, the remaining background events were multiplied by these percentages to determine the error from the b-tagging and jet energy scale systematics. The square root of the sum of the squares of the weights was calculated to determine the Monte Carlo statistical uncertainty. The luminosity error was taken as 20 percent of the total background events. The background cross-section error was estimated at 14 percent after preselection cuts, given a 10 percent error for ttbar and 20 percent for W+Jets. ISR/FSR was taken at 10 percent and an additional 5 percent was allocated for other errors. The error percentages after a classifier output cut for SPR's arcx4 classifier are given in the table below.

| Errors | 1 b-tagged jet percentage | 2 b-tagged jet percentage |
| JES | 18 | 7 |
| B-tagging | 9 | 13 |
| MC Statistics | 22 | 19 |
| Luminosity | 20 | 20 |
| Background x-section | 14 | 14 |
| ISR/FSR | 10 | 10 |
| Other | 5 | 5 |
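As a rough illustration of how the entries in the table above could be combined, the sketch below computes the Monte Carlo statistical uncertainty from event weights and adds the listed percentages in quadrature. Adding in quadrature assumes the sources are uncorrelated; the text does not state how the total was formed, so this is an assumption, and the function and dictionary names are hypothetical.

```python
import math

# Percentages copied from the table above.
ERRORS_1TAG = {
    "JES": 18, "B-tagging": 9, "MC Statistics": 22, "Luminosity": 20,
    "Background x-section": 14, "ISR/FSR": 10, "Other": 5,
}
ERRORS_2TAG = {
    "JES": 7, "B-tagging": 13, "MC Statistics": 19, "Luminosity": 20,
    "Background x-section": 14, "ISR/FSR": 10, "Other": 5,
}


def mc_stat_uncertainty(weights):
    """Square root of the sum of the squared event weights."""
    return math.sqrt(sum(w * w for w in weights))


def total_percent(errors, exclude=()):
    """Total percentage error.

    ASSUMPTION: the sources are treated as uncorrelated and are
    therefore added in quadrature.
    """
    return math.sqrt(
        sum(p * p for name, p in errors.items() if name not in exclude)
    )
```

For example, `total_percent(ERRORS_1TAG)` is about 40 percent. With the luminosity row left out (`exclude=("Luminosity",)`), the totals come to roughly 35 and 30 percent, which is close to the relative uncertainties implied by the expected background values quoted later on this page (3.0/8.5 and 5.5/18.4); the text does not say whether those numbers were obtained this way.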
The significance was calculated in two ways, a sideband method (cross-check) and a Gaussian-pdf method (primary), both of which involved a frequentist solution to estimate the significance (“Evaluation of three methods for calculating statistical significance when incorporating a systematic uncertainty into a test of the background-only hypothesis for a Poisson process”, by R.D. Cousins, J.T. Linnemann, and J. Tucker, 23 Aug. 2008). The systematics discussed above are used in the Gaussian-pdf method calculations, which assume a Gaussian pdf for the background mean. In both cases, the calculation was done with the code in Appendix E of that paper. The sideband method requires a leftward cut on the classifier output to isolate a no-signal region in which the amount of background is determined. For this cut, it was required that S/B remain below 5 percent and that enough background events remain for comparison.
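The Appendix E code used for the numbers on this page is not reproduced here. As a minimal sketch only, and assuming the sideband method corresponds to the binomial on/off test discussed in the cited paper (called Z_Bi there), the significance for integer counts can be written with the Python standard library. The function name and arguments are hypothetical, and the sketch does not handle the weighted, non-integer yields used in this study.

```python
import math
from statistics import NormalDist


def z_bi(n_on, n_off, tau):
    """Binomial on/off significance for integer event counts.

    n_on  : events observed in the signal region
    n_off : events observed in the sideband (background) region
    tau   : expected background in the sideband divided by the
            expected background in the signal region
    """
    rho = 1.0 / (1.0 + tau)
    n_tot = n_on + n_off
    # Probability, under background only, that at least n_on of the
    # n_tot events fall in the signal region.
    p_value = sum(
        math.comb(n_tot, j) * rho**j * (1.0 - rho) ** (n_tot - j)
        for j in range(n_on, n_tot + 1)
    )
    # Convert the one-sided p-value to a Gaussian significance.
    return -NormalDist().inv_cdf(p_value)
```

For instance, `z_bi(4, 5, 5.0)` evaluates to about 1.66. For a background mean mu_b with Gaussian uncertainty sigma_b, the algebraically equivalent sideband inputs are tau = mu_b / sigma_b^2 and n_off = mu_b^2 / sigma_b^2, but these are generally not integers, so the sum above would have to be replaced by an incomplete beta function.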

Statistics also had to be considered. The significance estimation requires a Gaussian shape to the background mean and without enough statistics the results can be misleading. To allow comfortably for a Gaussian shape, a requirement of at least 18 unweighted background events was implemented. The chosen classifier in SPR has a significance curve that peaks well before this cut becomes an issue, although this cut played some role in the overall choice of classifier parameters.

When considering the variables used in this analysis, the significance was estimated based on the validation sample several times while varying the number of variables used, to see if the dimensionality could be reduced without a loss of significance. However, there appears to be a linear correlation between significance and dimensionality, as can be seen in the figure below. This indicates that more variables may be useful in future studies and that, for this study, all 33 variables should be used to optimize the significance.

Figure: Significance vs Dimensionality, 1 b-tag Sample (Sig_number_trial_33.gif)

Running the trained random forest classifier over the final sample, the yield sample, gave significance values of 3.8 sigma for the 1 b-tagged jet selection and about 2.6 sigma for the 2 b-tagged jet selection, with the significances from the sideband estimation being only slightly lower, as can be seen in the table below. The table also shows the expected signal and background obtained while estimating the Gaussian pdf-based significance. If the weights are adjusted to give 1fb-1 luminosity, then we expect 3 sigma significance for the 2 b-tagged sample and 4.7 sigma for the 1 b-tagged sample, even with the limitations of this study. Luminosity plots are shown below.

| Properties | 1 b-Tagged Jet, SPR RF | 2 b-Tagged Jets, SPR RF |
| Gaussian pdf Significance | 3.8 | 2.6 |
| Sideband Significance | 3.6 | 2.3 |
| Expected Signal | 25.1 | 25.3 |
| Expected Background | 8.5 +/- 3.0 | 18.4 +/- 5.5 |

Figures: (left) Luminosity, 1 b-tag Sample (hBagger1_b-Tagged_Jet_LUM_sys_validation.gif); (right) Luminosity, 2 b-tag Sample (hBagger2_b-Tagged_Jets_LUM_sys_validation.gif)

Although no actual data are considered in this analysis, it is still interesting to consider the cross-section and the uncertainties on it based on the simulated events. This calculation included the systematic uncertainties listed earlier. Here, we also include a statistical uncertainty and a separate luminosity uncertainty at 20%. For the 1 b-tagged jet selection, the cross-section is 323 + 112(sys) - 116(sys) +/- 75(stat) +/- 65(lum) pb. For the 2 b-tagged jet selection, the cross-section is 323 + 147(sys) - 162(sys) +/- 84(stat) +/- 65(lum) pb. It is clear in both cases that the uncertainties are quite large, which one might expect in an early data situation.

Once the cross-section is obtained, the |Vtb| value from the quark mixing matrix can also be calculated. In this case the signal theoretical cross-section is also included in the systematic uncertainties, with a value of 5%, although it has a negligible impact on the overall numbers. Expressing |Vtb, exp| as
 
Changed:
<
<
| Errors | Percentage after 0.53 cut at 30pb-1 |
| JES | 18 |
| B-tagging | 9 |
| MC Statistics | 24 |
| Luminosity | 20 |
| Background x-section | 14 |
| ISR/FSR | 10 |
| Other | 5 |
The significance was calculated in two ways, a sideband method and a Gaussian-form method, both of which involved a frequentist solution to estimate the significance~\cite{GaussSideband}. The systematics discussed above are used in the Gaussian method calculations. The sideband method requires a cut to determine the amount of background in a no signal region, and will be discussed later. In both cases, the calculation was done with code in Appendix E of the paper~\cite{GaussSideband}. Additionally, the sideband method requires a leftward cut to isolate a background region. It was required that S/B remain at least less than 5 percent and also that there be a significant amount of background events for comparison. A cut was chosen at 0.14, which gave about 2 percent for the S/B ratio, and also quite a bit of background (392 unweighted background in the validation sample). Lower cuts could improve S/B slightly, but there is also increasing uncertainty on the background, and increasing deviation from a gaussian form. Also, a comparison of sideband results using this cut at 2 percent S/B and 5 percent S/B showed only a few tenths of significance difference, so it seems reasonable to keep the cut at 0.14.

The following table gives the significance levels based on the Gaussian method for three classifiers under consideration, at 10pb-1, with a classifier cut to allow reasonable numbers of events in each case. It is evident that the arcx4 classifier has the best significance. Calculations were also done on the validation sample for ???

INSERT TABLE

Running the trained arcx4 classifier over the final sample, the yield sample, revealed significance values that were still high, although somewhat lower than with the validation sample.
| Classifier | Gaussian method Significance | Sideband method Significance | Expected Background | Expected Signal |
| ArcX4 (SPR) | 5.1 | 6.5 | 4.5 | 25.5 |
These are significance values taken at 30~pb-1 for the ArcX4 classifier using the yield sample, with an 0.53 classifier output cut to isolate signal and a 0.14???? cut to isolate background
>
>
$|V_{tb,\mathrm{exp}}| = \sqrt{\sigma_{\mathrm{exp}} / \sigma_{\mathrm{sm}}} \; |V_{tb,\mathrm{sm}}|$

and taking |Vtb, sm| to be 1, the uncertainties may be simply propagated to give the |Vtb, exp| value, since the ratio of the cross-sections in this case is also 1. For the 1 b-tagged jet selection, |Vtb, exp| is 1.0 + 0.17(sys) - 0.18(sys) +/- 0.12(stat) +/- 0.10(lum), and for the 2 b-tagged jet selection, |Vtb, exp| is 1.0 + 0.23(sys) - 0.25(sys) +/- 0.13(stat) +/- 0.10(lum).
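The propagation can be sketched as follows, assuming first-order error propagation through the square root, so that each relative uncertainty on the cross-section is halved. The function names are hypothetical, and the 5% theoretical cross-section term mentioned above is not added separately here.

```python
import math


def vtb_from_xsec(xsec_exp, xsec_sm, vtb_sm=1.0):
    """|Vtb, exp| = sqrt(xsec_exp / xsec_sm) * |Vtb, sm|."""
    return math.sqrt(xsec_exp / xsec_sm) * vtb_sm


def vtb_uncertainty(xsec_exp, d_xsec, xsec_sm, vtb_sm=1.0):
    """First-order propagation: d|Vtb| = |Vtb| * 0.5 * d_xsec / xsec_exp."""
    return vtb_from_xsec(xsec_exp, xsec_sm, vtb_sm) * 0.5 * d_xsec / xsec_exp


# Cross-section and uncertainty components in pb, as quoted in the text.
XSEC = 323.0
ONE_TAG = {"sys+": 112.0, "sys-": 116.0, "stat": 75.0, "lum": 65.0}
TWO_TAG = {"sys+": 147.0, "sys-": 162.0, "stat": 84.0, "lum": 65.0}


def propagate(components):
    """Propagate each cross-section uncertainty component to |Vtb, exp|."""
    return {
        name: round(vtb_uncertainty(XSEC, d, XSEC), 2)
        for name, d in components.items()
    }
```

Calling `propagate(ONE_TAG)` and `propagate(TWO_TAG)` gives values that agree, to two decimal places, with the |Vtb, exp| uncertainties quoted above.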

Both selections show large uncertainties, which is to be expected in early data. This analysis also leaves some room for improvement by future analyses. All of the variables listed in this paper were used, and additional variables may improve the result. There could also be some improvement from combining the two selections, which is beyond the scope of this study.

Additional Studies

Effect of Statistics on Significance

Even with a combined signal analysis, the raw counts may be somewhat low for a 33 variable space. Thus, statistics were considered to see what sort of improvement, if any, increased statistics would give to the results. Several training subsets were made, each with a percentage of the total training set. Each of these subsets contained at least eight samples and of these, the sample with the highest significance was chosen. The figure below on the left shows the significance versus the percentage of the total training set used for training the classifier. The set with a 2 b-tagged jet selection has a significance that is generally flat, so more statistics may not dramatically improve the current result. The 1 b-tagged selection, however, has an oddly shaped line, which is most likely not just related to the size of the training sample.

Figures: (left) Significance vs Percentage of Training Sample (Sig_percent_trial_r5_1100_nice.gif); (right) Effect of Unweighted Background Cut (Sig_cut_percent_trial_all_r5_1100_nice.gif)

There is another parameter that is not considered in the figure: the cut on the number of unweighted background events, which is directly related to the statistics of the sample. How this cut is made can be seen in the figure below for the 1 b-tagged jet sample using the SPR random forest classifier. Here, both the significance and the sum of the background events, integrated from the right, are plotted against classifier output. A cut at 0.51 has the maximum significance and 21 unweighted background events remaining. However, if the significance were to peak after the cut requiring 18 unweighted background events, the significance quoted may not be the highest possible. If there were more statistics, the unweighted background cut would move to the right and perhaps the significance curve would peak before the cut, at a higher value.
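A minimal sketch of such a scan is given below: events are integrated from the right in classifier output, and the cut with the highest significance is kept among those that leave at least 18 unweighted background events. The event representation, the function name, and the default figure of merit (s divided by the square root of b) are assumptions made for illustration; the study itself used the Gaussian pdf significance.

```python
import math


def choose_cut(signal, background, min_unweighted_bkg=18, significance=None):
    """Pick the classifier-output cut with the highest significance.

    signal, background : lists of (classifier_output, weight) pairs
    min_unweighted_bkg : minimum number of raw background events that
                         must survive the cut
    significance       : callable (s, b) -> float
    Returns (cut, significance, unweighted_background) or None.
    """
    if significance is None:
        # ASSUMPTION: simple placeholder, not the study's method.
        def significance(s, b):
            return s / math.sqrt(b) if b > 0 else 0.0

    events = list(signal) + list(background)
    thresholds = sorted({out for out, _ in events}, reverse=True)
    best = None
    for cut in thresholds:
        kept_bkg = [w for out, w in background if out >= cut]
        if len(kept_bkg) < min_unweighted_bkg:
            continue  # too few raw background events at this cut
        s = sum(w for out, w in signal if out >= cut)
        z = significance(s, sum(kept_bkg))
        if best is None or z > best[1]:
            best = (cut, z, len(kept_bkg))
    return best
```

Raising `min_unweighted_bkg` can only push the accepted cut to the left, which is how a low-statistics sample can hide a significance peak that lies further to the right.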

The figure above on the right shows, for each trial size, the percentage of cases where the significance peak is to the left of the cut on the unweighted background events, so that the maximum potential significance may not be accessible to this study due to low statistics. The odd shape in the first plot is likely related to the increasing impact of the cut on the number of unweighted background events as the percentage of the training sample used increases. This, incidentally, has little effect on the 2 b-tagged selection, for which no subset except the smallest is affected by this cut, as can be seen in the figure. Increased statistics would help to reduce this effect in the 1 b-tagged jet sample, as the cut is directly related to statistics, and could possibly improve the resulting significances in this sample.

Figures: (left) Significance and Unweighted Background Cut (RF_tag1_SBR_SB_sys_validation.gif); (right) Zoomed (RF_tag1_SBR_SB_sys_validation_zoom.gif)

Variation in Significances of Yield and Validation Samples

In doing this study, it was found that the significances from the validation and yield samples were somewhat different. To see whether this was chance or a trend, for the benefit of future studies, the calculations were repeated many times. For the 1 and 2 b-tagged jet samples, eight randomized training sets were produced; the ones that gave the highest validation significances were used in the main study of this paper. All of these sets were then used to train classifiers for both the TMVA boosted decision tree and the SPR random forest. Each of these classifiers was applied to the validation and yield samples, and the significances, using the Gaussian pdf method, were noted. These significances are shown in the figures below, on the left for the 1 b-tagged jet selection and on the right for the 2 b-tagged jet selection. In general, the classifiers show a range of significances for both the validation and yield samples. This variation may help to explain the slight fluctuations in the 2 b-tagged sample curve in the previous section that could not be explained by the impact of the unweighted background count cut. Additionally, a line with slope 1 is plotted in the figures below. For both classifiers and both b-tagging selections, most points are below this line, indicating that the validation significances are generally higher than the yield significances. The classifier parameters are selected to optimize the validation significance, so these plots may indicate that the parameters are too finely tuned to the validation sample. In general, it seems that selecting a classifier that gives a high validation significance will not necessarily give a high yield significance. It may be more useful for future studies to combine these classifiers, rather than choosing one.

Figures: (left) Validation vs Yield Significances, 1 b-tagged jet (Sig_val_yield.gif); (right) Validation vs Yield Significances, 2 b-tagged jets (Sig_val_yield_new.gif)
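The comparison against the slope-1 line can be quantified with a short sketch like the one below, which takes (validation, yield) significance pairs, counts the fraction lying below the line, and reports the mean shift. The function names are hypothetical, and any numbers passed in are up to the reader; no values from the study's plots are reproduced here.

```python
def fraction_below_diagonal(pairs):
    """Fraction of (validation, yield) pairs with yield < validation."""
    pairs = list(pairs)
    below = sum(1 for validation, yield_sig in pairs if yield_sig < validation)
    return below / len(pairs)


def mean_shift(pairs):
    """Average of (validation - yield) over all pairs."""
    pairs = list(pairs)
    return sum(validation - yield_sig for validation, yield_sig in pairs) / len(pairs)
```

A fraction well above one half together with a positive mean shift would point to the same trend described in the text, namely classifier parameters tuned too closely to the validation sample.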
 

-- JennyHolzbauer - 06 Jan 2009
Added:
>
>
-- JennyHolzbauer - 01 Apr 2009

META FILEATTACHMENT attachment="hBagger1_b-Tagged_Jet_CO_validation.gif" attr="" comment="RFbtag1CO" date="1238599227" name="hBagger1_b-Tagged_Jet_CO_validation.gif" path="hBagger1 b-Tagged Jet_CO_validation.gif" size="12025" stream="hBagger1 b-Tagged Jet_CO_validation.gif" tmpFilename="/usr/tmp/CGItemp65454" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="hBagger1_b-Tagged_Jet_LUM_sys_validation.gif" attr="" comment="RFbtag1LUM" date="1238599270" name="hBagger1_b-Tagged_Jet_LUM_sys_validation.gif" path="hBagger1 b-Tagged Jet_LUM_sys_validation.gif" size="6738" stream="hBagger1 b-Tagged Jet_LUM_sys_validation.gif" tmpFilename="/usr/tmp/CGItemp65296" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="hBagger2_b-Tagged_Jets_CO_validation.gif" attr="" comment="RFbtag2CO" date="1238599312" name="hBagger2_b-Tagged_Jets_CO_validation.gif" path="hBagger2 b-Tagged Jets_CO_validation.gif" size="12276" stream="hBagger2 b-Tagged Jets_CO_validation.gif" tmpFilename="/usr/tmp/CGItemp65453" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="hBagger2_b-Tagged_Jets_LUM_sys_validation.gif" attr="" comment="RFbtag2LUM" date="1238599342" name="hBagger2_b-Tagged_Jets_LUM_sys_validation.gif" path="hBagger2 b-Tagged Jets_LUM_sys_validation.gif" size="6549" stream="hBagger2 b-Tagged Jets_LUM_sys_validation.gif" tmpFilename="/usr/tmp/CGItemp65384" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="EFF_tmva_spr.gif" attr="" comment="EFF_tmva_spr_btag2" date="1238599406" name="EFF_tmva_spr.gif" path="EFF_tmva_spr.gif" size="11194" stream="EFF_tmva_spr.gif" tmpFilename="/usr/tmp/CGItemp65451" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="Sig_val_yield_new.gif" attr="" comment="Sig_val_yield_btag2" date="1238599450" name="Sig_val_yield_new.gif" path="Sig_val_yield_new.gif" size="13810" stream="Sig_val_yield_new.gif" tmpFilename="/usr/tmp/CGItemp65419" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="EFF_tmva_spr_tag1_yield.gif" attr="" comment="EFF_tmva_spr_btag1" date="1238599486" name="EFF_tmva_spr_tag1_yield.gif" path="EFF_tmva_spr_tag1_yield.gif" size="10408" stream="EFF_tmva_spr_tag1_yield.gif" tmpFilename="/usr/tmp/CGItemp65420" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="Sig_val_yield.gif" attr="" comment="Sig_val_yield_btag1" date="1238599532" name="Sig_val_yield.gif" path="Sig_val_yield.gif" size="17309" stream="Sig_val_yield.gif" tmpFilename="/usr/tmp/CGItemp65391" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="Sig_percent_trial_r5_1100_nice.gif" attr="" comment="Sig_percent_trial" date="1238599574" name="Sig_percent_trial_r5_1100_nice.gif" path="Sig_percent_trial_r5_1100_nice.gif" size="8459" stream="Sig_percent_trial_r5_1100_nice.gif" tmpFilename="/usr/tmp/CGItemp65392" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="Sig_cut_percent_trial_all_r5_1100_nice.gif" attr="" comment="Sig_cut_percent_trial" date="1238599606" name="Sig_cut_percent_trial_all_r5_1100_nice.gif" path="Sig_cut_percent_trial_all_r5_1100_nice.gif" size="9244" stream="Sig_cut_percent_trial_all_r5_1100_nice.gif" tmpFilename="/usr/tmp/CGItemp65424" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="hBDT_CO_validated_nice_nolum_log.gif" attr="" comment="BDTbtag1CO" date="1238599668" name="hBDT_CO_validated_nice_nolum_log.gif" path="hBDT_CO_validated_nice_nolum_log.gif" size="10357" stream="hBDT_CO_validated_nice_nolum_log.gif" tmpFilename="/usr/tmp/CGItemp65483" user="JennyHolzbauer" version="2"
META FILEATTACHMENT attachment="hBDT_CO_validated_nice_nolum_log_tag2.gif" attr="" comment="BDTbtag2CO" date="1238600037" name="hBDT_CO_validated_nice_nolum_log_tag2.gif" path="hBDT_CO_validated_nice_nolum_log_tag2.gif" size="12055" stream="hBDT_CO_validated_nice_nolum_log_tag2.gif" tmpFilename="/usr/tmp/CGItemp65448" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="Sig_number_trial_33.gif" attr="" comment="Sig_number_trial_33" date="1238600492" name="Sig_number_trial_33.gif" path="Sig_number_trial_33.gif" size="7436" stream="Sig_number_trial_33.gif" tmpFilename="/usr/tmp/CGItemp65264" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="RF_tag1_SBR_SB_sys_validation.gif" attr="" comment="RFbtag1SBR" date="1238601959" name="RF_tag1_SBR_SB_sys_validation.gif" path="RF_tag1_SBR_SB_sys_validation.gif" size="9136" stream="RF_tag1_SBR_SB_sys_validation.gif" tmpFilename="/usr/tmp/CGItemp65306" user="JennyHolzbauer" version="1"
META FILEATTACHMENT attachment="RF_tag1_SBR_SB_sys_validation_zoom.gif" attr="" comment="RFbtag1SBRZoom" date="1238601988" name="RF_tag1_SBR_SB_sys_validation_zoom.gif" path="RF_tag1_SBR_SB_sys_validation_zoom.gif" size="9201" stream="RF_tag1_SBR_SB_sys_validation_zoom.gif" tmpFilename="/usr/tmp/CGItemp65315" user="JennyHolzbauer" version="1"
Revision 2
10 Mar 2009 - Main.JennyHolzbauer
Line: 1 to 1
 
META TOPICPARENT name="WebHome"

Significance Estimations using Multivariate Analysis of Single Top Monte Carlo

Changed:
<
<
Here, we return to early data (below 1fb-1), looking at the impact of a multivariate analysis on the capability to observe the single-top signal. EarlyDataStudy looked at the ability of a cut-based analysis to do so, and this is an extension of that analysis. The same data set, with the same preselection cuts is used. However, this analysis only considers the 1 b-tagged jet sample because of its promising nature and larger statistics. Although the cut based analysis only predicted a few sigma at most for a significance value at 100pb-1, the multivariate approach has improved these results.
>
>
Here, we return to early data (below 1fb-1), looking at the impact of a multivariate analysis on the capability to observe the single-top signal. EarlyDataStudy looked at the ability of a cut-based analysis to do so, and this is an extension of that analysis. The same data set, with the same preselection cuts, is used, with a small modification. Duplicated events were found in the signal files of the original set (which has a severe impact on a multivariate analysis), and these were removed by hand so that no two events in the same signal channel share an event number. The analysis on this set showed improvement over the cut-based analysis and the possibility of at least evidence for the single-top process in early data.
 
Changed:
<
<
The event yields are given in the table below. Note that this table only includes 1 b-tagged jet events. 2 b-tagged jets and no b-tagging cut samples still need to be examined, and are not included in this analysis. Also note that, as in the last analysis, the early data b-tagger, IP2D, was used, which has a higher fake rate than the b-tagger recommended for general data. Note that W+Jets includes W+0, W+1, W+2 and W+3 jets, and similarly for wbbjets.
| processes | muon (1 b-jet) | electron (1 b-jet) |
| s-channel | 13.9 | 9.9 |
| t-channel | 336.2 | 247.8 |
| W+t channel | 99.8 | 76.6 |
| ttbar to l+jets | 899.5 | 729.5 |
| ttbar to l+l | 411.0 | 316.9 |
| Wjets | 1422.8 | 927.5 |
| Wbbjets | 46.0 | 29.2 |
| WW | 23.1 | 17.9 |
| WZ | 7.3 | 5.4 |
From table, S/B can also be calculated and is found to be 0.16 for early data for both data sets.
>
>
The preselection cuts are the same as those used by other analyses in the single top group, although because this is a low luminosity study, the second jet pT cut was adjusted to 25 GeV and the early data b-tagger, TRFIP2D, was used. The TRF was compared to a random number to determine whether an individual jet would be tagged or not, rather than being used as a weight. The pT cuts for the leptons were also lower, set at 20 GeV for muons and 25 GeV for electrons. Additionally, because the W+Jets samples (not including Wbbjets) were FastSim files, the trigger cuts (EM25i, MU20i, or EM60) were applied by weighting events based on the trigger turn-on curves. After applying these cuts, the events were separated into groups according to the number of b-tagged jets, 1 or 2, effectively introducing a maximum tagged jet cut. The event yields, weighted to 100pb-1, can be found in the table below for samples containing 1 b-tagged jet or 2 b-tagged jets, and muons or electrons.
| processes | muon (1 b-jet) | electron (1 b-jet) | muon (2 b-jet) | electron (2 b-jet) | both (1 b-jet) | both (2 b-jet) |
| t-channel | 290.1 | 389.5 | 116.5 | 132.7 | 679.6 | 249.4 |
| s-channel | 11.6 | 16.3 | 7.8 | 10.0 | 27.9 | 17.8 |
| Wt-channel | 92.1 | 118.3 | 31.4 | 36.6 | 210.4 | 68.1 |
| ttbar to lep+jets | 729.5 | 899.5 | 205.7 | 270.3 | 1629.0 | 476.0 |
| ttbar to lep+lep | 316.9 | 411.0 | 444.7 | 547.4 | 727.9 | 992.1 |
| Wjets | 927.5 | 1422.8 | 79.1 | 114.0 | 2350.3 | 193.1 |
| Wbbjets | 29.2 | 46.0 | 16.8 | 25.7 | 75.2 | 42.5 |
| WW | 17.9 | 23.1 | 2.0 | 3.3 | 41.0 | 5.3 |
| WZ | 5.4 | 7.3 | 2.2 | 2.9 | 12.7 | 5.1 |
| S/B | 0.19 | 0.19 | 0.21 | 0.19 | 0.19 | 0.20 |
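The S/B row of the table above can be checked with a few lines, assuming that the three single-top channels (t-channel, s-channel, Wt-channel) together make up the signal and that all other rows are background. The dictionary and function names are hypothetical; the yields are copied from the two combined ("both") columns.

```python
# (both 1 b-jet, both 2 b-jet) yields copied from the table above.
SIGNAL = {
    "t-channel": (679.6, 249.4),
    "s-channel": (27.9, 17.8),
    "Wt-channel": (210.4, 68.1),
}
BACKGROUND = {
    "ttbar to lep+jets": (1629.0, 476.0),
    "ttbar to lep+lep": (727.9, 992.1),
    "Wjets": (2350.3, 193.1),
    "Wbbjets": (75.2, 42.5),
    "WW": (41.0, 5.3),
    "WZ": (12.7, 5.1),
}


def s_over_b(column):
    """Signal-to-background ratio for column 0 (1 b-jet) or 1 (2 b-jet)."""
    s = sum(yields[column] for yields in SIGNAL.values())
    b = sum(yields[column] for yields in BACKGROUND.values())
    return s / b
```

Under that assumption, `s_over_b(0)` and `s_over_b(1)` round to 0.19 and 0.20, matching the last row of the table.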

The S/B ratio is about 0.2. This is high compared to other single-top analyses because in this analysis, the signal is combined.
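The tagging step described in the preselection paragraph above, where the tag rate function (TRF) is compared to a random number instead of being used as a weight, might look roughly like the sketch below. The event layout (an identifier with a list of per-jet tag rates) and all function names are assumptions made for illustration; the actual TRFIP2D values are not reproduced.

```python
import random


def tag_jets(tag_rates, rng):
    """Tag each jet when a uniform random number falls below its TRF."""
    return [rng.random() < rate for rate in tag_rates]


def select_by_tags(events, rng):
    """Group events by number of tagged jets, keeping only 1 or 2 tags.

    events : iterable of (event_id, [per-jet tag rates])
    Returns {1: [event ids], 2: [event ids]}; other events are dropped,
    which acts as the maximum tagged jet cut.
    """
    groups = {1: [], 2: []}
    for event_id, tag_rates in events:
        n_tagged = sum(tag_jets(tag_rates, rng))
        if n_tagged in groups:
            groups[n_tagged].append(event_id)
    return groups
```

Passing a seeded generator such as `random.Random(12345)` keeps the tagging decisions reproducible between runs, which matters because each event is either kept or dropped rather than weighted.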
 

Additional Variable Selection (NEEDS MODIFICATION, NOT ACCURATE)

This analysis required the use of several variables beyond those used for preselection cuts. Explanations are given here:
 