Significance Estimations using Multivariate Analysis of Single Top Monte Carlo

Here, we return to early data (below 1 fb-1), looking at the impact of a multivariate analysis on the ability to observe the single-top signal. EarlyDataStudy examined the ability of a cut-based analysis to do so, and this study is an extension of that analysis. The same data set, with the same preselection cuts, is used, with one small modification: duplicated events were found in the signal files of the original set (duplicates have a severe impact on a multivariate analysis), and these were removed by hand so that no two events in the same signal channel share an event number. The analysis on this set showed an improvement over the cut-based analysis and the possibility of at least evidence for the single-top process in early data.
The updated event yields, weighted to 100 pb-1, can be found in the table below for samples containing 1 or 2 b-tagged jets, and muons or electrons. Two selections are studied in this analysis, one isolating events with 1 b-tagged jet and the other isolating events with 2 b-tagged jets. As in the earlier, cut-based analysis, the early b-tagger IP2D is used. Some of the pT cuts are also adjusted to slightly lower thresholds; for example, the second-leading jet pT is required to be greater than 25 GeV. These adjustments are the same as in the cut-based analysis.
 
The S/B ratio is about 0.2. This is high compared to other single-top analyses because here the signal channels are combined. There is probably also some impact from the slightly different cut selection and the different b-tagger. The proportions of the backgrounds differ from the CSC note (here there is more W+Jets relative to ttbar). This is understood and is a result of lowering the pT thresholds and changing the b-tagger. The overall effect of the cut changes is to increase the total number of events and slightly increase the S/B ratio. The increased proportion of W+Jets may also improve the results of the analysis, as it is likely easier to remove than ttbar, but this has not been studied extensively.
Additional Variable Selection

This analysis required the use of several variables beyond those used for the preselection cuts. Explanations are given here:
 
When considering the variables used in this analysis, the significance was estimated from the validation sample several times while varying the number of variables used, to see whether the dimensionality could be reduced without a loss of significance. However, there appears to be a roughly linear correlation between significance and dimensionality, as can be seen in the figure below. This indicates that more variables may be useful in future studies and that, for this study, all 33 variables should be used to optimize the significance.
 
The figures below show the efficiency curves from the best TMVA and SPR classifiers, zoomed in on the interesting region in the lower left. When a cut on the classifier output is chosen to maximize the significance in both cases, the resulting efficiencies show TMVA to have a lower signal efficiency than SPR, resulting in a classifier that does not perform as well. From this point on, the SPR random forest classifier was the focus.
 
Multivariate Analysis with Systematics

To estimate the systematics, the jet energies in the events were scaled up and down by 10 percent, as in the cut-based analysis, and the b-tagging efficiency was also varied up and down to determine the average effect of each systematic. The average difference from the unaltered background events was found and converted into a percentage of the background sum for each systematic. After a cut was made, the remaining background events were multiplied by these percentages to determine the error from the b-tagging and jet energy scale systematics. The luminosity error was taken as 20 percent of the total background events for the cross-section and Vtb calculations (it is not included in the significance). The background cross-section error was estimated at 12 or 14 percent after preselection cuts, given a 10 percent error for ttbar and 20 percent for W+Jets. ISR/FSR was taken at 10 percent, and an additional 5 percent was allocated for other errors. Additionally, the square root of the sum of the squares of the weights was calculated to determine the Monte Carlo statistical uncertainty. The error percentages on the background sum after a classifier output cut for SPR's arcx4 classifier are given in the table below.
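As a concrete illustration, the combination described above might be sketched as follows. The quadrature sum and all names here are our own assumptions for illustration, not taken from the analysis code:

```python
# Hypothetical sketch: combine the per-systematic percentages of the
# background sum with the MC statistical term (sqrt of summed squared
# weights) into one background uncertainty. Quadrature is assumed.
from math import sqrt

def background_uncertainty(weights, syst_fractions):
    """weights: per-event MC weights of the surviving background events.
    syst_fractions: fractional errors on the background sum (e.g. 0.10)."""
    b_sum = sum(weights)
    # MC statistical uncertainty: square root of the sum of squared weights
    mc_stat = sqrt(sum(w * w for w in weights))
    # Each systematic is taken as a percentage of the background sum
    syst = sqrt(sum((f * b_sum) ** 2 for f in syst_fractions))
    return b_sum, sqrt(mc_stat ** 2 + syst ** 2)
```

For example, eight background events of weight 0.5 with 10% and 5% systematics give a background sum of 4.0 with a total uncertainty of about 1.48.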
 
The significance was calculated in two ways, a sideband method (cross-check) and a Gaussian-pdf method (primary), both of which involve a frequentist approach to estimating the significance ("Evaluation of three methods for calculating statistical significance when incorporating a systematic uncertainty into a test of the background-only hypothesis for a Poisson process", R.D. Cousins, J.T. Linnemann, and J. Tucker, 23 Aug. 2008). The systematics discussed above are used in the Gaussian-pdf calculation, which assumes a Gaussian pdf for the background mean. The sideband method requires a cut to determine the amount of background in a no-signal region and is discussed later. In both cases, the calculation was done with the code in Appendix E of that paper. For the sideband method, a leftward cut is needed to isolate a background region; it was required that S/B remain below 5 percent and that a significant number of background events remain for comparison. Statistics also had to be considered: the significance estimation requires a Gaussian shape for the background mean, and without enough statistics the results can be misleading. To allow comfortably for a Gaussian shape, a requirement of at least 18 unweighted background events was imposed. The chosen classifier in SPR has a significance curve that peaks well before this cut becomes an issue, although the cut played some role in the overall choice of classifier parameters.
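A minimal numerical sketch of the Gaussian-pdf idea is given below: the Poisson tail probability is averaged over a truncated Gaussian pdf for the background mean and then converted to a one-sided Z value. This follows the spirit of the Cousins, Linnemann, and Tucker approach, but the function names and integration scheme are our own, not the Appendix E code:

```python
# Sketch of a Gaussian-pdf significance: average the Poisson tail
# probability over a Gaussian (truncated at zero) for the background
# mean, then convert the p-value to a one-sided Z.
from math import exp, log, lgamma
from statistics import NormalDist

def poisson_tail(n_obs, mu):
    """P(n >= n_obs | mu) for integer n_obs >= 0 and mu > 0."""
    if n_obs == 0:
        return 1.0
    cdf = sum(exp(-mu + k * log(mu) - lgamma(k + 1)) for k in range(n_obs))
    return max(0.0, 1.0 - cdf)

def z_value(n_obs, b_mean, b_sigma, steps=4000):
    """One-sided significance with the background mean smeared by a Gaussian."""
    prior = NormalDist(b_mean, b_sigma)
    lo, hi = max(1e-9, b_mean - 5 * b_sigma), b_mean + 5 * b_sigma
    db = (hi - lo) / steps
    p_tail, norm = 0.0, 0.0
    for i in range(steps):
        b = lo + (i + 0.5) * db
        w = prior.pdf(b) * db      # weight of this background hypothesis
        norm += w                  # renormalize for the truncation at zero
        p_tail += w * poisson_tail(n_obs, b)
    return NormalDist().inv_cdf(1.0 - p_tail / norm)
```

As expected, smearing the background mean lowers the significance relative to a fixed-background calculation.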
Running the trained random forest classifier over the final sample, the yield sample, gave significance values of 3.7 sigma for the 1 b-tagged selection and about 3.2 sigma for the 2 b-tagged jet selection, with significances from the sideband estimation only slightly lower, as can be seen in the table below. The table also shows the expected signal and background obtained while estimating the Gaussian-pdf-based significance. If the weights are adjusted to a luminosity of 1 fb-1, then we expect 4.1 sigma significance for the 2 b-tagged sample and 4.4 sigma for the 1 b-tagged sample, even with the limitations of this study. Luminosity plots are shown below.
 
Once the cross-section is obtained, the Vtb value from the quark mixing matrix can also be calculated. In this case, the signal theoretical cross-section is also included in the systematic uncertainties, with a value of 5%, although it has a negligible impact on the overall numbers. Expressing Vtb,exp as
 
and taking Vtb,sm to be 1, since the ratio of the cross-sections in this case is also 1, the uncertainties may be propagated directly to give the Vtb,exp value. For the 1 b-tagged jet selection, Vtb,exp is 1.0 +0.16/-0.16 (sys) +/- 0.10 (stat) +/- 0.10 (lum); for the 2 b-tagged jet selection, Vtb,exp is 1.0 +0.21/-0.21 (sys) +/- 0.13 (stat) +/- 0.10 (lum).
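The propagation can be made explicit: since Vtb,exp scales as the square root of the cross-section ratio, each relative cross-section uncertainty is halved when carried onto Vtb. A small illustrative helper (our own, not the analysis code):

```python
# Illustrative propagation: with V = sqrt(sigma_exp / sigma_th), a
# relative error r on the cross-section gives a relative error r/2 on V.
def vtb_exp(sigma_exp, sigma_th, rel_errs):
    """rel_errs: dict of relative cross-section uncertainties, e.g. {'lum': 0.20}."""
    v = (sigma_exp / sigma_th) ** 0.5
    return v, {name: 0.5 * r * v for name, r in rel_errs.items()}
```

With a cross-section ratio of 1 and the 20 percent luminosity error, this gives +/- 0.10 on Vtb, consistent with the quoted numbers.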
Both selections show large uncertainties, which is to be expected in early data. It should also be noted that this analysis leaves some room for improvement by future analyses. All of the variables listed in this paper were used, and additional variables may improve the result. There could also be some improvement from combining the two selections, which is beyond the scope of this study.
Additional Studies

Effect of Statistics on Significance
Even with a combined signal analysis, the raw counts may be somewhat low for a 33-variable space. Thus, statistics were considered to see what sort of improvement, if any, increased statistics would give to the results. Several training subsets were made, each with a percentage of the total training set. At least eight samples were made for each subset size, and of these, the sample with the highest significance was chosen. The figure below on the left shows S/sqrt(B) versus the percentage of the total training set used for training the classifier. The 2 b-tagged set seems to level off somewhat with more training events, but the 1 b-tagged sample is still increasing. Thus, more events in the training sample would improve the significance, particularly in the 1 b-tagged set. Additionally, if the proportions of events are kept the same, the validation set would also grow in number of events, and the MC statistical uncertainty should decrease as well.
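The subset procedure above can be sketched as follows, with a caller-supplied `score` function standing in for training a classifier and evaluating its validation significance (all names here are illustrative, not from the analysis code):

```python
# Sketch: draw several random subsets at a given fraction of the training
# set and keep the one with the highest score (e.g. significance).
import random

def best_of_subsets(train, fraction, n_trials, score, seed=0):
    """Return (best_score, best_subset) over n_trials random draws."""
    rng = random.Random(seed)
    k = max(1, int(fraction * len(train)))
    best_score, best_subset = None, None
    for _ in range(n_trials):
        subset = rng.sample(train, k)   # draw without replacement
        s = score(subset)
        if best_score is None or s > best_score:
            best_score, best_subset = s, subset
    return best_score, best_subset
```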
 
 
Variation in Significances of Yield and Validation Samples

In doing this study, it was found that the significances from the validation and yield samples were somewhat different, and it was desirable to repeat these calculations many times to see whether this result was chance or a trend, for the benefit of future studies. For the 1 and 2 b-tagged jet samples, eight randomized training sets were produced; the ones that gave the highest validation significances were used in the main study of this paper. All of these sets were then used to train classifiers for both the TMVA boosted decision tree and the SPR random forest. Each of these classifiers was applied to the validation and yield samples, and the significances, using the Gaussian-pdf method, were noted. These significances are shown in the figures below, on the left for the 1 b-tagged jet selection and on the right for the 2 b-tagged jet selection. In general, the classifiers show a range of significances for both the validation and yield samples. This variation in significance values may help explain slight fluctuations in the 2 b-tagged sample curve in the previous section that could not be explained by the impact of the unweighted background count cut. Additionally, a line with slope 1 is plotted in the figures below. For both classifiers and b-tagging selections, most points lie below this line, indicating that the validation significances are generally higher than the yield significances. The classifier parameters are chosen to optimize the validation significance, so these plots may indicate that the parameters are too finely tuned to the validation sample. In general, it seems that selecting a classifier that gives a high validation significance will not necessarily give a high yield significance. It may be more useful for future studies to combine these classifiers rather than choosing a single one.
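The slope-1 comparison in these figures amounts to counting how many classifiers end up with a yield significance below their validation significance; a trivial helper (our own, for illustration) makes that concrete:

```python
# Count the fraction of (validation, yield) significance pairs that fall
# below the slope-1 line, i.e. where the yield significance is lower.
def fraction_below_diagonal(pairs):
    """pairs: (validation_significance, yield_significance) per classifier."""
    below = sum(1 for validation, yld in pairs if yld < validation)
    return below / len(pairs)
```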
 
Additional studies have also been done looking at the MC statistical uncertainty. Specifically, it was desired to know, if more events were generated, which events would improve the statistical uncertainty the most. By removing the MC statistical uncertainty for a particular event type from the overall uncertainty and comparing, it was found that not having to weight the t-channel MC would reduce the MC statistical uncertainty by 66%. This was the biggest improvement, and not a surprise, as only 50,000 t-channel events were generated, despite its large cross-section, and of those, many were removed because they were duplicated. Not having to weight the backgrounds improved the MC statistical uncertainty by about 30%. In the 1 b-tagged jet case, it was slightly preferable to increase the number of W+Jets events rather than ttbar; in the 2 b-tagged jets case, improvement in ttbar was much preferred over W+Jets. Overall, this analysis would benefit most from the generation of more t-channel events.
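The comparison described above can be sketched as follows: remove one sample's MC statistical term from the quadrature sum and see how much the total shrinks. The names and numbers here are illustrative only, not the analysis values:

```python
# Hypothetical sketch: fractional reduction of the total MC statistical
# uncertainty if one sample no longer needed weighting.
from math import sqrt

def stat_reduction(components, removed):
    """components: {sample: sqrt(sum of squared weights)} per event type."""
    total = sqrt(sum(v * v for v in components.values()))
    rest = sqrt(sum(v * v for k, v in components.items() if k != removed))
    return 1.0 - rest / total
```

A sample whose term dominates the quadrature sum (such as a poorly populated t-channel set) yields by far the largest reduction when removed.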
-- JennyHolzbauer - 06 Jan 2009
-- JennyHolzbauer - 01 Apr 2009
-- JennyHolzbauer - 09 Jun 2009
Although no actual data are considered in this analysis, it is still interesting to consider the cross-section and its uncertainties based on the simulated events. This calculation includes the systematic uncertainties listed earlier; a statistical uncertainty and a separate luminosity uncertainty of 20% are also included. The systematic uncertainties on the cross-section are listed below:
 
The following are the 33 variables used in the multivariate analysis:
 
Multivariate Analysis Programming Packages

Two analysis packages were initially considered, TMVA and SPR. In both cases, the general procedure was the same. The Monte Carlo data were split into three sets, to be used for the training, validation, and yield phases of the analysis. Because the SPR program is unable to handle negatively weighted events in the training cycle, these events were removed from the training set (although not from the validation or yield sets). Additionally, the events in the training sets were randomized to eliminate any order dependency from merging the trees from files of different event types. The randomized sets were used to train the classifiers, and the resulting significances were noted. For the purposes of this paper, the randomized data set that produced the best trained classifiers was used for validation purposes.
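The preparation step might look like the following sketch (our own code, assuming each event carries a single weight):

```python
# Sketch of the training-set preparation: drop negatively weighted events,
# since SPR cannot handle them in training, and shuffle to remove any
# ordering left over from merging the trees of different event types.
import random

def prepare_training_set(events, seed=0):
    """events: list of (weight, features) tuples from the merged trees."""
    train = [ev for ev in events if ev[0] > 0]   # remove negative weights
    random.Random(seed).shuffle(train)           # randomize event order
    return train
```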
An initial analysis was done to determine which of the classifiers best separated the background (ttbar, W+Jets, Wbb jets, WW, WZ) from the signal (t-channel, s-channel, Wt). In this phase, no systematics were included and efficiency curves were examined. The goal was to minimize the background efficiency and maximize the signal efficiency. It was found that the best classifier in TMVA was a boosted decision tree and the best classifier in SPR was a random forest. Classifier output plots can be seen below, in log scale, for both.
 
The arcx4 classifier output curve has bumps after the signal is separated from the background, and these events were examined separately to see whether they seemed reasonable. No spikes at unphysical values, for instance, were found. It was noted that the signal bumps contained two different signals, t-channel and Wt. However, there is no apparent reason to reject this classifier on the basis of this plot shape. The figure below shows the events from the signal-bump area (corresponding to a classifier output cut of 0.71) for the arcx4 classifier for a few common variables, based on the validation set.
> > 
 
Added:  
> > 
 
Multivariate Analysis with Systematics  
To estimate the systematics, the jet energies in the events were scaled up and down by 10 percent, as they were in the cut-based analysis, and the b-tagging efficiency was also altered up and down to determine the average effect of each of these systematics. The average difference from the unaltered background events was found and turned into a percentage of the background for each systematic. After a cut was made, the remaining background events were multiplied by these percentages to determine the error from the b-tagging and jet energy scale systematics. The square root of the sum of the squares of the weights was calculated to determine the Monte Carlo statistical uncertainty. Additionally, the luminosity error was taken as 20 percent of the total background events. The background cross-section error was estimated at 14 percent after preselection cuts, given a 10 percent error for ttbar and 20 percent for W+Jets. ISR/FSR was taken at 10 percent, and an additional 5 percent was allocated for other errors. The error percentages after a classifier output cut for SPR's arcx4 classifier are given in the table below.
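A minimal sketch of how such component errors could be combined into a total background uncertainty follows. It assumes a quadrature sum of the fractional systematics with the Monte Carlo statistical error, which the note does not state explicitly; all numbers and names below are illustrative, not the values measured here:

```python
import math

def background_uncertainty(b_yield, weights, syst_fractions):
    """Combine fractional systematic errors with the MC statistical error.

    b_yield        : weighted background yield after the classifier cut
    weights        : per-event weights of the surviving background events
    syst_fractions : fractional errors, e.g. luminosity 0.20, cross
                     section 0.14, ISR/FSR 0.10, other 0.05
    The quadrature combination here is an assumption.
    """
    # MC statistical error: sqrt of the sum of squared event weights
    mc_stat = math.sqrt(sum(w * w for w in weights))
    # each fractional systematic scales the surviving background yield
    syst2 = sum((f * b_yield) ** 2 for f in syst_fractions.values())
    return math.sqrt(mc_stat ** 2 + syst2)

# toy numbers only: 42 surviving events of equal weight, hypothetical
# JES and b-tagging fractions alongside the quoted flat systematics
err = background_uncertainty(
    b_yield=3.9,
    weights=[0.1] * 42,
    syst_fractions={"jes": 0.10, "btag": 0.10, "lumi": 0.20,
                    "xsec": 0.14, "isr_fsr": 0.10, "other": 0.05},
)
```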
 
Additional Studies

Effect of Statistics on Significance

Even with a combined signal analysis, the raw counts may be somewhat low for a 33-variable space. Thus, statistics were considered to see what sort of improvement, if any, increased statistics would give to the results. Several training subsets were made, each with a percentage of the total training set. For each percentage, at least eight such samples were made, and the sample with the highest significance was chosen. The figure below on the left shows the significance versus the percentage of the total training set used for training the classifier. The set with a 2 b-tagged jet selection has a significance that is generally flat, so more statistics may not dramatically improve the current result. The 1 b-tagged selection, however, has an oddly shaped curve, which is most likely not just related to the size of the training sample.
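The subset scan described above can be sketched as follows. The function name and the `evaluate` callable (which would wrap training a classifier on the subset and returning its significance) are illustrative, not the tools used in this study:

```python
import random

def best_significance_per_fraction(train_events, fractions, evaluate, n_trials=8):
    """For each training fraction, draw at least `n_trials` random subsets,
    train/evaluate each via `evaluate` (a callable returning a significance),
    and keep the best result for that fraction."""
    results = {}
    for frac in fractions:
        size = int(frac * len(train_events))
        sigs = [evaluate(random.sample(train_events, size))
                for _ in range(n_trials)]
        results[frac] = max(sigs)
    return results
```

Plotting the returned best significance against the fraction gives the kind of curve shown in the figure.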
Variation in Significances of Yield and Validation Samples

In doing this study, it was found that the significances from the validation and yield samples were somewhat different, and it was desired to repeat these calculations many times to see whether this result was chance or a trend, for the benefit of future studies. For the 1 and 2 b-tagged jet samples, eight randomized training sets were produced, of which the ones that gave the highest validation significances were used in the main study of this paper. All of these sets were then used to make trained classifiers for both the TMVA boosted decision tree and the SPR random forest. Each of these classifiers was applied to the validation and yield samples, and the significances, using the Gaussian pdf method, were noted. These significances are shown in the figures below, on the left for the 1 b-tagged jet selection and on the right for the 2 b-tagged jet selection. In general, the classifiers show a range of significances for both the validation and yield samples. This variation in significance values may help to explain slight fluctuations in the 2 b-tagged sample curve in the previous section that could not be explained by the impact of the unweighted background count cut. Additionally, a line with slope 1 is plotted in the figures below. For both classifiers and b-tagging selections, most points lie below this line, indicating that the validation significances are generally higher than the yield significances. The classifier parameters are chosen to optimize the validation significance, so these plots may indicate that the parameters are too finely tuned to the validation sample. In general, it seems that selecting the classifier with the highest validation significance will not necessarily give a high yield significance. It may be more useful for future studies to combine these classifiers rather than choosing a single one.
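One common reading of a "Gaussian pdf" significance is the conversion of a one-sided p-value into an equivalent number of Gaussian sigmas; the note does not spell out the exact convention used here, so the sketch below is an assumption. It inverts the Gaussian tail probability by bisection using only the standard library:

```python
import math

def p_to_z(p):
    """Convert a one-sided p-value to an equivalent Gaussian significance Z,
    i.e. solve p = 0.5 * erfc(Z / sqrt(2)) for Z by bisection.
    Assumes 0 < p <= 0.5 (Z between 0 and 40 sigma)."""
    lo, hi = 0.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        # the tail probability decreases as Z grows, so a tail still
        # larger than p means Z must be pushed higher
        if 0.5 * math.erfc(mid / math.sqrt(2.0)) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# e.g. a p-value of ~0.00135 corresponds to about a 3 sigma significance
print(round(p_to_z(0.00135), 2))  # -> 3.0
```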
 
-- JennyHolzbauer - 06 Jan 2009
-- JennyHolzbauer - 01 Apr 2009
The preselection cuts are the same as those used by other analyses in the single top group, although because this is a low luminosity study, the second jet Pt cut was adjusted to 25 GeV and the early data b-tagger, TRFIP2D, was used. The TRF was compared to a random number to determine whether an individual jet would be tagged, rather than being used as a weight. The Pt cuts for the leptons were also lower, set at 20 GeV for muons and 25 GeV for electrons. Additionally, because the W+Jets samples (not including Wbbjets) were FastSim files, the trigger cuts (EM25i, MU20i, or EM60) were applied by weighting events based on the trigger turn-on curves. After applying these cuts, the events were separated into groups according to the number of b-tagged jets, 1 or 2, effectively introducing a maximum tagged jet cut. The event yields, weighted to 100 pb-1, can be found in the table below for samples containing 1 b-tagged jet or 2 b-tagged jets, and muons or electrons.
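The per-jet tagging decision described above, comparing the tag rate function to a uniform random number instead of weighting the event, can be sketched as follows; the function names are illustrative:

```python
import random

def tag_jets(jet_trfs, rng=random.random):
    """Decide per-jet b-tags by comparing each jet's tag-rate-function (TRF)
    value against a uniform random number in [0, 1).  Returns the number of
    tagged jets, which is used to split events into the 1-tag and 2-tag
    samples.  `rng` is injectable so the decision can be made reproducible."""
    return sum(1 for trf in jet_trfs if rng() < trf)

# toy event with three jets and assumed TRF values
n_tags = tag_jets([0.5, 0.5, 0.6])
```

A fixed-seed or injected random sequence makes the tagging reproducible across reruns of the selection.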
Additional Variable Selection (NEEDS MODIFICATION, NOT ACCURATE)

This analysis required the use of several variables beyond those used for preselection cuts. Explanations are given here: