Check of TMVA and SPR operations using the 1405006 MC
There are, in principle, several ways to do the TMVA analysis. However, because of issues writing validation or yield classifier information to a tree from TMVApplication, and because it is not clear precisely how TMVAnalysis splits the input file into training and testing (validation) files, only the approach below is recommended:
- Run TMVAnalysis to generate the classifier, then use TMVApplication to run additional data sets over this classifier. TMVApplication is also where you can add code to calculate the significance.
Comments on TMVAnalysis: TMVAnalysis can break up your input file into a training set and a validation set. However, it is not always clear exactly how this is done, and reproducing the split for comparisons with other programs such as SPR can be difficult. If you want your training, validation, and yield samples to be consistent between SPR and TMVA, I recommend specifying only one or two events for the testing (validation) sample within TMVAnalysis, so that the program runs, and doing the actual validation set calculations outside of it, in TMVApplication or spr_apply4.C.
SPR can theoretically be run from the command line or within ROOT. However, recent downloads of SPR have not included the directory with ROOT information, so I will discuss the command line option only. SPR has several programs, each of which generates a classifier with specified parameters. The input files and weights are specified in separate .pat files located in a data directory. After generating the classifier (a .spr file), you can run this over validation or yield samples using the outputwriter program. This step also produces .root files, which can be processed using programs like spr_apply4.C to calculate significances.
You will need one .pat file for training and one for validation (here there is also a third, for a validation file without negative weights). The file ../data/train_pattest.pat looks like this:
Leaves: _HT _Jet1Pt _DeltaRJet1Jet2 _WTransverseMass
File: /work/hx5/jenny/programs/groups/SingleTopRootAnalysis/Pat_1405006_weighted_NNfiles/Topology.SingleTop.1405006.FDR2.Electron.Background.Training.NoNeg.root 0- 0
File: /work/hx5/jenny/programs/groups/SingleTopRootAnalysis/Pat_1405006_weighted_NNfiles/Topology.SingleTop.1405006.FDR2.Electron.Signal.Training.NoNeg.root 0- 1
Here, Tree is the name of the tree in the folder, Leaves are the variable names, and Weight is the weight applied to all the events (you could also say WeightVariable: _EventWeight to specify a weighting variable in the tree). File indicates that an input file is being specified. 0- means to use all the events (you could specify a range instead), and the 0 on the first line and the 1 on the second line are identification numbers. By default, the program uses 0 for background and 1 for signal. You can also use more than one file for background or signal with additional numbers, as long as you specify on the command line which numbers are signal and which are background. For instance, if I have another signal file, 2, on the command line I would say '0:1,2'.
You can also generate these files with a Python script (I have attached one at the bottom of this page). In that case, run the following on the command line before the other commands in this section:
python python_runs_pat.py /work/hx5/jenny/programs/groups/StatPatternRecognition_newroot/data "_HT _Jet1Pt _DeltaRJet1Jet2 _WTransverseMass"
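The attached script is not reproduced here. As a rough illustration only, a generator for files of the form shown above might look like the sketch below. The function names, the header_lines argument, and the short file names are hypothetical and are not taken from python_runs_pat.py; only the Leaves and File line layouts come from the example above, and placing any extra header lines (such as Tree or Weight) before Leaves is an assumption.

```python
from pathlib import Path


def pat_text(leaves, files, header_lines=()):
    """Build the text of a .pat file.

    leaves       -- list of variable names for the Leaves line
    files        -- list of (path, event_range, class_id) tuples,
                    e.g. ("bkg.root", "0-", 0)
    header_lines -- optional lines (e.g. Tree or Weight) copied verbatim;
                    their placement before Leaves is an assumption
    """
    lines = list(header_lines)
    lines.append("Leaves: " + " ".join(leaves))
    for path, event_range, class_id in files:
        lines.append(f"File: {path} {event_range} {class_id}")
    return "\n".join(lines) + "\n"


def write_pat(out_path, leaves, files, header_lines=()):
    """Write the .pat text to out_path."""
    Path(out_path).write_text(pat_text(leaves, files, header_lines))


if __name__ == "__main__":
    leaves = "_HT _Jet1Pt _DeltaRJet1Jet2 _WTransverseMass".split()
    # Placeholder file names; substitute the real training files.
    files = [("background_training.root", "0-", 0),
             ("signal_training.root", "0-", 1)]
    print(pat_text(leaves, files), end="")
```

A second signal file would simply be another tuple with class_id 2, matching the '0:1,2' convention described above.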
Enter the following to train a random forest classifier and run a validation file over this classifier:
./SprBaggerDecisionTreeApp -n 100 -l 4 -s 4 -g 1 -y '0:1' -f bagtree_arcx4_pattest1.spr -t ../data/validate_nonegwgt_pattest.pat -d 5 ../data/train_pattest.pat
./SprOutputWriterApp bagtree_arcx4_pattest1.spr -y '0:1' ../data/validate_pattest.pat validate_out_pattest1.root
The first line tells SPR to use SprBaggerDecisionTreeApp (random forest) with 100 trees (-n), leaves of size 4 (-l), and 4 randomly chosen variables (-s). -f specifies the output file for the trained classifier (the .spr file), and -t specifies a file to use for validation. The last file on the line is the training sample.
The second line is the output writer, which takes the trained classifier, applies it to the specified validation set (.pat file), and writes the resulting information to a ROOT file.
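If you repeat these two steps for several variable sets, it can help to assemble the command lines in a script. The following is a minimal sketch, assuming the executables are in the current directory; the helper names are hypothetical, and the flags are copied from the two commands above. The -g and -d values are passed through unchanged because their meaning is not described on this page.

```python
import shlex


def train_command(classifier, train_pat, validate_pat,
                  n_trees=100, leaf_size=4, n_vars=4, classes="0:1"):
    """Argument list for the random forest training step shown above."""
    return ["./SprBaggerDecisionTreeApp",
            "-n", str(n_trees),      # number of trees
            "-l", str(leaf_size),    # leaf size
            "-s", str(n_vars),       # randomly chosen variables
            "-g", "1",               # copied verbatim from the command above
            "-y", classes,           # background:signal class numbers
            "-f", classifier,        # output .spr file
            "-t", validate_pat,      # validation .pat file
            "-d", "5",               # copied verbatim from the command above
            train_pat]               # training .pat file goes last


def apply_command(classifier, data_pat, out_root, classes="0:1"):
    """Argument list for the output writer step shown above."""
    return ["./SprOutputWriterApp", classifier, "-y", classes,
            data_pat, out_root]


if __name__ == "__main__":
    train = train_command("bagtree_arcx4_pattest1.spr",
                          "../data/train_pattest.pat",
                          "../data/validate_nonegwgt_pattest.pat")
    apply = apply_command("bagtree_arcx4_pattest1.spr",
                          "../data/validate_pattest.pat",
                          "validate_out_pattest1.root")
    print(shlex.join(train))
    print(shlex.join(apply))
```

The lists can be handed to subprocess.run(cmd, check=True) to execute each step from Python instead of printing it.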
Comparing SPR and TMVA
Unfortunately, SPR and TMVA generate their classifiers differently, with different parameters, so the classifier outputs cannot be compared directly. To check that both programs are working, we instead look at the significance (S/sqrt(B)) before any cut on the classifier is taken, with all of the other steps outlined above intact. I have done this twice, once setting all the weights to 3.0 and once using the EventWeight variable (times 3.0). The weights are multiplied by 3.0 to compensate for the division by 3 when splitting the events into training, validation, and yield samples.
NOTE: all of the programs used to do this are linked to this page, in text files, at the bottom of the page. Of the programs mentioned above, the only additional one is a python script used to generate the .pat files for SPR. The .pat files are not attached, but are of the form mentioned above.
I ran on the 1405006 MC with the electron selection, using the nonegwgt validation files for "testing" and the regular validation files, with all of the events, for validation (and thus for the significance calculations). I used the boosted decision tree classifier in TMVA and the random forest classifier in SPR, although this should not matter, and chose two events for testing in TMVA.
- With weight = 3.0, I got 5.78826 sigma for both TMVA and SPR, calculating S/sqrt(B) from the integrals of the signal and background histograms.
- With weight = 3.0*EventWeight, I got 3.99055 sigma for both TMVA and SPR, calculating S/sqrt(B) from the integrals of the signal and background histograms.
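For reference, the quantity being compared can be sketched in a few lines. This is not the code in TMVApplication or spr_apply4.C; it is a minimal illustration that assumes the integral of a weighted histogram equals the sum of the per-event weights (true when every event falls inside the histogram range). The function name and the toy numbers are hypothetical.

```python
import math


def significance(signal_weights, background_weights, scale=3.0):
    """Return S/sqrt(B) with no classifier cut applied.

    Each event weight is multiplied by `scale` (3.0 here, to undo the
    split into training, validation, and yield samples). For the
    "weight = 3.0" run, pass 1.0 for every event; for the
    "3.0*EventWeight" run, pass the EventWeight values.
    """
    s = scale * sum(signal_weights)
    b = scale * sum(background_weights)
    if b <= 0:
        raise ValueError("background sum must be positive")
    return s / math.sqrt(b)


if __name__ == "__main__":
    # Toy numbers only: 4 signal and 3 background events of unit weight.
    print(significance([1.0] * 4, [1.0] * 3))
```

Because S and B are both scaled by the same factor, the factor of 3.0 raises S/sqrt(B) by sqrt(3) relative to the unscaled validation sample.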
Using the random forest above, and a boosted decision tree with
factory->BookMethod( TMVA::Types::kBDT, "BDT", "!H:!V:NTrees=400:BoostType=AdaBoost:SeparationType=GiniIndex:nCuts=20:PruneMethod=CostComplexity:PruneStrength=1.5" );
in TMVA, I got, for the run with weight = 3.0*EventWeight,
- TMVA BDT: 3.99055 sigma (left integrated), 5.54816 sigma (right integrated)
- SPR RF: 4.0034 sigma (left integrated), 6.34963 sigma (right integrated)
I am not attaching any plots for now, but can upon request.
- 22 Apr 2009