SPR and TMVA Programming Packages
This page gives an overview of how to use TMVA and SPR (also known as StatPatternRecognition), which are two multivariate analysis packages. For a general overview of multivariate techniques, I recommend the following pages:
For SPR in particular, I recommend
For TMVA in particular, try
TMVA is pretty user friendly. Make sure to read the README for installation instructions. The data files are referred to within the macros you run in ROOT, located in the macros directory. At the top of the macro are booleans for each of the analysis methods; setting one to 0 or 1 tells the program whether or not to run that method on this pass. The program can read data in ASCII or ROOT format, or from a webpage. It makes a list of the selected analysis methods, specifies the name of the output file, and then loads the data files (signal and background) you specify within the program. You then add the signal and background trees, each with a weight (I set this to 1), again specifying the names of the trees within the program. Next you tell the "factory" which variables you are using by adding them, specifying each variable's type at the same time. It should be noted that the testing phase of the analysis is done in TMVApplication.C with the reader object. The program allows you to split a training sample into train and validation samples; however, it seems that you need to load the trees for each of the samples and then choose the "block" split mode in the factory->PrepareTrainingAndTestTree method (called right before the classifiers are processed).
NOTE: TMVA does not like doubles! The training part may run ok, but the later validation parts only accept floats. If you are unable to change the data type, you may need to rewrite some of the code.
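The workflow described above can be sketched as a minimal training macro. This is a hedged sketch rather than a drop-in script: the file names, tree names, variable names, and option strings below are placeholders, so compare against the example macros shipped in the macros directory for the exact API of your TMVA version.

```cpp
// Sketch of a TMVA training macro -- all names are placeholders.
#include "TFile.h"
#include "TTree.h"
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void train_sketch()
{
   // Output file named at the top, then the factory is created.
   TFile* outFile = TFile::Open("TMVAout.root", "RECREATE");
   TMVA::Factory* factory =
      new TMVA::Factory("TMVAClassification", outFile, "");

   // Load the signal and background files and add their trees,
   // each with a weight (set to 1 here).
   TFile* sigFile = TFile::Open("signal.root");
   TFile* bkgFile = TFile::Open("background.root");
   TTree* sigTree = (TTree*)sigFile->Get("tree");
   TTree* bkgTree = (TTree*)bkgFile->Get("tree");
   factory->AddSignalTree(sigTree, 1.0);
   factory->AddBackgroundTree(bkgTree, 1.0);

   // Declare the variables and their type ('F' = float).
   factory->AddVariable("var1", 'F');
   factory->AddVariable("var2", 'F');

   // "Block" split mode, for separately loaded train/validation trees.
   factory->PrepareTrainingAndTestTree("", "SplitMode=Block");

   // Book each selected method with its parameter string.
   factory->BookMethod(TMVA::Types::kBDT, "BDT", "NTrees=100");

   // Train, test, and evaluate everything, then save.
   factory->TrainAllMethods();
   factory->TestAllMethods();
   factory->EvaluateAllMethods();
   outFile->Close();
}
```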
You can add a weighting variable from your tree by telling the factory to SetSignalWeightExpression() or SetBackgroundWeightExpression(); both seem to work.
NOTE: Although these expressions are set, the classifier outputs in the application phase do not come out with these weights. Thus, you will have to weight these outputs, which are per event, by hand to get the correct values and significance levels, if you are calculating these.
In the training phase, everything is normalized in TMVA. At this point you can also specify cuts; TMVA includes a cut analysis option, which some might find beneficial. The program then contains several if statements to ensure that only methods on the list are processed. For each selected analysis type, the factory does a BookMethod, and this is where you specify that method's parameters. When this is done, the factory is told to train, test, and evaluate all methods before saving the file and outputting a GUI. (This is not done in batch mode.) The GUI is pretty nice and allows you to look at whichever plots you want (and there are a fair number available).
To do the next phase of analysis, you will need to use a different program. The validation phase is based on TMVApplication.C. This program starts off like the last one, setting the analysis methods by 0's or 1's and making a list of the desired methods. However, then it declares a reader object, after which you must declare the variables that you used in the last step and add them to the reader object.
NOTE: if you declare variables as doubles here and add them to the reader, when you run in ROOT, there will be an error. The code only accepts ints and floats.
Then you book your methods. Here the argument is a weight file, generated in the last step for each method; the weights directory within macros contains these files (unless otherwise coded). At this point you must declare and initialize the output histograms yourself. Now we can load the new data files, in the same manner as the last program, and deal with variables again: you get a tree and set the branch address for each variable. If you want to look at histograms for both trees in one program, you will need to repeat this code later in the program for the second tree. Then we finally get to use the reader's EvaluateMVA method, which returns the classifier output, with which we then fill our histograms. At this point we write the histograms to a ROOT file. The application phase does not have a GUI function, so to see the histograms you will need to draw them or look in the ROOT file.
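The application phase can be sketched as below. Again this is a hedged sketch with placeholder names (files, trees, variables, and the weight-file path); check the TMVApplication.C shipped with your TMVA version for the exact form.

```cpp
// Sketch of a TMVA application macro -- all names are placeholders.
#include "TFile.h"
#include "TTree.h"
#include "TH1F.h"
#include "TMVA/Reader.h"

void apply_sketch()
{
   TMVA::Reader* reader = new TMVA::Reader("");

   // Variables must be floats and must match the training step.
   Float_t var1, var2;
   reader->AddVariable("var1", &var1);
   reader->AddVariable("var2", &var2);

   // Book each method with its weight file from the training step.
   reader->BookMVA("BDT method",
                   "weights/TMVAClassification_BDT.weights.xml");

   // Declare and initialize the output histogram by hand.
   TH1F* hBDT = new TH1F("hBDT", "BDT output", 100, -1.0, 1.0);

   // Load the new data, set branch addresses, and loop over events,
   // filling the histogram with the classifier output per event.
   TFile* dataFile = TFile::Open("validation.root");
   TTree* tree = (TTree*)dataFile->Get("tree");
   tree->SetBranchAddress("var1", &var1);
   tree->SetBranchAddress("var2", &var2);
   for (Long64_t i = 0; i < tree->GetEntries(); ++i) {
      tree->GetEntry(i);
      hBDT->Fill(reader->EvaluateMVA("BDT method"));
   }

   // Write the histograms to a ROOT file; there is no GUI here.
   TFile out("application_out.root", "RECREATE");
   hBDT->Write();
   out.Close();
}
```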
In general, the TMVA analysis methods are the following (most methods have several variations). Refer to the user's guide for the different parameters of each function. The list below is summarized from the user's guide:
- Cuts: rectangular cuts
- Likelihood: uses probability density functions to model the signal and background (the Probability Density Estimator, PDE, approach)
- PDERS: PDE Range Search - a generalization of the likelihood approach to the full number of input-variable dimensions
- KNN: k-Nearest Neighbor - like PDERS but more adaptive to local data density (good for irregular signal, background boundaries)
- HMatrix: deprecated; Fisher usually performs better
- Fisher: most useful when the signal and background means are not equal, since Fisher works to distinguish the means
- FDA: Function Discriminant Analysis - intermediate between linear methods (Fisher) and nonlinear ones (neural networks). Basically it tries to fit a user-provided function to the data.
- MLP (and other Neural Networks): sensitive to highly correlated variables - may crash if they are too correlated. See the user's guide for a visual description.
- Boosted Decision Tree (and decorrelated version): See user's guide for visual description.
- RuleFit: creates ensemble of rules (like cuts) from a forest of decision trees
- SVM: Support Vector Machine - see user's guide for description
- Plugin: a user-defined method
SPR is slightly less user-friendly than TMVA. It is, however, a bit faster. It is set up to be run from the command line, although it is also quite possible to run it within ROOT; most of the online presentations and examples about how to use the program, though, assume the command line. There are a few examples of how to run the program in ROOT included with the package, but they take some experimentation. If you use these examples, be sure to change all of the file directories to match the way yours is set up. For instance, the data file names are written as if they were in the same directory as the ROOT program, when in fact they are in a data directory, at least when the programming package is first downloaded.
To get SPR, google StatPatternRecognition and download the appropriate files. It is also on CVS. The installation of this package is more difficult than TMVA's, but for the MSU ROOT setup you can just follow the instructions and you should be ok. Our path to the ROOT installation is $ROOTSYS. If you ever change part of the actual code (such as files in the src directory) you will need to recompile: do a ./makedict.sh before doing a make. In general, however, you should be able to just use macros you write in the root directory, and these don't require any sort of recompile.
There are two directories to worry about in this package if you are using ROOT: the root directory, which contains the macros, and the data directory, which contains files with variable and data-location information. These data files have the extension .pat. A .pat file should contain the tree name, the leaves (variable names), and the files for signal and background. Each file line has two other parts: directly after the file name one may write 0-, which indicates to use all of the file, and after this one writes either a 0 or a 1, where 0 corresponds to background and 1 corresponds to signal.
NOTE: Tree names may need a ;1 appended if this shows up in the TBrowser as part of the tree name, and variables may need a _ in front of them if this shows up in the TBrowser as part of the leaf name.
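Putting those pieces together, a .pat file might look something like the following. The tree name, leaf names, and file names here are invented for illustration, and the exact layout may differ between SPR versions, so compare with the .pat files shipped in the data directory before relying on this:

```
MyTree
var1
var2
signal.root 0- 1
background.root 0- 0
```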
Once the .pat file is ready, we can use a macro in the root directory with our data. The spr_tutorial.C program is a useful overview of the various things a program can do. You should also look at spr_transform.C, which shows how to transform variables (this includes normalizing them); unlike TMVA, the variables are not normalized for training unless you set this up yourself. The first line of the macro will load a library - which I also had to do separately in ROOT itself when working at CERN, while at MSU simply loading it within the program was fine. You will have to try it and see if ROOT complains or not. You will also need to load spr_plot.C within ROOT for the program to work; this macro is responsible for making the plots we are interested in, and things like plot colors and pad layout are set within it. After this, we load our .pat file. You can either load two separate files for training and testing, or you can split one file's data. Here "testing" refers to the validation sample. There is then a line about choosing classes (in theory you can add more classes corresponding to numbers 2, 3, etc., but they must be ints). Then a plot of variable correlations is made; please refer to spr_tutorial.C for the coding form of this. Next, the various analysis methods you wish to use are added and their parameters specified. The data is trained and tested, and more plots are made, including classifier-output plots. Each analysis method needs its own chunk of code to make its own classifier-output plot, unless you can find a better way to program it. The tutorial also shows how to save the classifiers in a .spr file so they can be used later (perhaps tested over a different data set). There is no separate validation-step example program, at least not that I have found. What I have done is to save the classifiers and, in a separate step, load the classifiers and use the test method over a new set of validation data.
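A skeleton of that ROOT-based flow might look like the following. The SprRootAdapter method names here are taken from memory of spr_tutorial.C and may not match your version exactly, so treat this purely as a map of the steps and check the tutorial for the real calls and arguments:

```cpp
// Sketch only -- method names follow spr_tutorial.C and may differ
// in your SPR version; file names and paths are placeholders.
gSystem->Load("libSPR.so");      // load the SPR library (path may differ)
gROOT->LoadMacro("spr_plot.C");  // plotting helpers used by the tutorial

SprRootAdapter spr;
spr.loadDataFromAscii(1, "train.pat", "train");  // training sample
spr.loadDataFromAscii(1, "test.pat",  "test");   // validation sample

// Add the classifiers you want and set their parameters here
// (see spr_tutorial.C for the exact add* calls), then:
spr.train();
spr.test();

// Save the trained classifiers to a .spr file for a later
// validation step over a different data set.
spr.saveClassifier("bdt", "TrainedBDT.spr");
```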
It is also possible to run SPR from the command line; in fact, it is really set up for this rather than for the ROOT method. Each of the classifiers has its own "App", located in the bin/ directory. Each application has various options, which can be displayed (with explanations) by typing ./***App -h (where ***App is the application you are interested in). The following example trains the data with a boosted decision tree, and then runs the validation background and signal separately over the classifier, so that the plot of the classifier output, located in the generated ROOT file, will show only the signal or background, respectively, not a combination of both. Be careful to check the plots to be sure the weight has been applied (if you used one in the .pat file); I usually got it to show up using the -w option. From within the bin directory, with the .pat files located in that directory:
./SprAdaBoostDecisionTreeApp train_root.pat -n 100 -l 1500 -c 3 -M 0 -f TrainedBDT
./SprOutputWriterApp -w TrainedBDT signal_root.pat signal_out.root
./SprOutputWriterApp -w TrainedBDT bkgd_root.pat bkgd_out.root
The -n option refers to the number of trees, and the -l option to the number of events per node. The -c option chooses what parameter to optimize against (Gini index, S/sqrt(S+B), etc.; use the -h option for a complete listing), and the -M option chooses the type of boosting algorithm (discrete, real, or epsilon). TrainedBDT is the name the classifier is saved under, so it can be reloaded in the next two lines.
There is no user's guide explaining what the different parameters are for the different analysis methods, although the README is somewhat helpful. However, all of these methods are located within src/SprRootAdapter.cc, and the names of the parameters and their types will usually tell you something. Some plotting methods are also located here (histogram, for instance), referring to methods located in src/SprPlotter.cc. The fillSandB() method in there gets the numbers that are used to calculate efficiencies and classifier outputs.
Some of the analysis methods are included in both TMVA and SPR, but the coding is different and the input parameters will differ. SPR also has the option to optimize on, say, S/sqrt(S+B) rather than the Gini index. That said, the methods included in SPR are:
- Fisher: log ratio of likelihoods, includes linear (1) and quadratic (2) options- also known as Linear or Quadratic Discriminant Analysis
- LogitR: logistic regression- fits data to logistic curve
- Decision Tree: see http://xxx.lanl.gov/pdf/physics/0507143v1
- Topdown Tree: like the decision tree, but has a continuous option. It is less interpretable, but faster when reapplied to new events.
- Bump Hunter: like decision tree, but tries to find region with FOM (figure of merit) overall optimized
- Boosted Binary Splits: see http://www.hep.caltech.edu/%7Enarsky/SPR/SPR_Workshop_Dec2005.pdf
- Boosted Decision Tree: weights of misclassified events are enhanced for additional runs
- Random Forest: combines bagging and a random selection of input variables for each split
- arc-x4: This is a different kind of boosting than adaBoost, which increases the weight the more often an event is misclassified. It is also typically more successful at reducing variance than bagging, and is available as an option in the random forest classifier in SPR
- Bagging: take N points (with replacement) from a sample of size N - this is a bootstrap replica. Then make many decision trees from these replicas and use a majority vote to classify new data
- StdBackprop (a neural network): see section in http://xxx.lanl.gov/pdf/physics/0507143v1
NOTE: Unlike TMVA's boosted decision tree, SPR has a leaf size parameter that generally works better if it is larger, ~5-50% of the number of events in the smallest data set. TMVA has a minimum number of events, which is a stopping parameter and will probably be on the lower side. Also, decision trees without AdaBoost, like the DecisionTree, don't appear to be continuous, so your classifier output plot may look "funny". Additionally, the neural network performs better if you normalize first, which requires a transformation (see spr_transform.C).
- 25 Jun 2008