Working with datasets from the Grid
There is an overview on the MSU ATLAS webpage, which includes a link describing how to get permission to retrieve files from the grid. Once you have done this, you can view the files on the grid using the commands described on the website. Note that in this article, dataset refers to a collection of .root files; it is not necessarily collected ATLAS data.
Downloading datasets from the grid:
Once you have found a dataset or set of datasets that you are interested in analyzing locally, the next step is to download them. The DQ2 toolset is a set of tools for interacting with datasets on the grid. To begin, run the setup script; it will prompt you for your grid certificate password.
Now you are ready to use the DQ2 toolset. Two tools of interest will be covered here: dq2-ls is used to view datasets and the files within them, and dq2-get is used to retrieve datasets from the grid. New users should start by checking the built-in help to see the available options.
Often we are given a set of datasets that we are interested in, but they contain more variables and/or more events than we need. For example, the Top group creates a set of D3PDs that contain all of the events and variables that any conceivable top analysis might need. A specific single top analysis has much more specific needs, and thus does not need the full set of information provided by the Top D3PDs. Even worse, if the full Top D3PDs were downloaded onto the tier 3 work disks, they would eat up all of the free disk space.

We can significantly improve this situation by using techniques known as slimming and skimming. Skimming refers to selecting a subset of the events in a dataset and copying them into a new dataset. Slimming refers to selecting a subset of the variables in a dataset and copying them into a new dataset. In this section we will go through a script that allows us to simultaneously slim and skim datasets that are on the grid. Start by copying the skimming package:
cp -r /msu/opt/cern/scripts/SlimAndSkim .
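To make the two terms concrete, here is a toy illustration of skimming and slimming using plain Python dicts in place of ROOT trees (the variable names and cut are made up for the example, not taken from the package):

```python
# Toy events: each dict is one "event" with a few "branches".
events = [
    {"el_pt": 25.0, "el_eta": 0.4, "jet_n": 3},
    {"el_pt": 8.0,  "el_eta": 1.2, "jet_n": 1},
]

# Skimming: keep only the events passing a selection.
skimmed = [e for e in events if e["el_pt"] > 10.0]

# Slimming: keep only the variables we need from each surviving event.
keep = ["el_pt", "jet_n"]
slimmed = [{v: e[v] for v in keep} for e in skimmed]

print(slimmed)  # [{'el_pt': 25.0, 'jet_n': 3}]
```

The real package applies both steps at once on the grid, writing a new, smaller dataset.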
Understanding the slimming/skimming script
Inside there are a few important files we should examine.
Let's start with datasetlist.txt. This is an example of an input file: it contains a list of the input D3PDs, in this case our MC Top D3PDs. It should be straightforward to add or remove datasets of interest.
In variables.txt we have the list of variables we want to slim to: just specify one variable name per line.
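For illustration, a variables.txt might look like this (the branch names below are only examples, not taken from the actual file):

```
el_pt
el_eta
el_mediumPP
mu_pt
mu_tight
```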
selectevent.py contains the criteria used to skim events, and should be plainly readable. You can use any variable that is in the input D3PD, even if it isn't in variables.txt. Currently we require an electron that passes mediumPP with pT > 10 GeV, or a muon that passes tight with pT > 10 GeV.
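A sketch of the kind of selection selectevent.py applies is below. The branch names (el_pt, el_mediumPP, mu_pt, mu_tight) are assumptions chosen to match common D3PD conventions, not copied from the script:

```python
def select_event(event):
    """Keep the event if it has an electron passing mediumPP with
    pT > 10 GeV, or a muon passing tight with pT > 10 GeV."""
    # D3PD momenta are conventionally stored in MeV, hence 10000.0.
    for pt, is_medium in zip(event["el_pt"], event["el_mediumPP"]):
        if is_medium and pt > 10000.0:
            return True
    for pt, is_tight in zip(event["mu_pt"], event["mu_tight"]):
        if is_tight and pt > 10000.0:
            return True
    return False
```

The real script is handed each event from the input D3PD and returns True to keep it or False to drop it.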
Now that we are familiar with the components of the package, we can look inside the command-generating script, makeskimjob.py. There should currently be a line that specifies the output dataset name; right now it looks like this:
outds = 'user.kollj.'+ds.replace('merge.NTUP','SKIM').replace('/\n','_v1/')
You should replace kollj with your CERN username, and feel free to add any other special naming scheme you would like. Also of note are some of the parameters that it passes to prun:
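To see what that outds line actually produces, here is a worked example. The input dataset name below is invented purely for illustration; note that ds carries a trailing newline because it is read straight from datasetlist.txt, which is why the script replaces '/\n' rather than just '/':

```python
# A made-up input dataset name of the usual Top group form, as read
# from datasetlist.txt (including its trailing newline).
ds = "mc12_8TeV.117050.foo.merge.NTUP_TOP.e1234_s5678_r910/\n"

# The exact transformation from makeskimjob.py:
outds = "user.kollj." + ds.replace("merge.NTUP", "SKIM").replace("/\n", "_v1/")

print(outds)
# user.kollj.mc12_8TeV.117050.foo.SKIM_TOP.e1234_s5678_r910_v1/
```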
This is a list of GRID sites that have annoyed me at one point or another by being slow or failing jobs, so I tell jobs not to use these sites.
--nFilesPerJob 1 This is a workaround for a very annoying problem with Top D3PDs and trigger information. The Top D3PDs only have branches of trigger information for events that need them for the trigger selection. This means that if an input ROOT file does not have any events which require a trigger, say e22, then this branch does NOT exist in that ROOT file. This causes a problem when you are trying to merge ROOT files, because no method that I'm aware of correctly handles the merging of differently structured ROOT files. To work around this, we maintain a 1:1 ratio between input files and output files. By default, you could take, for example, 10 input files and create only 1 output file, but the structure of the output ROOT file is determined by the first input file. So if the first input file does not have e22 but files 2-10 do, then this information will be lost in the output file. If the Top group fixes this strange behavior of missing branches in their D3PDs, we can remove this option or change it to a larger number.
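The failure mode can be shown with a toy merge using plain dicts instead of ROOT files. This is only an analogy for the ROOT behavior described above, not actual ROOT code:

```python
def naive_merge(files):
    """Merge a list of {branch: values} dicts, with the output structure
    fixed by the FIRST input file -- mimicking the ROOT merge behavior."""
    merged = {branch: [] for branch in files[0]}
    for f in files:
        for branch in merged:
            merged[branch].extend(f.get(branch, []))
    return merged

file1 = {"el_pt": [12.0]}                    # no EF_e22 trigger branch
file2 = {"el_pt": [25.0], "EF_e22": [True]}  # has the trigger branch

merged = naive_merge([file1, file2])
print("EF_e22" in merged)  # False: the trigger information was silently lost
```

With --nFilesPerJob 1 there is never more than one input file per output file, so no branch can be dropped this way.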
Now the script is ready. Save it and we will run it.
Creating the job script
Using the setup from the previous section, we can run the script with the following command:
python makeskimjob.py datasetlist.txt outputfiles.txt makeMC.sh
This uses the information in datasetlist.txt to create two files. outputfiles.txt is a list of the skimmed dataset names that will be created when the skimming jobs finish; we will use this file later when downloading the skimmed datasets with dq2-get. More interesting right now is makeMC.sh, which contains one command per input dataset and will submit all of these jobs to the GRID.
Submitting the jobs to the grid
To run this list of skimming jobs, we simply have to set up the PanDA environment and then source the script:
You will be asked for your GRID password, and then it will begin submitting all of your jobs. You can monitor their status using the PanDA monitor webpage.
You can also use the pbook tool on the command line to monitor status and to retry failed jobs. Start by running pbook on the command line:
This should bring up a new command prompt. Here you can 'Show()' jobs or 'Retry(JobID)' a failed job. Press ctrl-D to exit.
Downloading and splitting skimmed datasets
Once the jobs are complete, we can download them to the tier 3. Because of the way the MSU tier 3 network is set up, downloads from the grid are much quicker on the msu4 machine than on green. To download, simply ssh into msu4, create a temporary directory, cd into it, and run dq2-get:
dq2-get -T 4,5 -F /path/to/listoffiles.txt
Once you begin the download, you can let it sit and fetch everything; when it's done, it will report any files that failed to download. Once they have all downloaded successfully, you are ready to use the supersplitter.py script to divide them among the disks. Copy it to your directory and open it up:
cp /msu/opt/cern/scripts/supersplitter.py .
Note the line:
This specifies which disks the files will be copied to. For MC, I recommend including any t3fast disks that are available, as MC is run on much more frequently than data (for systematics). If you look at the code, you can see that the output directory is hard-coded, so if the filesystem structure changes in the future this script will need to be modified. For now it is okay, so we will run it:
python supersplitter.py /path/to/datasets/ MC12a
The first argument is the path to the datasets you just downloaded. The second argument is the name you want this grouping of datasets to go by. Moving can take quite a bit of time for larger datasets, so be patient. When it is finished, the files will be evenly split between /msu/data/[disk]/single_top/MC12a/.
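The even split can be sketched as a simple round-robin assignment over the target disks. This is only an illustration of the idea; the disk names below are assumptions, and the real supersplitter.py may distribute files differently (e.g. by size rather than count):

```python
import os

def split_evenly(files, disks, group="MC12a"):
    """Assign files round-robin so each disk gets roughly the same count,
    building paths of the hard-coded /msu/data/[disk]/single_top/ form."""
    assignment = {}
    for i, f in enumerate(files):
        disk = disks[i % len(disks)]
        assignment[f] = os.path.join("/msu/data", disk, "single_top", group, f)
    return assignment

disks = ["t3fast1", "t3fast2", "t3work3"]  # example disk names
plan = split_evenly(["a.root", "b.root", "c.root", "d.root"], disks)
print(plan["d.root"])  # /msu/data/t3fast1/single_top/MC12a/d.root
```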
- 16 Jul 2008 -- JamesKoll
- 20 Aug 2009 -- JamesKoll
- 14 Aug 2012