Using HTCondor

Disclaimer

In my opinion, the official documentation for HTCondor is very good; its section on submitting a job is a useful starting point. Despite that, it doesn't contain any complete examples that will help the average person get started immediately. So here are some examples that I think are complete enough to adapt to one's own coding needs.

Minimal Requirements

To run jobs on the tier3's worker nodes, a user needs an executable or wrapper script, as well as a submit script. The executable or wrapper script is what the user wants the worker nodes to run, while the submit script is what HTCondor uses to create the job.

Executable/Wrapper Script

This can be any form of executable a user wants to run on a worker node, ranging from a simple unix command to a compiled executable to a complex shell script (also known as a wrapper script). The executable or wrapper script should include everything the user wants to run on the worker node, including copying/moving files from one place to another. Note: it is best practice to copy any code that will be run, as well as any data that will be run over, to the worker node. This avoids having the worker nodes constantly perform home directory or work disk i/o, which could cause the login nodes or work disks to become slow.

Wrapper Script Example:

#!/bin/bash

# Copy code and input files to the worker node's local disk, then cd to that space.
cp -r /pathToCode/code "$TMPDIR"/
cp /pathToInputs/inputFile.root "$TMPDIR"/
cd "$TMPDIR" || exit 1

# Compile code
cd code/
make
cd ..

# Run code on the input file to produce the output file
./code/framework inputFile.root outputFile.root

# Copy the output file to the work disk
mv outputFile.root /pathToWorkDisk/

# Cleanup
rm -fr ./*

Submit Script

The submit script is what HTCondor uses to create a job. An example is given below:

# Specify condor environment job should use
universe = vanilla
# Specify executable or wrapper script job will run
executable = wrapper.sh
 
# Specify log, output, and error files location and names
# Any of the below could also be given a path to where the file should be created
log = job.log
output = job.out
error = job.error
 
# Request resources
request_cpus = 1
request_disk = 20MB
request_memory = 20MB
 
# Queue some number of jobs
queue 1

The example submit script above will be used to create one job that runs "wrapper.sh" on a worker node. The different variables are defined below:
  • universe: the condor environment the job will run in. This should usually be vanilla, so no further description is given.
  • executable: the executable or wrapper script to be run on the worker node.
  • log: the location and name of the condor log file, which records how long the job has been running as well as resource usage such as memory and disk space.
  • output: the location and name of the file that records the job's standard output.
  • error: the location and name of the file that records the job's standard error.
  • request_cpus: used to request job slots with a certain number of cpus. Less useful in the current tier3 setup, where each job slot is one core.
  • request_disk: used to request job slots with a certain amount of disk space. Again, less useful in the current tier3 setup. This should be set to roughly how much disk space a job will use (approximately code size + input file size + output file size).
  • request_memory: used to request job slots with a certain amount of memory. Again, less useful in the current tier3 setup. This should be set to the maximum memory the job will use. Finding this value usually requires submitting test jobs and looking at the log file.
  • queue: how many copies of this particular job should be created. Can be given more complex options for submitting multiple jobs, as described below.
To submit a job to condor, use the command "condor_submit <submitScriptName>". It can be useful to give submit scripts the ".sub" file extension to differentiate them from other files.
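As a sketch of a typical session (the file name and job ID here are illustrative), submitting and then monitoring a job might look like:

```
# Submit the job described in job.sub
condor_submit job.sub

# Check the status of your jobs in the queue
condor_q

# Remove a job if needed, using the cluster ID reported by condor_submit/condor_q
condor_rm 123456
```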

Passing Arguments to the Executable

Some executables can take, or require, arguments. Condor can pass these arguments with the "arguments" variable.

For example, given the wrapper script below:

#!/bin/bash

# Copy code, a specific input file, and a specific configuration file to the
# worker node's local disk, then cd to that space.
cp -r /pathToCode/code "$TMPDIR"/
cp /pathToInputs/inputFile-$1.root "$TMPDIR"/
cp /pathToConfigs/config-$2.cfg "$TMPDIR"/
cd "$TMPDIR" || exit 1

# Compile code
cd code
make
cd ..

# Run code on the input file, using the configuration file, to make an output
# file with a specific name
./code/framework config-$2.cfg inputFile-$1.root outputFile-$3.root

# Copy output to work disk
cp outputFile-$3.root /pathToWorkDisk/

# Cleanup
rm -fr ./*
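The $1, $2, and $3 above are bash positional parameters, filled in order from the "arguments" line of the submit script. A minimal, self-contained sketch of how the filenames are built (the sample names are only examples):

```shell
#!/bin/bash
# Build the same filenames the wrapper above would use,
# from three positional arguments.
echo "input:  inputFile-$1.root"
echo "config: config-$2.cfg"
echo "output: outputFile-$3.root"
```

Running this as `./args-demo.sh ttbar-semileptonic lvbb 1Lep-ttbar` prints the three filenames the wrapper would operate on.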

You could pass the necessary arguments via the submit script like so:

# Specify condor environment job should use
universe = vanilla
# Specify executable or wrapper script job will run
executable = wrapper.sh
# Pass Arguments to executable
arguments = ttbar-semileptonic lvbb 1Lep-ttbar
 
# Specify log, output, and error files location and names
# Any of the below could also be given a path to where the file should be created
log = job.log
output = job.out
error = job.error
 
# Request resources
request_cpus = 1
request_disk = 20MB
request_memory = 20MB
 
# Queue some number of jobs
queue 1

So, given the submit script above, the wrapper will copy inputFile-ttbar-semileptonic.root and config-lvbb.cfg from the specified paths to the worker node disk, then copy the resulting output file, named outputFile-1Lep-ttbar.root, to the specified path.

Submitting Multiple Jobs

If one wants to submit multiple jobs, there are several ways to modify the submit script to do so.

Queueing Identical Jobs

If one wants to submit multiple identical jobs, they can simply give an integer N to the queue statement at the end of the submit script. Ex: The following will submit 100 identical jobs if used in a submit script.

queue 100

The "in" Keyword

To create a job for each item in a list, the "in" keyword can be used.

# The variable "var" is just an example, the name could be different.
# To reference the variable in the rest of the submit script, use $(var).
# The elements of this list are also just examples.
queue var in (
item1
item2
item3
item4
)
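Putting this together, a submit script fragment (the variable and sample names here are only examples) could pass each list item to the wrapper as an argument:

```
executable = wrapper.sh
arguments = $(sample)
queue sample in (
ttbar
wjets
zjets
)
```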

The "matching" Keyword

Jobs can be created for each item matching an expression using the "matching" keyword. Ex: The queue statement below would create a job for each .root file in the directory from which "condor_submit" was called.

queue var matching *.root

You could also give a path to another directory.
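For example, assuming the input files live in a hypothetical /pathToInputs directory, one could queue one job per file and hand the matched filename to the wrapper:

```
executable = wrapper.sh
arguments = $(inputFile)
queue inputFile matching /pathToInputs/*.root
```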

The "from" Keyword

If more than one variable is required, the "from" keyword can be used. This takes a text file containing one comma-separated list of values per line. Ex: Given the text file "List.txt" below:

500, 0.25, 0.25
500, 0.25, 0.5
500, 0.25, 0.75
500, 0.5, 0.25
500, 0.5, 0.5
500, 0.5, 0.75
500, 0.75, 0.25
500, 0.75, 0.5
500, 0.75, 0.75
1000, 0.25, 0.25
...
1500, 0.25, 0.25
...

Here the first column represents the mass of a particle, and the second and third columns represent some couplings. One could create a job for each line by modifying the "queue" portion of the submit script like so:

queue mass,g1,g2 from List.txt
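A file like List.txt does not need to be written by hand. As a sketch, a short bash loop over the mass and coupling values shown above can generate it:

```shell
#!/bin/bash
# Write one "mass, g1, g2" line per parameter combination to List.txt.
for mass in 500 1000 1500; do
  for g1 in 0.25 0.5 0.75; do
    for g2 in 0.25 0.5 0.75; do
      echo "$mass, $g1, $g2"
    done
  done
done > List.txt
```

This produces 27 lines (3 masses x 3 x 3 coupling values), so "queue mass,g1,g2 from List.txt" would then create 27 jobs.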

HTCondor Script Variables

A variable inside a submit script can be set and referenced like so.

X = 3
Y = $(X)
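Such variables can cut down on repetition. For example (the directory path here is hypothetical), combined with HTCondor's built-in $(Process) macro, each queued job gets its own log, output, and error files:

```
LogDir = /pathToWorkDisk/logs
log = $(LogDir)/job_$(Process).log
output = $(LogDir)/job_$(Process).out
error = $(LogDir)/job_$(Process).error
queue 10
```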

Short, Medium, and Long Queues

MSU's tier3 is split up into three queues based on how long a job will take.
  • Short Queue: For jobs that will take about 3 hours or less to run. All job slots can run short jobs.
  • Medium Queue: For jobs that will take 3-48 hours to run. Almost all the job slots can run medium jobs.
  • Long Queue: For jobs that will take 2-7 days to run. A small fraction of the job slots can run long jobs.

By default a job is submitted to the short queue, but if a user knows their job will take more than three hours, they can submit it to the medium or long queue as follows. To submit a job to the medium queue, add the following line to your condor submit script before the queue command.

+IsMediumJob = true

To submit a job to the long queue, add the following line to your condor submit script before the queue command.

+IsLongJob = true

Changing a Job's Queue

If a user submits jobs to one of these queues and some or all of the jobs exceed the allotted time, they will be put on hold. Users can change which queue their jobs are in, in two steps: first the condor_qedit command, then condor_release.

To move a job from the short queue to the medium queue, use the following command.

condor_qedit <JobIdentifier> IsMediumJob true

Where <JobIdentifier> could be the process ID, the cluster ID, or the user's username. To move a job from the short queue to long queue, simply replace "IsMediumJob" with "IsLongJob" in the example above.

To move a job from the medium queue to the long queue, use the following command.

condor_qedit <JobIdentifier> IsMediumJob false IsLongJob true

Once you have changed the queue the jobs are in, release them with:

condor_release <JobIdentifier>

If a job exceeds its time limit and is put on hold, it will be restarted when released. If a user sees their job approaching the time limit and would like to avoid restarting the job, they can switch queues using the condor_qedit command before the job is held.

-- ForrestPhillips - 19 Sep 2017
Attachments:
  • DagMan_FHP_MSUATLAS_09-28-2017.pdf (767.6 K, 06 Nov 2017, ForrestPhillips): Talk on how to use DAGMan to manage complex job submissions.
  • FHP_MSU_Weekly_HTCondor.pdf (783.6 K, 19 Sep 2017, ForrestPhillips): Talk on using HTCondor to submit jobs.
Topic revision: r8 - 04 May 2018, ForrestPhillips