META TOPICPARENT name="CondorHowTo"

Using HTCondor

Concurrency Limits
Some of the resources on the tier3, namely the work disks (t3work1-9), can only support so many jobs before they get bogged down. Luckily, HTCondor has a built-in way of limiting the number of jobs that run on a particular resource; it only requires the user to declare which resource they are using and how much of that resource a single job consumes. This is done by setting concurrency limits in your submit script: add the following line to it before the "queue" command.
Concurrency_Limits = <resource1 name>:<units needed by job>

Things to note about concurrency limits:
  • If multiple users declare they are using the same resource, the total number of jobs that run in parallel is determined in aggregate, not per user.
  • To declare the use of multiple resources, use a comma separated list.
  • Using the wrong resource name does not cause an error.
  • Using a resource name does not actually claim anything physically; you could, for instance, limit the number of jobs you are running that use t3work9 by using concurrency limits meant for t3work5.
  • If you do not use concurrency limits, then your jobs will begin running on any available job slots and might overwhelm the resources they use, thereby slowing down those resources for everyone.

Work Disks

Nearly all of the work disks are set up to use concurrency limits. Each disk has an associated amount of i/o that is used to limit the number of jobs running on it. This value is somewhat arbitrary; it does not represent real units of a disk's i/o. The names of these resources and the units of i/o they are given are listed here (may change soon):
  • DISK_T3WORK1: 10000
  • DISK_T3WORK2: 10000
  • DISK_T3WORK3: 10000
  • DISK_T3WORK4: 10000
  • DISK_T3WORK5: 10000
  • DISK_T3WORK6: 10000
  • DISK_T3WORK7: 10000
  • DISK_T3WORK8: 10000
  • DISK_T3WORK9: 10000
  • DISK_CYNISCA: 10000
  • DISK_HOME: 10000
A fairly simple calculation can be done to figure out how many units of i/o a job takes up if you already know how many jobs you want running at one time (10000/<# of jobs>).
  For instance, if I want to use t3work3 and only have 200 jobs using it at once, then I would insert the following in my submit script:
# 10000/200 = 50, so I would put...
Concurrency_Limits = DISK_T3WORK3:50

User limits

 
  Most users do not use the work disk concurrency limits as intended (to limit the number of jobs running that use a particular disk). Instead, they simply need a way to limit the total number of their own jobs that are running (without the concurrency limits of different users interfering with each other).
If this is the only concurrency limit you'd like to set, add the following line to your submit script:

Concurrency_Limits = <username>:<1000/# of jobs>
# Example: this will limit the number of jobs I (forrestp) can run to 100, since 1000/100 = 10.
Concurrency_Limits = forrestp:10

Note that while the disk limits use a value of 10,000 to determine the number of jobs, the user limits use a value of 1,000.
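For instance, a quick sketch of capping yourself at 250 simultaneous jobs (sticking with the username forrestp from the examples here):

# 1000/250 = 4, so I would put...
Concurrency_Limits = forrestp:4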

User Defined Resources (W.I.P.)

Multiple Concurrency Limits

If you would like to submit jobs with limits on both the username and disk, or some user-defined resource, give the "Concurrency_Limits" variable in your submit script a comma separated list.

# Example: I (forrestp) want to limit the number of jobs I can run to 100 (1000/100 = 10), but also want to limit the number of jobs running on t3work4 to 50 (10000/50 = 200).
Concurrency_Limits = forrestp:10, DISK_T3WORK4:200

An example of where this is useful is when you've split the datasets you run over across multiple disks. Say you've split a dataset across t3work3, t3work4, and t3work5. If you'd like to limit the total number of jobs you can run at once to 100, but limit the number of jobs that use a particular disk to 50, there are two approaches you can take: three different submit scripts (one for each disk), or one submit script that changes the relevant variables (a bit more complicated).

Example 1 (three submit scripts)

In the submit script for jobs that use the dataset on t3work3, insert:
Concurrency_Limits = forrestp:10, DISK_T3WORK3:200

In the submit script for jobs that use the dataset on t3work4, insert:
Concurrency_Limits = forrestp:10, DISK_T3WORK4:200

In the submit script for jobs that use the dataset on t3work5, insert:
Concurrency_Limits = forrestp:10, DISK_T3WORK5:200

Example 2 (one submit script that changes the appropriate variables)

Warning: this is a simple example, and more or different changes may be needed for your submit script depending on its setup; please use this only as a reference for how you might change it.
universe = vanilla
# This wrapper accepts the location of the dataset as an argument.
executable = wrapper.sh

log = job.log
out = job-$(dataset).out
err = job-$(dataset).err

request_cpus = 1
request_disk = 1GB
request_memory = 1GB

# The dataset macro is referenced by the out/err file names above; set it before each queue.
dataset = t3work3
arguments = /msu/data/t3work3/restOfPathToDataset
Concurrency_Limits = forrestp:10, DISK_T3WORK3:200
queue

dataset = t3work4
arguments = /msu/data/t3work4/restOfPathToDataset
Concurrency_Limits = forrestp:10, DISK_T3WORK4:200
queue

dataset = t3work5
arguments = /msu/data/t3work5/restOfPathToDataset
Concurrency_Limits = forrestp:10, DISK_T3WORK5:200
queue
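For reference, a minimal sketch of what wrapper.sh might look like, assuming the job copies its inputs to the worker node's $TMPDIR to spare the work disk i/o; the code path, executable name, and output names (/pathToCode, myAnalysis, outputFile.root, /pathToOutput) are placeholders, not real paths:

#!/bin/bash
# The submit script passes the dataset location as the first argument.
datasetPath=$1

# Copy the code and the dataset to the worker node's local disk, then work there.
cp -r /pathToCode/code $TMPDIR/
cp -r $datasetPath $TMPDIR/
cd $TMPDIR

# Run the code over the local copy of the dataset.
./code/myAnalysis $(basename $datasetPath)

# Copy the results back, then clean up the worker node disk.
cp outputFile.root /pathToOutput/
rm -fr ./*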
 

Short, Medium, and Long Queues

  MSU's tier3 is split up into three queues based on how long a job will take.
  • Short Queue: For jobs that will take about 3 hours or less to run. All job slots can run short jobs.
  • Medium Queue: For jobs that will take 3-48 hours to run. Almost all the job slots can run medium jobs.
  • Long Queue: For jobs that will take 2-7 days to run. A small fraction of the job slots can run long jobs.

By default a job is submitted to the short queue, but if a user knows their job will take more than three hours they can submit it to the medium or long queue by doing the following. To submit a job to the medium queue, add the following line to your condor submit script before the queue command.
+IsMediumJob = true
To submit a job to the long queue, add the following line to your condor submit script before the queue command.
+IsLongJob = true

Changing a Job's Queue

If a user submits jobs to one of these queues and some or all of the jobs exceed the allotted time, they will be put on hold. Users can change which queue their jobs are in with two steps: first the condor_qedit command, then condor_release.

To move a job from the short queue to the medium queue, use the following command.
condor_qedit <JobIdentifier> IsMediumJob true
Where <JobIdentifier> could be the process ID, the cluster ID, or the user's username. To move a job from the short queue to the long queue, simply replace "IsMediumJob" with "IsLongJob" in the example above.

To move a job from the medium queue to the long queue, use the following command.
condor_qedit <JobIdentifier> IsMediumJob false IsLongJob true

Once you have changed the queue the jobs are in, release them with:
condor_release <JobIdentifier>

If a job exceeds its time limit and is put on hold, it will be restarted when released. If a user sees their job is approaching the time limit and would like to avoid restarting the job, they can switch queues using the condor_qedit command before the job is held.

The Bypass Queue

The bypass queue bypasses the time limit and node restrictions of the short, medium, and long queues. Using it is discouraged unless absolutely necessary.

In order to use the bypass queue, you must meet the following conditions:
  • You're crunched on time.
  • You need unlimited resources (no time limits and as many nodes as possible).
  • All T3 users agreed you can use it.

To use the bypass queue, insert the following line in your condor submit script.
+IsBypassJob = True

If you need to move already submitted jobs to the bypass queue, see the section "Changing a Job's Queue".

Checking the Status of Jobs & Job Slots

After you've submitted a job (or after it's finished running), you will probably want to check up on it from time to time or view its ClassAds. There are two commands for doing this: one for jobs currently in the queue (condor_q) and one for jobs that are no longer in the queue (condor_history).
In addition, you may want to check how many job slots there are on the tier3, check how many of those are available, or view a slot's ClassAds.

Jobs Currently in the Queue VIDEO

To view the job queue on the login/submit node you are currently on, use the command:
condor_q

If you'd like to view only your jobs, use the command:
condor_q <username>

To view the job queues of every login/submit node at once, use the command:
condor_q -global

To view the ClassAds of a particular job, use the command:
condor_q -l <JobID>

To view a particular ClassAd, use the command:
condor_q -af <ClassAd>
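For example, to print just the owner and status of every job in the queue (Owner and JobStatus are standard job ClassAd names):
condor_q -af Owner JobStatus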
 

Jobs No Longer in the Queue (Video in job editing section)

If you'd like to view information about jobs that used to be on the queue, use the command:
condor_history
  All of the options available in the condor_q examples work for this command as well.
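For example, to show only the last 10 of your finished jobs (assuming the username forrestp, as elsewhere on this page):
condor_history forrestp -limit 10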

Job Slots VIDEO

To view the status of the job slots, use the command:
condor_status

To view the ClassAds of a particular job slot, use the command:
condor_status -l <JobSlot>
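Two related options worth knowing: to list only the slots currently available to run jobs, use:
condor_status -avail
And to print summary totals instead of the full slot list, use:
condor_status -total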
 


 

Editing Jobs in the Queue VIDEO

If you'd like to edit the ClassAds of a job that has already been submitted, you can do so by holding it, editing it, and then releasing it.
To hold a job, use the following command:
condor_hold <JobID, ClusterID, or username>

To edit a job, use the following command:
condor_qedit <JobID, ClusterID, or username> <ClassAd> <new value>

Once a job is edited, release it using the command:
condor_release <JobID, ClusterID, or username>
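As a concrete sketch, with a hypothetical job ID and a new memory request of 2048 MB (RequestMemory is a standard job ClassAd):
condor_hold 1234.0
condor_qedit 1234.0 RequestMemory 2048
condor_release 1234.0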
 


Removing Jobs from the Queue VIDEO

To remove a job (or set of jobs) from the queue, use the command:
condor_rm <JobID, ClusterID, or username>
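For example, with a hypothetical cluster ID:
condor_rm 1234
removes every job in cluster 1234, while
condor_rm 1234.0
removes only that single job.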
 


 

DagMan

If you have a job workflow that involves submitting some number of jobs and then running more jobs that use the previous jobs' results as inputs, you may want to consider using DagMan.
DagMan is a tool built into HTCondor that was made to handle complex job workflows. The documentation for it can be found here, but there is also a presentation on how to use it in the attachments of this page.
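As a minimal sketch of the idea: a DAG file names each node's submit script and the dependencies between nodes, and is submitted with condor_submit_dag (the .dag and .sub file names here are hypothetical).
# two-step.dag
# Node B runs only after node A completes successfully.
JOB A produce.sub
JOB B analyze.sub
PARENT A CHILD B
To submit it:
condor_submit_dag two-step.dag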
 

META FILEATTACHMENT attachment="FHP_MSU_Weekly_HTCondor.pdf" attr="" comment="Talk on using HTCondor to submit jobs." date="1505851326" name="FHP_MSU_Weekly_HTCondor.pdf" path="FHP_MSU_Weekly_HTCondor.pdf" size="802452" user="ForrestPhillips" version="1"
META FILEATTACHMENT attachment="DagMan_FHP_MSUATLAS_09-28-2017.pdf" attr="" comment="Talk on how to use DAGMan to manage complex job submissions." date="1509993181" name="DagMan_FHP_MSUATLAS_09-28-2017.pdf" path="DagMan_FHP_MSUATLAS_09-28-2017.pdf" size="786036" user="ForrestPhillips" version="1"