HTCondor
What is HTCondor?
HTCondor is a high-throughput computing system that can run many related tasks simultaneously. Most commonly, this means splitting a job that runs over N events into M jobs that each run over N/M events. That is only a simple example of what HTCondor can do.
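The N-events-into-M-jobs split described above can be sketched in a few lines of Python. This is only an illustration, not part of HTCondor itself: the function name `split_events` and the choice to return a starting offset and a count for each job are assumptions made for this example. It also handles the case where N is not evenly divisible by M by giving the first few jobs one extra event.

```python
def split_events(n_events, n_jobs):
    """Split n_events into n_jobs contiguous chunks.

    Returns a list of (first_event, event_count) tuples, one per job.
    Hypothetical helper for illustration; not an HTCondor API.
    """
    if n_events < 0:
        raise ValueError("n_events must not be negative")
    if n_jobs <= 0:
        raise ValueError("n_jobs must be positive")

    # Every job gets `base` events; the first `extra` jobs get one more.
    base, extra = divmod(n_events, n_jobs)

    chunks = []
    first = 0
    for job in range(n_jobs):
        count = base + (1 if job < extra else 0)
        chunks.append((first, count))
        first += count
    return chunks


if __name__ == "__main__":
    # Example: 10 events split across 3 jobs.
    for job, (first, count) in enumerate(split_events(10, 3)):
        print(f"job {job}: start at event {first}, process {count} events")
```

Each job would then be told its own starting event and event count, for example through command-line arguments. In an HTCondor submit description file, the index of each queued job is available as `$(Process)`, which is one way a job could look up its chunk.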
In the case of MSU's tier3, the HTCondor setup is made up of several interacting systems:
- The login nodes (green, maron, and white at time of writing): These are used to submit jobs, run small scripts, and manage files. Note: They should not be used for resource-intensive computing.
- Home directories: Technically part of the login nodes? This is where a user's home folder is located. These are backed up and should be used for things like code, plots, theses, etc.
- The work disks (t3work1-9 for example): These are file servers with large amounts of disk space (many TBs), used for storing large datasets. They use RAID 6(?) to protect against failed disks.
- The worker nodes/job slots: At the time of writing, there are about 50 machines that are used for running jobs. Each of these has 8, 12, or 24 cores. Most cores run one job each, while others are hyper-threaded and can run more than one job. In total, there are about 450 job slots at the time of writing.
--
ForrestPhillips - 14 Sep 2017