Why hasn’t my (Slurm) job started?
A job can be blocked from being scheduled for the following reasons:
There are insufficient resources available to start the job, either due to active reservations, other running jobs, component status, or system/partition size.
Other higher-priority jobs are waiting to run, and the job’s time limit prevents it from being backfilled.
The job’s time limit overlaps an upcoming reservation (e.g., scheduled preventative maintenance).
The job is associated with an account that has reached or exceeded its GrpCPUMins.
Use squeue to display the queued jobs, sorted in the order the scheduler considers them:
squeue --sort=-p,i --priority --format '%7T %7A %10a %5D %.12L %10P %10S %20r'
Reason codes
A list of reason codes [1] is available as part of the squeue manpage [2].
Common reason codes:
ReqNodeNotAvail
AssocGrpJobsLimit
AssocGrpCPUMinsLimit
Resources
QOSResourceLimit
Priority
AssociationJobLimit
JobHeldAdmin
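To see which reason codes dominate the queue, the reason column of squeue output can be tallied. The following is a minimal sketch, not a site-provided tool: the function name tally_reasons and the sample text are hypothetical, and it assumes the input was produced with something like squeue --noheader --states=PENDING --format '%r', so that each line holds one reason.

```python
from collections import Counter


def tally_reasons(squeue_output: str) -> Counter:
    """Count how many jobs report each reason code (one reason per line)."""
    reasons = (line.strip() for line in squeue_output.splitlines())
    return Counter(reason for reason in reasons if reason)


# Hypothetical sample text; on a cluster this would be captured from, e.g.:
#   squeue --noheader --states=PENDING --format '%r'
sample = """\
Priority
Resources
Priority
AssocGrpCPUMinsLimit
"""

# Print the reasons from most to least frequent.
for reason, count in tally_reasons(sample).most_common():
    print(f"{count:4d}  {reason}")
```

Some Slurm versions append extra detail to a reason (for example, a list of unavailable nodes), so real output may need additional cleanup before counting.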
How are jobs prioritized?
PriorityType=priority/multifactor
Slurm prioritizes jobs using the multifactor plugin [3] based on a weighted summation of age, size, QOS, and fair-share factors.
Use the sprio command to inspect each weighted priority value separately.
sprio [-j jobid]
Age Factor
PriorityWeightAge=1000 PriorityMaxAge=14-0
The age factor represents the length of time a job has been sitting in the queue and eligible to run. In general, the longer a job waits in the queue, the larger its age factor grows. However, the age factor for a dependent job will not change while it waits for the job it depends on to complete. Also, the age factor will not change when scheduling is withheld for a job whose node or time limits exceed the cluster’s current limits.
The weighted age priority is calculated as PriorityWeightAge[1000]*[0..1], where the age factor reaches 1 when the job age reaches PriorityMaxAge[14-0], or 14 days. As such, an hour of wait time is equivalent to ~2.976 priority points (1000 / 336 hours).
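The arithmetic behind the per-hour figure can be sketched as follows. This is an illustration only: it assumes the age factor grows linearly from 0 to 1 over PriorityMaxAge, and the function name weighted_age_priority is hypothetical rather than part of Slurm.

```python
PRIORITY_WEIGHT_AGE = 1000          # PriorityWeightAge from the configuration above
PRIORITY_MAX_AGE_HOURS = 14 * 24    # PriorityMaxAge=14-0, i.e. 14 days = 336 hours


def weighted_age_priority(hours_eligible: float) -> float:
    """Weighted age priority, assuming a linear age factor capped at 1.0."""
    age_factor = min(max(hours_eligible, 0.0) / PRIORITY_MAX_AGE_HOURS, 1.0)
    return PRIORITY_WEIGHT_AGE * age_factor


print(round(weighted_age_priority(1), 3))   # one hour eligible  -> 2.976
print(weighted_age_priority(7 * 24))        # one week eligible  -> 500.0
print(weighted_age_priority(30 * 24))       # capped at 14 days  -> 1000.0
```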
Job Size Factor
PriorityWeightJobSize=2000
The job size factor corresponds to the number of nodes or CPUs the job has requested. The weighted job size priority is calculated as PriorityWeightJobSize[2000]*[0..1], where the factor approaches 1 as the job size approaches the entire size of the system. A job that requests all the nodes on the machine gets a job size factor of 1.0 and a weighted job size priority of 2000, equivalent to 28 days of waiting (2000 / 1000 * 14 days), although the job age priority itself is capped at 14 days.
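The wait-days comparison can be reproduced with a short calculation. This is a sketch under stated assumptions: the job size factor is taken to be the plain fraction of nodes requested (Slurm's actual calculation may also weigh CPUs and other settings), TOTAL_NODES is a hypothetical cluster size, and both function names are illustrative.

```python
PRIORITY_WEIGHT_AGE = 1000        # PriorityWeightAge
PRIORITY_MAX_AGE_DAYS = 14        # PriorityMaxAge=14-0
PRIORITY_WEIGHT_JOB_SIZE = 2000   # PriorityWeightJobSize
TOTAL_NODES = 100                 # hypothetical cluster size, for illustration only


def weighted_job_size_priority(nodes_requested: int, total_nodes: int = TOTAL_NODES) -> float:
    """Weighted job size priority, assuming the factor is the node fraction."""
    size_factor = min(nodes_requested / total_nodes, 1.0)
    return PRIORITY_WEIGHT_JOB_SIZE * size_factor


def wait_days_equivalent(priority: float) -> float:
    """Days of queue wait that would earn the same priority from the age factor."""
    return priority * PRIORITY_MAX_AGE_DAYS / PRIORITY_WEIGHT_AGE


full_machine = weighted_job_size_priority(TOTAL_NODES)
print(full_machine, wait_days_equivalent(full_machine))    # 2000.0 28.0

half_machine = weighted_job_size_priority(TOTAL_NODES // 2)
print(half_machine, wait_days_equivalent(half_machine))    # 1000.0 14.0
```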
Quality of Service (QOS) Factor
PriorityWeightQOS=1500
Each QOS can be assigned a priority: the larger the number, the higher the priority of jobs that request this QOS. This value is then normalized to the highest priority among all QOSs to become the QOS factor. As such, the weighted QOS priority is calculated as PriorityWeightQOS[1500]*QOSPriority[0..1000]/MAX(QOSPriority)[1000].
QOS          Priority  Weighted priority  Wait-days equivalent
-----------  --------  -----------------  --------------------
admin            1000               1500                  21.0
janus               0                  0                   0.0
janus-debug       400                600                   8.4
janus-long        200                300                   4.2
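The table values follow directly from the formula and the configured weights. The sketch below recomputes them; the QOS names and priorities are taken from the table, while the function names and the dictionary QOS_PRIORITY are illustrative and not part of Slurm.

```python
PRIORITY_WEIGHT_QOS = 1500     # PriorityWeightQOS
PRIORITY_WEIGHT_AGE = 1000     # PriorityWeightAge
PRIORITY_MAX_AGE_DAYS = 14     # PriorityMaxAge=14-0

# QOS priorities as listed in the table above.
QOS_PRIORITY = {"admin": 1000, "janus": 0, "janus-debug": 400, "janus-long": 200}


def weighted_qos_priority(qos: str) -> float:
    """PriorityWeightQOS * QOSPriority / MAX(QOSPriority)."""
    return PRIORITY_WEIGHT_QOS * QOS_PRIORITY[qos] / max(QOS_PRIORITY.values())


def wait_days_equivalent(priority: float) -> float:
    """Days of queue wait that would earn the same priority from the age factor."""
    return priority * PRIORITY_MAX_AGE_DAYS / PRIORITY_WEIGHT_AGE


for qos, raw in QOS_PRIORITY.items():
    weighted = weighted_qos_priority(qos)
    days = wait_days_equivalent(weighted)
    print(f"{qos:<11}  {raw:>8}  {weighted:>17.0f}  {days:>20.1f}")
```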