Slurm Overview
The CRC clusters use Slurm for batch job queuing.
The sinfo command provides an overview of the state of the nodes within the cluster.
[user@login0 ~ ]$ sinfo -M smp,gpu,mpi
CLUSTER: gpu
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gtx1080* up infinite 5 mix gpu-stage[08-12]
gtx1080* up infinite 13 idle gpu-n[16-25],gpu-stage[13-15]
titanx up infinite 1 mix gpu-stage01
titanx up infinite 6 idle gpu-stage[02-07]
k40 up infinite 1 idle smpgpu-n0
titan up infinite 5 idle legacy-n[126,128-131]
CLUSTER: mpi
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
opa* up infinite 3 mix opa-n[80,89-90]
opa* up infinite 47 alloc opa-n[0-9,12,15-23,32-37,39-45,60-64,72-75,81-84,86]
opa* up infinite 46 idle opa-n[10-11,13-14,24-31,38,46-59,65-71,76-79,85,87-88,91-95]
legacy up infinite 20 idle legacy-n[0-19]
CLUSTER: smp
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
smp* up infinite 6 mix smp-n[42,56-58,63,65]
smp* up infinite 3 alloc smp-n[44,46,62]
smp* up infinite 91 idle smp-n[24-41,43,45,47-55,59-61,64,66-123]
high-mem up infinite 29 idle smp-256-n[1-2],smp-512-n[1-2],smp-n[0-23],smp-nvme-n1
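For a quick per-state tally, the NODES column of the sinfo output can be summed by STATE. The sketch below assumes the default column layout shown above (NODES in the fourth column, STATE in the fifth). count_nodes_by_state is a hypothetical helper name, not a Slurm command.

```shell
# Sum the NODES column per STATE from sinfo-style text on stdin.
# Assumes the default sinfo columns:
#   PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
count_nodes_by_state() {
    awk '
        # Skip the per-cluster banner and the column header.
        $1 == "CLUSTER:" || $1 == "PARTITION" { next }
        # $4 is the node count, $5 is the node state.
        NF >= 6 { nodes[$5] += $4 }
        END { for (s in nodes) print s, nodes[s] }
    ' | sort
}

# Typical use on a login node (requires Slurm):
#   sinfo -M smp,gpu,mpi | count_nodes_by_state
```

Fed only the smp section of the listing above, this prints alloc 3, idle 120 and mix 6, one state per line.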
Note
The -M flag for sinfo, scontrol, sbatch and scancel specifies one or more clusters for the command to act on. By default (without this flag), all commands refer to the primary cluster configured on the login node you are on (the SMP cluster on h2p.crc.pitt.edu, and the HTC cluster on htc.crc.pitt.edu).
Nodes in the alloc state are fully allocated to running jobs. Nodes in the mix state have some of their resources allocated and the rest free.
The asterisk next to the partition means that it is the default partition for all jobs.
squeue -M <smp,mpi,gpu,htc> shows the list of running and queued jobs.
The most common states for jobs in squeue are described below. See the output of man squeue or this page for more details.
Abbreviation | State | Description |
---|---|---|
CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
CD | COMPLETED | Job has terminated all processes on all nodes. |
CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
F | FAILED | Job terminated with non-zero exit code or other failure condition. |
PD | PENDING | Job is awaiting resource allocation. |
R | RUNNING | Job currently has an allocation. |
TO | TIMEOUT | Job terminated upon reaching its time limit. |
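When post-processing squeue output in a script, it can help to expand the two-letter codes back into the full state names. The following is a minimal sketch covering only the states in the table above. job_state_name is a hypothetical helper, not part of Slurm.

```shell
# Expand a squeue state abbreviation to its full state name.
# Only the states listed in the table are covered; anything else
# prints UNKNOWN and returns a non-zero status.
job_state_name() {
    case "$1" in
        CA) echo "CANCELLED" ;;
        CD) echo "COMPLETED" ;;
        CG) echo "COMPLETING" ;;
        F)  echo "FAILED" ;;
        PD) echo "PENDING" ;;
        R)  echo "RUNNING" ;;
        TO) echo "TIMEOUT" ;;
        *)  echo "UNKNOWN"; return 1 ;;
    esac
}

# Example:
#   job_state_name PD
```

The same abbreviations can also be passed to the squeue -t (--states) option to filter the listing, for example squeue -M smp -u "$USER" -t PD to show only your pending jobs on the smp cluster.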
To see when pending jobs are expected to start, run squeue --start. See man squeue for a complete description of the possible REASONS for pending jobs.
Note
Not all jobs have a definite start time.
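One way to pick out the jobs that have no definite start time yet is to ask squeue for just the job ID and start time, then keep the rows whose start time is not a timestamp. The sketch below assumes Slurm's usual YYYY-MM-DDTHH:MM:SS timestamp format and treats anything else in the second column as "no estimate". jobs_without_start is a hypothetical helper name.

```shell
# Read "<jobid> <start_time>" pairs on stdin and print the job IDs
# whose start time is not a YYYY-MM-DDT... timestamp.
jobs_without_start() {
    awk '
        # Ignore blank lines and any per-cluster banner line.
        NF < 2 || $1 == "CLUSTER:" { next }
        $2 !~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T/ { print $1 }
    '
}

# Typical use (requires Slurm):
#   -h drops the header, %i is the job ID, %S is the start time.
#   squeue -M smp -u "$USER" --start -h -o "%i %S" | jobs_without_start
```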
The scontrol command shows detailed information about pending or allocated jobs.
[user@login0 ~ ]$ scontrol -M <cluster> show job <jobid>
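The scontrol show job output is a series of Key=Value pairs (for example JobState=... and Reason=...), so a single field can be pulled out with a small filter. This is a sketch that assumes the value contains no spaces, which holds for fields such as JobState and Reason but not necessarily for free-text ones. job_field is a hypothetical helper name.

```shell
# Print the value of one Key=Value field from scontrol-style text on stdin.
# Usage: job_field <Key>
# Only the first match is printed; values containing spaces are not handled.
job_field() {
    awk -v key="$1" '
        {
            for (i = 1; i <= NF; i++) {
                # Match tokens that start with "<key>=".
                if (index($i, key "=") == 1) {
                    print substr($i, length(key) + 2)
                    exit
                }
            }
        }
    '
}

# Typical use (requires Slurm; <cluster> and <jobid> are placeholders):
#   scontrol -M <cluster> show job <jobid> | job_field JobState
```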