Job submission
Computing nodes are shared between all the users of the computing center. A job scheduler manages access to these shared resources and allows users to book the resources they need for their computations. Job submission, resource allocation and job launch are handled by the batch scheduler, called SLURM. An abstraction layer called Bridge is provided to give uniform command-line tools and more integrated ways to access batch scheduling systems.
To submit a batch job, you first have to write a shell script containing:
- a set of directives to tell which resources your job needs.
- instructions to execute your code.
You can launch the job by submitting its script to the batch scheduler. It will enter a batch queue and, when resources become available, the job is launched over its allocated nodes. Jobs can be monitored (see Job monitoring and control).
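For example, a minimal workflow could look like this (a sketch; the partition and project names are placeholders to adapt, and ./hello stands for your own executable):
$ cat hello_job.sh
#!/bin/bash
#MSUB -r Hello          # Job name
#MSUB -n 4              # Number of tasks
#MSUB -q <partition>    # Partition name (see ccc_mpinfo)
#MSUB -A <project>      # Project ID
ccc_mprun ./hello       # Launch the parallel executable
$ ccc_msub hello_job.sh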
Scheduling policy
Jobs are submitted to the supercomputer resource manager which will decide how to schedule them. There are hundreds of jobs submitted daily and the order in which they are launched depends on several parameters.
Jobs are sorted by priority. The priority is an integer value initially assigned to the job when it is submitted. The job with the highest priority will start first as soon as enough resources are available. Sometimes, a job with a lower priority may start first but only if it does not delay any job with a higher priority. This mechanism is called backfill scheduling and jobs can easily be backfilled if they require few computing resources and have a small time limit.
The priority is the sum of three components, listed below in order of importance:
- The Quality of Service (QoS): The QoS component is a constant value associated with the QoS chosen during the submission. Its value depends on the priority factor given by the ccc_mqinfo command. This is the component that has the most influence on the overall value of the priority. In addition, a usage limit can be set to prevent jobs from running while the project's consumption exceeds a given threshold. See QoS for more information.
- The project’s fair-share: The fair-share component reflects the ratio between the total share of allocated hours and the number of hours consumed for the chosen project. The fair-share value will be high if the project is under-consuming its allocated hours and low if it is over-consuming them. A half-life decay with a 14-day half-life period is applied to the fair-share computation, so an over-consumption or under-consumption of hours has a decreasing impact on new job submissions. After 2 months, the negative impact of an over-consumption is almost insignificant.
- The job’s age: The age component depends on the time spent in a pending state while waiting for resources. Its value is incremented regularly for 7 days. After this delay, it will not increase anymore.
In order to reduce the waiting time as much as possible, try to:
- Use your computing hours evenly throughout your project duration. It will increase your fair-share value and thus the priority of your jobs.
- Specify a time limit as close as possible to the real duration of your job instead of keeping the default time limit of 2 hours. That way you are more likely to benefit from the backfill scheduling mechanism.
Choosing the file systems
Your job submissions can use the -m option to specify the file systems the compute nodes will need for execution. This avoids job suspensions if an unused file system becomes unavailable.
Example: ccc_msub -m scratch,store will run even if the WORK file system is unavailable. You can choose among the following file systems: scratch, work and store.
Warning
On Irene, your job submissions MUST use the -m option. Failing to provide the -m option will prevent your compute nodes from using any of scratch, work or store.
Submission scripts
ccc_msub
ccc_msub <script> is the command used to submit your batch job to the resource manager. The options passed to ccc_msub will determine the number of cores allocated to the job.
There are two ways of passing arguments to ccc_msub.
- Use the #MSUB keyword in the submission script. All the lines beginning with #MSUB will be parsed and the corresponding parameters will be taken into account by the batch manager.
- Use command line arguments. If the same argument is specified both on the command line and in the submission script, the command line argument takes precedence.
Note that you cannot pass arguments to the submission script when launching with ccc_msub.
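For example (a minimal sketch; job.sh, the resource values, partition and project are illustrative), the two mechanisms combine as follows, with the command line winning on conflicts:
$ cat job.sh
#!/bin/bash
#MSUB -r MyJob          # Job name
#MSUB -n 16             # Number of tasks requested in the script
#MSUB -q <partition>    # Partition name
#MSUB -A <project>      # Project ID
ccc_mprun ./a.out
$ ccc_msub -n 32 job.sh # The job gets 32 tasks: the command line overrides #MSUB -n 16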
For a short documentation and list of available options, see the help message:
$ ccc_msub -h
Basic options
- -o output_file: standard output file (the special character %I will be replaced by the job ID)
- -e error_file: standard error file (the special character %I will be replaced by the job ID)
All the output and error messages generated by the job are redirected to the log files specified with -o and -e. If you want all output to be redirected to one single file, just set the same name for -o and -e.
- -r reqname: job name
- -A projid: project/account name
- -m filesystem: file system required by the job. If another file system is unavailable, the job will not be impacted. Without this option, the job is considered to use every file system. Example: ccc_msub -m scratch,store will run even if the WORK file system is unavailable. You can choose among the following file systems: scratch, work and store.
- -@ mailopts: mail options following the pattern mailaddr[:begin|end|requeue]. Example: ccc_msub -@ jdoe@foo.com:begin,end will send a mail to jdoe at the beginning and the end of the job. The default behavior depends on the underlying batch system.
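As an illustration (a sketch; the mail address and project are placeholders), these options are typically combined in a script header:
#!/bin/bash
#MSUB -r MyRun                   # Job name
#MSUB -A <project>               # Project/account name
#MSUB -o MyRun_%I.log            # Standard output (%I is replaced by the job ID)
#MSUB -e MyRun_%I.log            # Same file as -o: all output goes to a single log
#MSUB -@ jdoe@foo.com:begin,end  # Mail at the beginning and the end of the job
ccc_mprun ./a.out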
Partitions
The compute nodes are gathered in partitions according to their hardware characteristics (CPU architecture, amount of RAM, presence of GPU, etc.). A partition is a set of identical nodes that can be targeted to host one or several jobs. Choosing the right partition for a job depends on the code prerequisites in terms of hardware resources. For example, executing a code designed to be GPU accelerated requires a partition with GPU nodes.
The partition is specified with the -q option. This option is mandatory when submitting.
The ccc_mpinfo command lists the available partitions for the current supercomputer. For each partition, it will display the number of available cores and nodes but also give some hardware specifications of the nodes composing the partition. For more information, see the Supercomputer architecture section.
Note
- This choice is exclusive: your job can only be submitted on one of those architectures at a time.
- Job priority and limits are not related to the partition but to the QoS (Quality of Service) parameter (option -Q). QoS is the main component of job priority.
- Depending on the allocation granted to your project, you may not have access to all the partitions. You can check on which partition(s) your project has allocated hours with the ccc_myproject command.
Resources
Several options influence the number of cores that will be allocated for the job.
- -n: number of tasks that will be used in parallel mode (default=1)
- -c: number of cores per parallel task to allocate (default=1)
If the -c parameter is not specified, the number of cores allocated is equal to the number of parallel tasks requested with -n, so specifying -n is enough for most basic jobs. The -c option is useful when each MPI process needs more resources than just one core, either for more CPU power or for more memory:
- For hybrid MPI/OpenMP jobs, each MPI process needs several cores in order to spawn its OpenMP threads. In that case, -c is usually set to the value passed to OMP_NUM_THREADS.
- For jobs requiring a lot of memory, each MPI process may need more than the 4 GB it is granted by default. In that case, the amount of memory can be multiplied by allocating more cores to each process. For example, with -c 2, each process can use the memory of 2 cores: 8 GB.
- -N: number of nodes to allocate for parallel usage (default is chosen by the underlying system)
The -N option is not necessary in most cases: the number of nodes is inferred from the number of cores requested and the number of cores per node of the chosen partition.
- -x: request exclusive usage of the allocated nodes
The -x option forces the allocation of whole nodes, even if only a couple of cores were requested. This is the default configuration for jobs requiring more than 128 cores.
- -T time_limit: maximum walltime of the batch job in seconds (optional directive; if not provided, set to the default of 7200)
It may be useful to set an accurate walltime in order to benefit from backfill scheduling. A request combining these options is sketched below.
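For instance, this sketch (placeholder partition, project and binary) requests 16 MPI tasks with 2 cores each, i.e. 32 cores in total, so that each process can use about 8 GB of memory:
#!/bin/bash
#MSUB -r MyJob_BigMem   # Job name
#MSUB -n 16             # Number of MPI tasks
#MSUB -c 2              # 2 cores per task: twice the default memory per process
#MSUB -T 3600           # Walltime in seconds
#MSUB -q <partition>    # Partition name
#MSUB -A <project>      # Project ID
ccc_mprun ./a.out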
Skylake partition
- To submit a job on a Skylake node, it’s advised to recompile your code to use the vectorisation instructions supported by Skylake, as shown in the Compiling for Skylake section. Apart from the partition used, your submission script shouldn’t change much:
#!/bin/bash
#MSUB -q skylake
#MSUB -n 40
ccc_mprun ./my-application
QoS
One can specify a Quality of Service (QoS) for each job submitted to the scheduler. The QoS associated with a job affects it in two ways: scheduling priority and limits. Depending on the required quantity of resources, on the duration or on the job purpose (debugging, normal production, etc.), you have to select the appropriate job QoS. It lets you trade a high job submission limit for a lower job priority, or a high job priority for fewer resources and a shorter duration.
ccc_mqinfo displays the available job QoS and the associated limitations:
$ ccc_mqinfo
Name Partition Priority MaxCPUs SumCPUs MaxNodes MaxRun MaxSub MaxTime
------- --------- -------- ------- ------- -------- ------ ------ ----------
long * 20 2048 2048 32 3-00:00:00
normal * 20 128 1-00:00:00
test * 40 2 2 00:30:00
test partition 40 280 280 10 2 00:30:00
For instance, to develop or debug your code, you may submit a job using the test QoS which will allow it to be scheduled faster. This QoS is limited to 2 jobs of maximum 30 minutes and 10 nodes each. CPU time accounting is not dependent on the chosen QoS.
To specify a QoS, you can use the -Q command line option or add the #MSUB -Q <qosname> directive to your submission script. If no QoS is specified, the default QoS normal is used.
A usage limit can be set to manage the consumption of hours within a project. The usage limit can be set to high, medium or low with the #MSUB -U <limit> option.
- A job submitted with -U high will be scheduled normally as described in Scheduling policy. This is the default behavior if no usage limit is specified.
- A job submitted with -U medium will only be scheduled if the current project consumption is less than 20% over the suggested use at this time.
- A job submitted with -U low will only be scheduled if the current project consumption is more than 20% under the suggested use at this time.
Medium and low priority jobs may stay pending even if all the resources are free. This prevents non-essential jobs from over-consuming project hours, which would lower the default priority of future important jobs.
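For instance, a short debugging run could combine the test QoS with a usage limit (a sketch; partition and project are placeholders):
#!/bin/bash
#MSUB -r MyDebugJob     # Job name
#MSUB -n 16             # Number of tasks
#MSUB -T 1200           # Walltime below the 30-minute limit of the test QoS
#MSUB -q <partition>    # Partition name
#MSUB -A <project>      # Project ID
#MSUB -Q test           # test QoS: scheduled faster, but tight limits
#MSUB -U medium         # Run only if consumption is less than 20% over the suggested use
ccc_mprun ./a.out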
Dependencies between jobs
Note
The following methods are equivalent concerning the priorities of dependent jobs. See the Scheduling policy section for more details.
The ccc_msub command offers two options to prevent some jobs from running at the same time: -w and -a.
- The option -w makes jobs with the same submission name (specified with -r) run in the order they were submitted. For example, all jobs submitted with the following options will never run at the same time:
#MSUB -r SAME_NAME
#MSUB -w
- The option -a indicates that a job depends on another (already submitted) job. Use it with the id of the existing job:
ccc_msub -a <existing_job> script.sh
- One can write multi-step jobs that run sequentially by submitting the next job at the end of the current script. Here are some example scripts:
JOB_A.sh:
#!/bin/bash
#MSUB -r JOB_A
#MSUB -n 32
#MSUB -q <partition>
#MSUB -A <project>
ccc_mprun ./a.out
ccc_msub JOB_B.sh
JOB_B.sh:
#!/bin/bash
#MSUB -r JOB_B
#MSUB -n 16
#MSUB -q <partition>
#MSUB -A <project>
ccc_mprun ./b.out
ccc_msub JOB_C.sh
JOB_C.sh:
#!/bin/bash
#MSUB -r JOB_C
#MSUB -n 8
#MSUB -q <partition>
#MSUB -A <project>
ccc_mprun ./c.out
Then, only JOB_A.sh has to be submitted. When it finishes, the script submits JOB_B.sh, etc…
Note
If a job is killed or if it reaches its time limit, all the jobs are removed and the last ccc_msub may not be launched.
To avoid this, you can use ccc_tremain from libccc_user or use the #MSUB -w directive as described above.
Environment variables
When a job is submitted, some environment variables are set. Those variables may be used only within the job script.
- BRIDGE_MSUB_JOBID: batch id of the running job
- BRIDGE_MSUB_MAXTIME: time limit in seconds of the job
- BRIDGE_MSUB_PWD: working directory, usually the directory in which the job was submitted
- BRIDGE_MSUB_NPROC: number of requested processes
- BRIDGE_MSUB_NCORE: number of requested cores per process
- BRIDGE_MSUB_REQNAME: job name
Note
You cannot use those variables as input arguments to ccc_msub, nor in the #MSUB headers.
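As an illustration (a sketch; the executable is a placeholder), these variables can be used in the script body, for example to log the job parameters:
#!/bin/bash
#MSUB -r MyJob          # Job name
#MSUB -n 8              # Number of tasks
#MSUB -q <partition>    # Partition name
#MSUB -A <project>      # Project ID
cd ${BRIDGE_MSUB_PWD}   # Move to the submission directory
echo "Job ${BRIDGE_MSUB_REQNAME} (id ${BRIDGE_MSUB_JOBID}): ${BRIDGE_MSUB_NPROC} processes, time limit ${BRIDGE_MSUB_MAXTIME}s"
ccc_mprun ./a.out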
ccc_mprun
The ccc_mprun command launches parallel jobs over the nodes allocated by the resource manager. Inside a submission script, a parallel code is thus launched with the following command:
ccc_mprun ./a.out
By default, ccc_mprun takes its information (number of nodes, number of processors, etc.) from the resource manager to launch the job. You can customize its behavior with command line options. Type ccc_mprun -h for complete and up-to-date documentation.
Here are some basic options of ccc_mprun:
- -n nproc: number of tasks to run
- -c ncore: number of cores per task
- -N nnode: number of nodes to use
- -T time: maximum walltime of the allocation in seconds (optional directive; if not provided, set to the default of 7200)
- -E extra: extra parameters to pass directly to the underlying resource manager
- -m filesystem: file system required by the job. If another file system is unavailable, the job will not be impacted. Without this option, the job is considered to use every file system. For example, ccc_mprun -m scratch,store will run even if the WORK file system is unavailable. You can choose among the following file systems: scratch, work and store. Whether this option works depends on the system and its configuration.
Note
If the resources requested by ccc_mprun are not compatible with the resources previously allocated with ccc_msub, the job will crash with the following error message:
srun: error: Unable to create job step: More processors requested than permitted
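For example (a hypothetical sketch), the following mismatch triggers that error: the script allocates 16 tasks but ccc_mprun asks for 32.
#!/bin/bash
#MSUB -n 16             # 16 tasks allocated by ccc_msub
#MSUB -q <partition>
#MSUB -A <project>
ccc_mprun -n 32 ./a.out # Fails: requests more processors than permitted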
Job submission scripts examples
- Sequential job
#!/bin/bash
#MSUB -r MyJob # Job name
#MSUB -n 1 # Number of tasks to use
#MSUB -T 600 # Elapsed time limit in seconds of the job (default: 7200)
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -q <partition> # Partition name (see ccc_mpinfo)
#MSUB -A <project> # Project ID
set -x
cd ${BRIDGE_MSUB_PWD} # The BRIDGE_MSUB_PWD environment variable contains the directory from which the script was submitted.
./a.out
- Parallel MPI job
#!/bin/bash
#MSUB -r MyJob_Para # Job name
#MSUB -n 32 # Number of tasks to use
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -q <partition> # Partition name (see ccc_mpinfo)
#MSUB -A <project> # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out
- Parallel OpenMP/multi-threaded job
#!/bin/bash
#MSUB -r MyJob_Para # Job name
#MSUB -n 1 # Number of tasks to use
#MSUB -c 16 # Number of threads per task to use
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -q <partition> # Partition name (see ccc_mpinfo)
#MSUB -A <project> # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}
export OMP_NUM_THREADS=16
./a.out
Note
An OpenMP/multi-threaded program can only run inside a node. If you ask for more threads than there are cores available in a node, your submission will be rejected.
- Parallel hybrid OpenMP/MPI or multi-threaded/MPI
#!/bin/bash
#MSUB -r MyJob_ParaHyb # Job name
#MSUB -n 8 # Total number of tasks to use
#MSUB -c 4 # Number of threads per task to use
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -q <partition> # Partition name
#MSUB -A <project> # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}
export OMP_NUM_THREADS=4
ccc_mprun ./a.out # This script will launch 8 MPI tasks. Each task will have 4 OpenMP threads.
- GPU jobs
For examples of GPU job submission scripts, see Running GPU applications.
Job monitoring and control
Once a job is submitted with ccc_msub, it is possible to follow its evolution with several Bridge commands.
We recommend you limit the rate at which your jobs query the batch system to an aggregate of 1-2 times per minute. This includes all Bridge and Slurm queries such as ccc_mpp, ccc_mstat, squeue, sacct, or other Bridge and Slurm commands. Keep in mind this is an aggregate rate across all your jobs: if a single job queries once a minute but 500 of these jobs start at once, the Slurm controller will see a rate of 500 queries per minute, so please scale your rate accordingly.
Warning
Use of watch with Slurm and Bridge commands is strictly prohibited.
ccc_mstat
The ccc_mstat command provides information about jobs on the supercomputer. By default, it displays all the jobs that are either running or pending on the different partitions of the supercomputer. Use the -u option to display only your jobs.
$ ccc_mstat -u
BATCHID NAME USER PROJECT QUEUE QOS PRIO SUBHOST EXEHOST STA TUSED TLIM MLIM CLIM
------- ---- ---- ------- ----- ------ ------ ------- ------- --- -------- -------- ------- ------
229862 MyJob1 mylogin projXXX@partition1 partition1 normal 210000 node174 node174 PEN 0 86400 1865 2240
233463 MyJob2 mylogin projXXX@partition2 partition2 normal 210000 node175 node1331 R00 58631 85680 1865 43200
233464 MyJob3 mylogin projXXX@partition3 partition3 normal 200000 node172 node1067 R01 12171 85680 1865 43200
233582 MyJob4 mylogin projXXX@partition1 partition1 normal 200000 node172 node1104 R01 3620 85680 1865 43200
Here is the information that can be gathered from this output:
- Basic job information (USER, PROJECT, BATCHID, CLIM, QUEUE, TLIM, NAME): describes the parameters with which the job was submitted. It allows you to check that the parameters passed to ccc_msub were taken into account correctly.
- PRIO: The priority of the job depends on many parameters. For instance, it depends on your project and the amount of hours your project consumed. It also depends on how long the job has been waiting in queue. The priority is what will determine the order in which the jobs from different users and different projects will run when the resource is available.
- SUBHOST: The host where the job was submitted.
- EXEHOST: The first host where the job is running.
- STA: The state of the job. Most of the time, it is either pending (PEN) if it is waiting for resources or running (R01) if it has started. Sometimes, the job is in a completing state (R00), which means the job has finished and the computing resources are in a cleanup phase.
- TUSED/TLIM: If the job is running, the TUSED field shows for how long it has been running in seconds. The TLIM field shows the maximum execution time requested at submission.
- MLIM: The maximum memory allowed per core in MB.
- CLIM: The number of cores requested at submission.
Here are command line options for ccc_mstat:
- -f: show jobs full name
- -q queue: show jobs of the requested batch queue
- -u [user]: show jobs of the requested user. If no user is given, it shows the jobs of the current user
- -b batchid: show all the processes related to a job
- -r batchid: show detailed information about a running job
- -H batchid: show detailed information about a finished job
- -O: show jobs exceeding their time limit
ccc_mpp
ccc_mpp provides information about jobs on the supercomputer. By default, it displays all the jobs that are either running or pending on the different partitions of the supercomputer. Use the option -u $USER to display only your jobs.
$ ccc_mpp -u $USER
USER ACCOUNT BATCHID NCPU QUEUE PRIORITY STATE RLIM RUN/START SUSP OLD NAME NODES/REASON
mylogin projXX 1680920 64 partition1 290281 RUN 1.0h 55.0s - 53.0s MyJob1 node8002
mylogin projXX 1680923 84 partition2 284040 PEN 10.0h - - 50.0s MyJob3 Dependency
mylogin projXX 1661942 84 partition2 284040 RUN 24.0h 53.0s - 51.0s MyJob3 node[1283-1285]
mylogin projXX 1680917 1024 partition2 274036 PEN 24.0h - - 7.5m MyJob4 Priority
mylogin projXX 1680921 28 partition3 215270 PEN 24.0h ~05h36 - 52.0s MyJob2 Resources
Here is the information that can be gathered from this output:
- Basic job information (USER, ACCOUNT, BATCHID, NCPU, QUEUE, RLIM, NAME): describes the parameters with which the job was submitted. It allows you to check that the parameters passed to ccc_msub were taken into account correctly.
- PRIORITY: The priority of the job depends on many parameters. For instance, it depends on your project and the amount of hours your project consumed. It also depends on how long the job has been waiting in queue. The priority is what will determine the order in which the jobs from different users and different projects will run when the resource is available.
- STATE: The state of the job. Most of the time, it is either pending (PEN) if it is waiting for resources or running (RUN) if it has started. Sometimes, the job is in a completing state (COMP), which means the job has finished and the computing resources are in a cleanup phase.
- RUN/START: If the job is running, this field shows for how long it has been running. If the job is pending, it sometimes gives an evaluation of the estimated start time. This start time may vary depending on the jobs submitted by other users.
- SUSP: The time spent in a suspended state. Jobs may be suspended by staff when an issue occurs on the supercomputer. In that case, running jobs are not flushed but suspended in order to let them continue once the issue is solved.
- OLD: The total amount of time since the job was submitted. It includes the time spent waiting and running.
- NODES/REASON: If the job is running, this gives you the list of nodes used by the job. If it is pending, it gives you the reason. For example, “Dependency” means you submitted the job with a dependency on another unfinished job. “Priority” means other jobs have a better priority and yours will start running after those. “Resources” means there are not enough cores available at the moment to let the job start; it will have to wait for some other jobs to end and free their allocated resources. “JobHeldAdmin” means that the job has been held by a user who is not the owner (generally an admin with the required rights). Please note that pending jobs with lower priority may display a pending reason that does not reflect their current pending state, since the batch scheduler only updates the pending reason of high priority jobs.
Here are command line options for ccc_mpp:
- -r: prints ‘running’ batch jobs
- -s: prints ‘suspended’ batch jobs
- -p: prints ‘pending’ batch jobs
- -q queue: requested batch queue
- -u user: requested user
- -g group: requested group
- -n: prints results without colors
ccc_mpeek
ccc_mpeek gives information about a job while it runs.
$ ccc_mpeek <jobid>
It is particularly useful to check the output of a job while it is running. The default behavior is to display the standard output, i.e. what you would basically find in the .o log file.
Here are command line options for ccc_mpeek:
- -o: prints the standard output
- -e: prints the standard error output
- -s: prints the job submission script
- -t: same as -o, in tail -f mode
- -S: prints the launched script of the running job
- -d: prints the temporary directory of the running job
- -I: prints the INFO file of the running job
ccc_mpstat
ccc_mpstat gives information about a parallel job during its execution.
$ ccc_mpstat <jobid>
It gives details about the MPI processes and their distribution across nodes, with their rank and affinity.
Here are command line options for ccc_mpstat:
- -r jobid: display resource allocation characteristics
- -a jobid: display active steps belonging to a jobid
- -t stepid: print execution trace (tree format) for the specified stepid
- -m stepid: print the MPI layout for the specified stepid
- -p partition: only print jobs of a particular partition
- -u [user]: only print jobs of a particular user, or of the current user if not specified
ccc_macct
Once the job is done, it will not appear in ccc_mpp anymore. To display information afterwards, the command ccc_macct is available. It works for pending and running jobs but information is more complete once the job has finished. It needs a valid jobID as input.
$ ccc_macct <jobid>
Here is an example of the output of the ccc_macct command:
$ ccc_macct 1679879
Jobid : 1679879
Jobname : MyJob
User : login01
Account : project01@partition
Limits : time = 01:00:00 , memory/task = Unknown
Date : submit = 14/05/2014 09:23:50 , start = 14/05/2014 09:23:50 , end = 14/05/2014 09:30:00
Execution : partition = partition , QoS = normal , Comment = avgpowe+
Resources : ncpus = 32 , nnodes = 2
Nodes=node[1802,1805]
Memory / step
--------------
Resident Size (Mo) Virtual Size (Go)
JobID Max (Node:Task) AveTask Max (Node:Task) AveTask
----------- ------------------------ ------- -------------------------- -------
1679879.bat+ 151 (node1802 : 0) 0 0.00 (node1802 : 0) 0.00
1679879.0 147 (node1805 : 2) 0 0.00 (node1805 : 2) 0.00
1679879.1 148 (node1805 : 8) 0 0.00 (node1805 : 8) 0.00
1679879.2 0 (node1805 : 16) 0 0.00 (node1805 : 16) 0.00
Accounting / step
------------------
JobID JobName Ntasks Ncpus Nnodes Layout Elapsed Ratio CPusage Eff State
------------ ------------ ------ ----- ------ ------- ------- ----- ------- --- -----
1679879 MyJob - 32 2 - 00:06:10 100 - - -
1679879.bat+ batch 1 1 1 Unknown 00:06:10 100.0 - - COMPLETED
1679879.0 exe0 4 4 2 BBlock 00:03:18 53.5 00:02:49 85.3 COMPLETED
1679879.1 exe1 16 16 2 BBlock 00:02:42 43.7 00:00:37 22.8 COMPLETED
1679879.2 exe2 32 32 2 BBlock 00:00:03 .8 - - FAILED
Energy usage / job (experimental)
---------------------------------
Nnodes MaxPower MaxPowerNode AvgPower Duration Energy
------ -------- ------------ -------- -------- ------
2 169W node1802 164W 00:06:10 0kWh
Sometimes, there are several steps for the same job described in ccc_macct. Here, they are called 1679879.0, 1679879.1 and 1679879.2. This happens when there are several calls to ccc_mprun in one submission script. In this case, 3 executables were run with ccc_mprun: exe0, exe1 and exe2. Every other call to commands such as cp, mkdir, etc. will be counted in the step called 1679879.batch.
There are 4 specific sections in the ccc_macct output.
- First section Job summary: It gives all the basic information about submission parameters you can also find in ccc_mpp. There is also submission date and time, when the job has started to run and when it has finished.
- Second section Memory / step: For each step, this section gives an idea of the amount of memory used per process. It gives the memory consumption of the top process.
- Third section Accounting / step: It gives detailed information for each step.
  - Ntasks, Ncpus, Nnodes and Layout describe how the job is distributed among the allocated resources. Ntasks is the number of processes defined by the -n parameter. By default, Ncpus is the same as Ntasks, unless the number of cores per process was set with the -c parameter. The layout can be BBlock, CBlock, BCyclic or CCyclic depending on the process distribution (more information is given in the Advanced Usage documentation).
  - Elapsed is the user time spent in each step. Ratio is the percentage of time spent in this specific step compared to the total time of the job.
  - CPusage is the average CPU time over all the processes in one step and Eff is defined as CPusage/Elapsed*100, which means that if Eff is close to 100, all the processes are equally busy.
- Fourth section Energy usage: It gives an idea of the amount of electric energy used while running this job. This section is only relevant if the job used entire nodes.
ccc_mdel
ccc_mdel lets you kill your jobs, whether they are running or pending.
First, identify the batchid of the job you want to kill, with ccc_mpp for example. Then, kill the job with:
$ ccc_mdel <jobid>
ccc_malter
ccc_malter lets you decrease a job’s time limit. It works on both running and pending jobs.
$ ccc_malter -T <new time limit> <jobid>
Here are available options:
- -T time_limit: decrease the time limit of a job
- -L licenses_string: change the licenses of your job
ccc_affinity
The ccc_affinity command shows you the process and thread affinity for a given job id. The usual format is ccc_affinity [options] JOBID.
Here are available options:
- -l: run on the local node
- -t: display processes and threads (the default is to display processes only)
- -u: specify a username
This is an example of output:
$ ccc_affinity 900481
Host Rank PID %CPU State MEM_kB CPU AFFINITY NAME
node1434:
| 2 8117 310 Sl 34656 28 0,28 life_par_step7
| 3 8118 323 Rl 34576 42 14,42 life_par_step7
node1038:
| 0 6518 323 Rl 34636 0 0,28 life_par_step7
| 1 6519 350 Rl 34732 42 14,42 life_par_step7
And this is the output with the thread option -t:
$ ccc_affinity -t 900481
Host Rank PID %CPU State ThreadID MEM_kB CPU AFFINITY NAME
node1434:
| 2 8117 20.3 Sl 8117 34660 28 0,28 life_par_step7
| `-- -- 0.0 Sl 8125 34660 29 0-13,28-41 life_par_step7
| `-- -- 0.0 Sl 8126 34660 0 0-13,28-41 life_par_step7
| `-- -- 0.0 Sl 8142 34660 29 0-13,28-41 life_par_step7
| `-- -- 0.0 Sl 8149 34660 36 0-13,28-41 life_par_step7
| `-- -- 99.6 Rl 8150 34660 4 4,32 life_par_step7
| `-- -- 99.6 Rl 8151 34660 7 7,35 life_par_step7
| `-- -- 99.6 Rl 8152 34660 11 11,39 life_par_step7
| 3 8118 33.6 Rl 8118 34580 42 14,42 life_par_step7
| `-- -- 0.0 Sl 8124 34580 43 14-27,42-55 life_par_step7
| `-- -- 0.0 Sl 8127 34580 18 14-27,42-55 life_par_step7
| `-- -- 0.0 Sl 8143 34580 50 14-27,42-55 life_par_step7
| `-- -- 0.0 Sl 8145 34580 23 14-27,42-55 life_par_step7
| `-- -- 99.6 Rl 8146 34580 18 18,46 life_par_step7
| `-- -- 99.6 Rl 8147 34580 21 21,49 life_par_step7
| `-- -- 99.6 Rl 8148 34580 25 25,53 life_par_step7
node1038:
| 0 6518 44.1 Rl 6518 34636 28 0,28 life_par_step7
| `-- -- 0.0 Sl 6526 34636 29 0-13,28-41 life_par_step7
| `-- -- 0.0 Sl 6531 34636 11 0-13,28-41 life_par_step7
| `-- -- 0.0 Sl 6549 34636 36 0-13,28-41 life_par_step7
| `-- -- 0.0 Sl 6553 34636 10 0-13,28-41 life_par_step7
| `-- -- 99.8 Rl 6554 34636 4 4,32 life_par_step7
| `-- -- 99.8 Rl 6555 34636 7 7,35 life_par_step7
| `-- -- 99.8 Rl 6556 34636 11 11,39 life_par_step7
| 1 6519 71.1 Sl 6519 34736 42 14,42 life_par_step7
| `-- -- 0.0 Sl 6525 34736 43 14-27,42-55 life_par_step7
| `-- -- 0.0 Sl 6527 34736 17 14-27,42-55 life_par_step7
| `-- -- 0.0 Sl 6548 34736 43 14-27,42-55 life_par_step7
| `-- -- 0.0 Sl 6557 34736 50 14-27,42-55 life_par_step7
| `-- -- 99.6 Rl 6558 34736 18 18,46 life_par_step7
| `-- -- 99.6 Rl 6559 34736 21 21,49 life_par_step7
| `-- -- 99.6 Rl 6560 34736 25 25,53 life_par_step7
We can see which process runs on which node and which thread runs on which core/CPU. The “State” column shows whether a thread is running (R) or sleeping (S). The AFFINITY column shows the CPUs on which a thread is allowed to move and the CPU column shows the CPU on which the thread is currently running.
On a single node, CPUs are numbered up to 55 even though a node has 28 cores. This is because Intel Hyper-Threading allows two threads to run on one core.
Special jobs
Note
A step (or SLURM step) is a call to the ccc_mprun command inside an allocation or a job.
Note
In the following cases, some scripts use the submission directive #MSUB -F. This directive calls a Bridge plugin which uses Flux, a new framework for managing resources and jobs.
Warning
If you want to submit a chained job at the end of a job using the Flux plugin, please prefix the submission command as follows:
BATCH_SYSTEM=slurm ccc_msub [...]
Multiple sequential short steps
In this case, a job contains many sequential and very short steps.
#!/bin/bash
#MSUB -r MyJob_SeqShort # Job name
#MSUB -n 16 # Number of tasks to use
#MSUB -T 3600 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -q <partition> # Partition name
#MSUB -A <project> # Project ID
#MSUB -F # Use Flux plugin
for i in $(seq 1 10000)
do
ccc_mprun -n 16 ./bin1
done
Embarrassingly parallel jobs
An embarrassingly parallel job is a job which launches independent processes in parallel. These processes need few or no communications. We call such an independent process a task.
Multiple concurrent sequential short steps
In this case, a job contains many concurrent, sequential and very short steps.
#!/bin/bash
#MSUB -r MyJob_MulSeqShort # Job name
#MSUB -n 16 # Number of tasks to use
#MSUB -T 3600 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -q <partition> # Partition name
#MSUB -A <project> # Project ID
#MSUB -F # Use Flux plugin
for i in $(seq 1 10000)
do
ccc_mprun -n 8 ./bin1 &
ccc_mprun -n 8 ./bin2 &
wait
done
Multiple concurrent steps
This case is the typical embarrassingly parallel job: a job allocates resources and launches multiple steps which are independent of each other. Flux will launch them through the available resources. When a step ends, a new step will be launched if there are enough resources.
The different tasks to launch must be listed in a simple text file. Each line of this taskfile contains:
- the number of tasks
- the number of cores per task
- the command to launch
For example:
$ cat taskfile.txt
8-2 bin1.exe
1-1 bin.exe file1.dat
2-8 bin2.exe file2.dat
<...>
4-1 bin2.exe file2.dat
And the job script:
#!/bin/bash
#MSUB -r MyJob_Concurrent # Job name
#MSUB -n 256 # Number of tasks to use
#MSUB -T 3600 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -q <partition> # Partition name
#MSUB -A <project> # Project ID
#MSUB -F # Use Flux plugin
ccc_mprun -B taskfile.txt
You can combine the previous actions into one:
#!/bin/bash
#MSUB -r MyJob_Concurrent # Job name
#MSUB -n 256 # Number of tasks to use
#MSUB -T 3600 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -q <partition> # Partition name
#MSUB -A <project> # Project ID
#MSUB -F # Use Flux plugin
ccc_mprun -B <(cat << EOF
8-2 bin1.exe
1-1 bin.exe file1.dat
2-8 bin2.exe file2.dat
<...>
4-1 bin2.exe file2.dat
EOF
)
Multiple concurrent jobs
In this case, you can submit many jobs inside an allocation. Bridge provides the environment variable BRIDGE_MSUB_ARRAY_TASK_ID to differentiate the subjobs.
For example, here is a job script that needs to be submitted 10000 times:
$ cat subjob.sh
#!/bin/bash
<some preprocessing commands>
ccc_mprun bin.exe file_${BRIDGE_MSUB_ARRAY_TASK_ID}.dat
<some postprocessing commands>
And the job script:
#!/bin/bash
#MSUB -r MyJob_ConcurrentJobs # Job name
#MSUB -n 256 # Number of tasks to use
#MSUB -T 3600 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -q <partition> # Partition name
#MSUB -A <project> # Project ID
#MSUB -F # Use Flux plugin
flux resource drain 0
ccc_msub -n 16 -y 1-10000 subjob.sh
flux resource undrain 0
In this example, the job will allocate 256 tasks. Inside, it will submit 10000 subjobs.
GLoST: Greedy Launcher of Small Tasks
GLoST is a lightweight, highly scalable tool for launching independent non-MPI processes in parallel. It has been developed by the TGCC for handling huge to-do-list-like operations. The source code is available on the cea-hpc github. GLoST manages the launch and scheduling of a list of tasks with the command glost_launch. It also allows error detection, continuation of undone operations and relaunch of failed operations thanks to the post-processing script glost_filter.sh.
The different tasks to launch must be listed in a simple text file. Commented and blank lines are supported. Comments added at the end of a line will be printed to the job output; this can be used to tag different tasks. Here is an example of a task file:
$ cat taskfile.list
./bin1 # Tag for task 1
./bin2 # Tag for task 2
./bin3 # Tag for task 3
./bin4 # Tag for task 4
./bin5 # Tag for task 5
./bin6 # Tag for task 6
./bin7 # Tag for task 7
./bin8 # Tag for task 8
./bin9 # Tag for task 9
./bin10 # Tag for task 10
Or with MPI binaries:
$ cat task_mpi.list
ccc_mprun -E"--jobid=${SLURM_JOBID}" -E"--exclusive" -n 3 ./mpi_init
ccc_mprun -E"--jobid=${SLURM_JOBID}" -E"--exclusive" -n 5 ./mpi_init
Here is an example of submission script:
#!/bin/bash
#MSUB -r MyJob_Para # Job name
#MSUB -n 4 # Number of tasks to use
#MSUB -T 3600 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -q <partition> # Partition name
#MSUB -A <project> # Project ID
module load glost
ccc_mprun glost_launch taskfile.list
GLoST will automatically manage the different tasks and schedule them on the different available nodes. Note that one MPI process is reserved for task management so GLoST cannot run on 1 process. For more information, please check out the man page for glost_launch.
Once the job is submitted, information on the task scheduling is provided in the error output of the job. For each task, it shows which process launched it, its exit status and its duration. Here is a typical output for a 4-process job. Process 0 is used for scheduling and the 10 tasks to launch are handled by processes 1, 2 or 3.
#executed by process 3 in 7.01134s with status 0 : ./bin3 # Tag for task 3
#executed by process 1 in 8.01076s with status 0 : ./bin2 # Tag for task 2
#executed by process 2 in 9.01171s with status 0 : ./bin1 # Tag for task 1
#executed by process 2 in 0.00851917s with status 12 : ./bin6 # Tag for task 6
#executed by process 2 in 3.00956s with status 0 : ./bin7 # Tag for task 7
#executed by process 1 in 5.01s with status 0 : ./bin5 # Tag for task 5
#executed by process 3 in 6.01114s with status 0 : ./bin4 # Tag for task 4
Some tasks may exit on errors or not be executed if the job has reached its time limit before launching all the tasks in the taskfile. To help analysing the executed, failed or not-executed tasks, we provide the script glost_filter.sh. By default, it lists all the executed tasks, and adding the -H option highlights the failed tasks. Let’s say the error output file is called example_1050311.e. Here are some useful options:
- List all failed tasks (with exit status other than 0)
$ glost_filter.sh -n taskfile example_1050311.e
./bin6 # Tag for task 6
- List all tasks not executed (may be used to generate the next taskfile)
$ glost_filter.sh -R taskfile example_1050311.e
./bin8 # Tag for task 8
./bin9 # Tag for task 9
./bin10 # Tag for task 10
For more information and options, please check out glost_filter.sh -h.
GLoST also provides the tool glost_bcast.sh which broadcasts a file from a shared filesystem to the local temporary directory (/tmp by default). Typical usage on the center is:
#!/bin/bash
#MSUB -r MyJob_Para # Job name
#MSUB -n 32 # Number of tasks to use
#MSUB -T 3600 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -q <partition> # Partition name
#MSUB -A <project> # Project ID
module load glost
# Broadcast <file> on each node under /tmp/<file>
ccc_mprun -n ${BRIDGE_MSUB_NNODE} -N ${BRIDGE_MSUB_NNODE} glost_bcast <file>
# Alternatively, to broadcast <file> on each node under /dev/shm/<file>
TMPDIR=/dev/shm ccc_mprun -n ${BRIDGE_MSUB_NNODE} -N ${BRIDGE_MSUB_NNODE} glost_bcast <file>
MPMD jobs
An MPMD job (for Multiple Program Multiple Data) is a parallel job that launches different executables over the processes. The different codes still share the same MPI environment. This can be done with the -f option of ccc_mprun and by creating an appfile. The appfile should specify the different executables to launch and the number of processes for each.
Homogeneous
A homogeneous MPMD job is a parallel job where each process has the same number of cores. Here is an example of an appfile:
$ cat app.conf
1 ./bin1
5 ./bin2
26 ./bin3
# This script will launch the 3 executables
# respectively on 1, 5 and 26 processes
And the submission script:
#!/bin/bash
#MSUB -r MyJob_Para # Job name
#MSUB -n 32 # Total number of tasks/processes to use
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -q <partition> # Partition name
#MSUB -A <project> # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun -f app.conf
Note
The total number of processes specified in the appfile cannot be larger than the number of processes requested for the job in the submission script.
In order to have several commands per line in the appfile, the whole line must be executed through a bash command.
- For example, if each binary has to be executed in its own directory:
1 bash -c "cd ./dir-1 && ./bin1"
4 bash -c "cd ./dir-2 && ./bin2"
- Or if you need to export an environment variable that is different for each binary:
1 bash -c "export OMP_NUM_THREADS=3; ./bin1;"
4 bash -c "export OMP_NUM_THREADS=1; ./bin2;"
Heterogeneous
A heterogeneous MPMD job is a parallel job where each process can have a different number of cores. Heterogeneous MPMD is enabled by loading the feature/bridge/heterogenous_mpmd module. Here is an example of an appfile:
$ cat app.conf
1-2 bash -c "export OMP_NUM_THREADS=2; ./bin1"
5-4 bash -c "export OMP_NUM_THREADS=4; ./bin2"
2-5 bash -c "export OMP_NUM_THREADS=5; ./bin3"
# This script will launch the 3 executables
# respectively on 1, 5 and 2 processes with 2, 4 and 5 cores
# 1*2 + 5*4 + 2*5 = 32 (#MSUB -n 32)
On each line of the appfile, the first number indicates how many processes run the command that follows, while the second one indicates the number of cores allocated to each process.
And the submission script:
#!/bin/bash
#MSUB -r MyJob_Para # Job name
#MSUB -n 32 # Total number of tasks and cores to use
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -q <partition> # Partition name
#MSUB -A <project> # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}
module load feature/bridge/heterogenous_mpmd
ccc_mprun -f app.conf