Job submission

Computing nodes are shared among all users of the computing center. A job scheduler manages access to these shared resources and lets users book the resources they need for their computations. Job submission, resource allocation and job launch are handled by the batch scheduler called SLURM. An abstraction layer called Bridge provides uniform command-line tools and more integrated ways to access batch scheduling systems.

To submit a batch job, you first have to write a shell script containing:

  • a set of directives to tell which resources your job needs.
  • instructions to execute your code.

You can launch the job by submitting its script to the batch scheduler. It will enter a batch queue. When resources become available, the job is launched over its allocated nodes. Jobs can be monitored.

Scheduling policy

Jobs are submitted to the supercomputer resource manager which will decide how to schedule them. There are hundreds of jobs submitted daily and the order in which they are launched depends on several parameters.

Jobs are sorted by priority. The priority is an integer value initially assigned to the job when it is submitted. The job with the highest priority will start first as soon as enough resources are available. Sometimes, a job with a lower priority may start first but only if it does not delay any job with a higher priority. This mechanism is called backfill scheduling and jobs can easily be backfilled if they require few computing resources and have a small time limit.

The priority is the sum of three components, listed below in order of importance:

  • The Quality of Service (QoS): The QoS component is a constant value associated with the QoS chosen at submission. Its value depends on the priority factor given by the ccc_mqinfo command. This is the component with the most influence on the overall priority. A usage limit can also be set to prevent jobs from running as long as the project has not fallen under a certain consumption. See QoS for more information.
  • The project’s fair-share: The fair-share component reflects the ratio between the total share of allocated hours and the amount of hours consumed by the chosen project. The fair-share value is high if the project is under-consuming its allocated hours and low if it is over-consuming them. A half-life decay with a period of 14 days is applied to the fair-share computation, so an over- or under-consumption of hours has a decreasing impact on new job submissions. After 2 months (more than four half-life periods), the negative impact of an over-consumption is almost insignificant.
  • The job’s age: The age component depends on the time spent in a pending state while waiting for resources. Its value is incremented regularly for 7 days. After that delay, it no longer increases.

In order to reduce the waiting time as much as possible, try to:

  • Use your computing hours evenly throughout your project duration. This increases your fair-share value and thus the priority of your jobs.
  • Specify a time limit as close as possible to the real duration of your job instead of keeping the default limit of 2 hours. That way you are more likely to benefit from the backfill scheduling mechanism, especially for small jobs.

Choosing the file systems

Your job submissions can use the -m option to specify the file systems the compute nodes will need for execution. This avoids job suspension if an unused file system becomes unavailable.

example: ccc_msub -m scratch,store will run even if the WORK file system is unavailable. You can choose the following file systems: scratch, work and store.

Warning

On Irene, your job submissions MUST use the -m option. Failing to provide the -m option will actually prevent your compute nodes from using any of scratch, work or store.
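
For example, a typical submission declaring all three file systems could look like this (the script name is illustrative):

$ ccc_msub -m scratch,work,store my_script.sh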

Submission scripts

ccc_msub

ccc_msub <script> is the command used to submit your batch job to the resource manager. The options passed to ccc_msub will determine the number of cores allocated to the job.

There are two ways of passing arguments to ccc_msub.

  • Use the #MSUB keyword in the submission script. All the lines beginning with #MSUB will be parsed and the corresponding parameters will be taken into account by the batch manager.
  • Use command line arguments. If the same argument is specified both on the command line and in the submission script, the command line argument takes precedence.

Note that you cannot pass arguments to the submission script when launching with ccc_msub.
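
For instance, here is a minimal sketch of both mechanisms (values are illustrative):

$ cat script.sh
#!/bin/bash
#MSUB -n 16                  # Requested in the script...
ccc_mprun ./a.out
$ ccc_msub -n 32 script.sh   # ...overridden on the command line: 32 tasks are allocated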

For a short documentation and list of available options, see the help message:

$ ccc_msub -h

Basic options

  • -o output_file: standard output file (special character %I will be replaced by the job ID)
  • -e error_file: standard error file (special character %I will be replaced by the job ID)

All the output and error messages generated by the job are redirected to the log files specified with -o and -e. If you want all output to be redirected in one single file, just set the same name for -o and -e.
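
For example, to merge both streams into a single log file (the file name is illustrative):

#MSUB -o myjob_%I.log
#MSUB -e myjob_%I.log        # Same file as -o: stderr is merged with stdout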

  • -r reqname: job name

  • -A projid: project/account name

  • -m filesystem: file system required by the job. If another file system is unavailable, the job will not be impacted. Without the option, the job is considered to use every file system

    example: ccc_msub -m scratch,store will run even if the WORK file system is unavailable. You can choose the following file systems: scratch, work and store.

  • -@ mailopts: mail options following the pattern mailaddr[:begin|end|requeue]

    example: ccc_msub -@ jdoe@foo.com:begin,end will send a mail to jdoe at the beginning and the end of the job. The default behavior depends on the underlying batch system.

Partitions

The compute nodes are gathered in partitions according to their hardware characteristics (CPU architecture, amount of RAM, presence of GPUs, etc.). A partition is a set of identical nodes that can be targeted to host one or several jobs. Choosing the right partition for a job depends on the code prerequisites in terms of hardware resources. For example, executing a code designed to be GPU-accelerated requires a partition with GPU nodes.

The partition is specified with the -q option. This option is mandatory while submitting.

The ccc_mpinfo command lists the available partitions for the current supercomputer. For each partition, it will display the number of available cores and nodes but also give some hardware specifications of the nodes composing the partition. For more information, see the Supercomputer architecture section.

Note

  • This choice is exclusive: your job can only be submitted to one of those architectures at a time.
  • Job priority and limits are not related to the partition but to the QoS (Quality of Service) parameter (option -Q). QoS is the main component of job priority.
  • Depending on the allocation granted to your project, you may not have access to all the partitions. You can check on which partition(s) your project has allocated hours thanks to the command ccc_myproject.

Resources

There are several options that will influence the number of cores that will be allocated for the job.

  • -n: number of tasks that will be used in parallel mode (default=1)

  • -c: number of cores per parallel task to allocate (default=1)

    If the -c parameter is not specified, the number of cores allocated is equal to the number of parallel tasks requested with -n, so specifying -n is enough for most basic jobs. The -c option is useful when each MPI process needs more resources than a single core, either for more CPU power or for more memory (see the sketch after this list):

    • For hybrid MPI/OpenMP jobs, each MPI process needs several cores in order to spawn its OpenMP threads. In that case, -c is usually set to the value passed to OMP_NUM_THREADS.
    • For jobs requiring a lot of memory, each MPI process may need more than the 4 GB it is granted by default. In that case, the amount of memory can be increased by allocating more cores to each process. For example, with -c 2, each process can use the memory of 2 cores: 8 GB.
  • -N: number of nodes to allocate for parallel usage (default is chosen by the underlying system)

    The -N option is not necessary in most cases. The number of nodes to use is inferred from the number of cores requested and the number of cores per node of the partition used.

  • -x: request for exclusive usage of allocated nodes

    The -x option forces the allocation of a whole node, even if only a couple of cores are requested. This is the default configuration for jobs requiring more than 128 cores.

  • -T time_limit: maximum walltime of the batch job in seconds (optional directive, if not provided, set to default = 7200)

    It may be useful to set an accurate walltime in order to benefit from backfill scheduling.
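
As an illustration of the -c option discussed in the list above, here is a minimal sketch of a memory-bound MPI job (job name and sizes are illustrative):

#!/bin/bash
#MSUB -r BigMemJob           # Job name
#MSUB -n 16                  # 16 MPI processes
#MSUB -c 2                   # 2 cores per process: up to 8 GB of memory each
#MSUB -q <partition>         # Partition name
#MSUB -A <project>           # Project ID
ccc_mprun ./a.out            # 16 processes spread over the 32 allocated cores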

Skylake partition

  • To submit a job on a Skylake node, it is advised to recompile your code to use the vectorisation instructions supported by Skylake, as shown in the Compiling for Skylake section. Apart from the partition used, your submission script should not change much:
#!/bin/bash
#MSUB -q skylake
#MSUB -n 40
ccc_mprun ./my-application

QoS

One can specify a Quality of Service (QoS) for each job submitted to the scheduler. The QoS associated with a job affects it in two ways: scheduling priority and limits. Depending on the required amount of resources, the duration or the purpose of the job (debugging, normal production, etc.), you have to select the appropriate QoS. It lets you trade a high job submission limit for a lower job priority, or a high job priority for fewer resources and a shorter duration.

ccc_mqinfo displays the available job QoS and the associated limitations:

$ ccc_mqinfo
Name     Partition  Priority  MaxCPUs  SumCPUs  MaxNodes  MaxRun  MaxSub     MaxTime
-------  ---------  --------  -------  -------  --------  ------  ------  ----------
long             *        20     2048     2048                        32  3-00:00:00
normal           *        20                                         128  1-00:00:00
test             *        40                           2               2    00:30:00
test     partition        40      280      280        10               2    00:30:00

For instance, to develop or debug your code, you may submit a job using the test QoS, which will be scheduled faster. This QoS is limited to 2 jobs of at most 30 minutes and 10 nodes each. CPU time accounting does not depend on the chosen QoS.

To specify a QoS, use the -Q option on the command line or add the #MSUB -Q <qosname> directive to your submission script. If no QoS is mentioned, the default QoS normal is used.
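
For example, a minimal sketch of a debugging job using the test QoS (partition, project and sizes are illustrative):

#!/bin/bash
#MSUB -r MyDebugJob          # Job name
#MSUB -n 16                  # Number of tasks to use
#MSUB -T 1200                # Within the 30-minute limit of the test QoS
#MSUB -Q test                # Higher priority, strict size and duration limits
#MSUB -q <partition>         # Partition name (see ccc_mpinfo)
#MSUB -A <project>           # Project ID
ccc_mprun ./a.out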

A usage limit can be set to manage hours consumption within a project. The usage limit can be set to high, medium or low with the #MSUB -U <limit> option.

  • A job submitted with -U high is scheduled normally, as described in Scheduling policy. This is the default behavior if no usage limit is specified.
  • A job submitted with -U medium will only be scheduled if the current project consumption is less than 20% over the suggested use at this time.
  • A job submitted with -U low will only be scheduled if the current project consumption is more than 20% under the suggested use at this time.

Medium and low priority jobs may stay pending even if all the resources are free. This prevents non-important jobs from over-consuming project hours, which would lower the default priority of future important jobs.
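
For example, a non-urgent job could be submitted with a usage limit as follows (a sketch; sizes are illustrative):

#!/bin/bash
#MSUB -r LowPrioJob          # Job name
#MSUB -n 8                   # Number of tasks to use
#MSUB -U low                 # Only runs while the project is well under its suggested consumption
#MSUB -q <partition>         # Partition name
#MSUB -A <project>           # Project ID
ccc_mprun ./a.out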

Dependencies between jobs

Note

The following methods are equivalent concerning the priorities of dependent jobs. See the Scheduling policy section for more details.

The ccc_msub command offers two options to prevent some jobs from running at the same time: -w and -a.

  • The option -w makes jobs with the same submission name (specified with -r) run in the order they were submitted. For example, jobs submitted with the following options will never run at the same time.
#MSUB -r SAME_NAME
#MSUB -w
  • The option -a indicates that a job depends on another (already submitted) job. Use it with the id of the existing job.
ccc_msub -a <existing_job> script.sh
  • One can write multi-step jobs that run sequentially by submitting the next job at the end of the current script. Here are some example scripts:

JOB_A.sh:

#!/bin/bash
#MSUB -r JOB_A
#MSUB -n 32
#MSUB -q <partition>
#MSUB -A <project>
ccc_mprun ./a.out
ccc_msub JOB_B.sh

JOB_B.sh:

#!/bin/bash
#MSUB -r JOB_B
#MSUB -n 16
#MSUB -q <partition>
#MSUB -A <project>
ccc_mprun ./b.out
ccc_msub JOB_C.sh

JOB_C.sh:

#!/bin/bash
#MSUB -r JOB_C
#MSUB -n 8
#MSUB -q <partition>
#MSUB -A <project>
ccc_mprun ./c.out

Then, only JOB_A.sh has to be submitted. When it finishes, its script submits JOB_B.sh, and so on.

Note

If a job is killed or reaches its time limit, its processes are removed and the final ccc_msub of its script may never be launched, breaking the chain. To avoid this, you can use ccc_tremain from libccc_user or the #MSUB -w directive described above.

Environment variables

When a job is submitted, several environment variables are set. These variables may only be used within the job script.

  • BRIDGE_MSUB_JOBID: batch ID of the running job.
  • BRIDGE_MSUB_MAXTIME: time limit of the job, in seconds.
  • BRIDGE_MSUB_PWD: working directory, usually the directory in which the job was submitted.
  • BRIDGE_MSUB_NPROC: number of requested processes.
  • BRIDGE_MSUB_NCORE: number of requested cores per process.
  • BRIDGE_MSUB_REQNAME: job name.
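
As a sketch, these variables can be used in the body of a submission script, for instance for logging (the echo line is illustrative):

#!/bin/bash
#MSUB -r MyJob               # Job name
#MSUB -n 8                   # Number of tasks to use
#MSUB -q <partition>         # Partition name
#MSUB -A <project>           # Project ID
cd ${BRIDGE_MSUB_PWD}        # Move to the submission directory
echo "Job ${BRIDGE_MSUB_REQNAME} (id ${BRIDGE_MSUB_JOBID}): ${BRIDGE_MSUB_NPROC} tasks, limit ${BRIDGE_MSUB_MAXTIME}s"
ccc_mprun ./a.out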

Note

You cannot use these variables as input arguments to ccc_msub, nor in #MSUB directives.

ccc_mprun

The ccc_mprun command launches parallel jobs over the nodes allocated by the resource manager. Inside a submission script, a parallel code is launched with the following command:

ccc_mprun ./a.out

By default, ccc_mprun takes its information (number of nodes, number of processors, etc.) from the resource manager to launch the job. You can customize its behavior with command line options. Type ccc_mprun -h for an up-to-date and complete documentation.

Here are some basic options of ccc_mprun:

  • -n nproc: number of tasks to run
  • -c ncore: number of cores per task
  • -N nnode: number of nodes to use
  • -T time: maximum walltime of the allocation in seconds (optional directive; if not provided, set to the default of 7200)
  • -E extra: extra parameters to pass directly to the underlying resource manager
  • -m filesystem: file system required by the job. If another file system is unavailable, the job will not be impacted. Without this option, the job is considered to use every file system. For example, ccc_mprun -m scratch,store will run even if the WORK file system is unavailable. The available file systems are scratch, work and store. Whether this option is honored depends on the system and its configuration.
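
For instance, a step may use a different layout than the submission options, as long as it fits within the allocation (a sketch; values are illustrative):

#!/bin/bash
#MSUB -n 32                  # Allocation: 32 cores
#MSUB -q <partition>         # Partition name
#MSUB -A <project>           # Project ID
ccc_mprun -n 16 -c 2 ./a.out # 16 tasks x 2 cores = 32 cores: fits the allocation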

Note

If the resources requested by ccc_mprun are not compatible with the resources previously allocated with ccc_msub, the job will crash with the following error message:

srun: error: Unable to create job step: More processors requested than permitted

Job submission scripts examples

  • Sequential job
#!/bin/bash
#MSUB -r MyJob               # Job name
#MSUB -n 1                   # Number of tasks to use
#MSUB -T 600                 # Elapsed time limit in seconds of the job (default: 7200)
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -e example_%I.e        # Error output. %I is the job id
#MSUB -q <partition>         # Partition name (see ccc_mpinfo)
#MSUB -A <project>           # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}        # The BRIDGE_MSUB_PWD environment variable contains the directory from which the script was submitted.
./a.out
  • Parallel MPI job
#!/bin/bash
#MSUB -r MyJob_Para          # Job name
#MSUB -n 32                  # Number of tasks to use
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -e example_%I.e        # Error output. %I is the job id
#MSUB -q <partition>         # Partition name (see ccc_mpinfo)
#MSUB -A <project>           # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out
  • Parallel OpenMP/multi-threaded job
#!/bin/bash
#MSUB -r MyJob_Para          # Job name
#MSUB -n 1                   # Number of tasks to use
#MSUB -c 16                  # Number of threads per task to use
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -e example_%I.e        # Error output. %I is the job id
#MSUB -q <partition>         # Partition name (see ccc_mpinfo)
#MSUB -A <project>           # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}
export OMP_NUM_THREADS=16
./a.out

Note

An OpenMP/multi-threaded program can only run inside a single node. If you request more threads than there are cores in a node, your submission will be rejected.

  • Parallel hybrid OpenMP/MPI or multi-threaded/MPI
#!/bin/bash
#MSUB -r MyJob_ParaHyb             # Job name
#MSUB -n 8                         # Total number of tasks to use
#MSUB -c 4                         # Number of threads per task to use
#MSUB -T 1800                      # Elapsed time limit in seconds
#MSUB -o example_%I.o              # Standard output. %I is the job id
#MSUB -e example_%I.e              # Error output. %I is the job id
#MSUB -q <partition>               # Partition name
#MSUB -A <project>                 # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}
export OMP_NUM_THREADS=4
ccc_mprun ./a.out # This script will launch 8 MPI tasks. Each task will have 4 OpenMP threads.
  • GPU jobs

For examples of GPU job submission scripts, see Running GPU applications.

Job monitoring and control

Once a job is submitted with ccc_msub, it is possible to follow its evolution with several Bridge commands.

We recommend that you limit the rate at which your jobs query the batch system to an aggregate of 1-2 times per minute. This includes all Bridge and SLURM queries such as ccc_mpp, ccc_mstat, squeue, sacct, or other Bridge and SLURM commands. Keep in mind that this is an aggregate rate across all your jobs: a single job that queries once a minute is fine, but if 500 such jobs start at once, the SLURM controller sees a rate of 500 queries per minute. Please scale your query rate accordingly.

Warning

Use of watch with SLURM and Bridge commands is strictly prohibited.

ccc_mstat

The ccc_mstat command provides information about jobs on the supercomputer. By default, it displays all the jobs that are either running or pending on the different partitions of the supercomputer. Use the option -u to display only your jobs.

$ ccc_mstat -u
BATCHID  NAME     USER     PROJECT             QUEUE      QOS     PRIO   SUBHOST  EXEHOST   STA    TUSED     TLIM    MLIM   CLIM
-------  ----     ----     -------             -----      ------  ------ -------  -------   --- -------- -------- ------- ------
229862   MyJob1   mylogin  projXXX@partition1  partition1 normal  210000 node174  node174   PEN        0    86400    1865   2240
233463   MyJob2   mylogin  projXXX@partition2  partition2 normal  210000 node175  node1331  R00    58631    85680    1865  43200
233464   MyJob3   mylogin  projXXX@partition3  partition3 normal  200000 node172  node1067  R01    12171    85680    1865  43200
233582   MyJob4   mylogin  projXXX@partition1  partition1 normal  200000 node172  node1104  R01     3620    85680    1865  43200

Here is the information that can be gathered from this output:

  • Basic job information (USER, PROJECT, BATCHID, CLIM, QUEUE, TLIM, NAME): Describes the parameters with which the job was submitted. It lets you check that the parameters passed to ccc_msub were taken into account correctly.
  • PRIO: The priority of the job depends on many parameters. For instance, it depends on your project and the amount of hours your project has consumed. It also depends on how long the job has been waiting in the queue. The priority determines the order in which jobs from different users and different projects will run when resources become available.
  • SUBHOST: The host where the job was submitted.
  • EXEHOST: The first host where the job is running.
  • STA: The state of the job. Most of the time, it is either pending (PEN) if it is waiting for resources or running (R01) if it has started. Sometimes, the job is in a completing state (R00), meaning the job has finished and the computing resources are in a cleanup phase.
  • TUSED/TLIM: If the job is running, the TUSED field shows for how long it has been running, in seconds. The TLIM field shows the maximum execution time requested at submission.
  • MLIM: The maximum memory allowed per core in MB.
  • CLIM: The number of cores requested at submission.

Here are command line options for ccc_mstat:

  • -f: show jobs full name.
  • -q queue: show jobs of requested batch queue.
  • -u [user]: show jobs of the requested user. If no user is given, it shows the jobs of the current user.
  • -b batchid: show all the processes related to a job.
  • -r batchid: show detailed information of a running job.
  • -H batchid: show detailed information of a finished job.
  • -O: show jobs exceeding the time limit.

ccc_mpp

ccc_mpp provides information about jobs on the supercomputer. By default, it displays all the jobs that are either running or pending on the different partitions of the supercomputer. Use the option -u $USER to display only your jobs.

$ ccc_mpp -u $USER
USER    ACCOUNT BATCHID    NCPU    QUEUE        PRIORITY  STATE   RLIM    RUN/START   SUSP   OLD     NAME    NODES/REASON
mylogin projXX  1680920    64      partition1   290281    RUN     1.0h    55.0s       -      53.0s   MyJob1  node8002
mylogin projXX  1680923    84      partition2   284040    PEN     10.0h   -           -      50.0s   MyJob3  Dependency
mylogin projXX  1661942    84      partition2   284040    RUN     24.0h   53.0s       -      51.0s   MyJob3  node[1283-1285]
mylogin projXX  1680917    1024    partition2   274036    PEN     24.0h   -           -      7.5m    MyJob4  Priority
mylogin projXX  1680921    28      partition3   215270    PEN     24.0h   ~05h36      -      52.0s   MyJob2  Resources

Here is the information that can be gathered from this output:

  • Basic job information (USER, ACCOUNT, BATCHID, NCPU, QUEUE, RLIM, NAME): Describes the parameters with which the job was submitted. It lets you check that the parameters passed to ccc_msub were taken into account correctly.
  • PRIORITY: The priority of the job depends on many parameters. For instance, it depends on your project and the amount of hours your project has consumed. It also depends on how long the job has been waiting in the queue. The priority determines the order in which jobs from different users and different projects will run when resources become available.
  • STATE: The state of the job. Most of the time, it is either pending (PEN) if it is waiting for resources or running (RUN) if it has started. Sometimes, the job is in a completing state (COMP), meaning the job has finished and the computing resources are in a cleanup phase.
  • RUN/START: If the job is running, this field shows for how long it has been running. If the job is pending, it sometimes gives an evaluation of the estimated start time. This start time may vary depending on the jobs submitted by other users.
  • SUSP: The time spent in a suspended state. Jobs may be suspended by the staff when an issue occurs on the supercomputer. In that case, running jobs are not flushed but suspended, so that they can continue once the issue is solved.
  • OLD: The total amount of time since the job was submitted. It includes the time spent waiting and running.
  • NODES/REASON: If the job is running, this gives the list of nodes used by the job. If it is pending, it gives the reason. “Dependency” means the job was submitted with a dependency on another unfinished job. “Priority” means other jobs have a higher priority and yours will start after them. “Resources” means there are not enough cores available at the moment: the job has to wait for other jobs to end and free their allocated resources. “JobHeldAdmin” means the job has been held by a user who is not the owner (generally an administrator with the required rights). Please note that pending jobs with lower priority may display a pending reason that does not reflect their current state, since the batch scheduler only updates the pending reason of high-priority jobs.

Here are command line options for ccc_mpp:

  • -r: prints ‘running’ batch jobs
  • -s: prints ‘suspended’ batch jobs
  • -p: prints ‘pending’ batch jobs
  • -q queue: requested batch queue
  • -u user: requested user
  • -g group: requested group
  • -n: prints results without colors

ccc_mpeek

ccc_mpeek gives information about a job while it runs.

$ ccc_mpeek <jobid>

It is particularly useful to check the output of a job while it is running. The default behavior is to display the standard output, which is essentially what you would find in the .o log file.

Here are command line options for ccc_mpeek:

  • -o: prints the standard output
  • -e: prints the standard error output
  • -s: prints the job submission script
  • -t: same as -o in tail -f mode
  • -S: prints the launched script of the running job
  • -d: prints the temporary directory of the running job
  • -I: prints the INFO file of the running job

ccc_mpstat

ccc_mpstat gives information about a parallel job during its execution.

$ ccc_mpstat <jobid>

It gives details about the MPI processes and their distribution across nodes, with their rank and affinity.

Here are command line options for ccc_mpstat:

  • -r jobid: display resource allocation characteristics
  • -a jobid: display active steps belonging to a jobid
  • -t stepid: print execution trace (tree format) for the specified stepid
  • -m stepid: print MPI layout for the specified stepid
  • -p partition: only print jobs of a particular partition
  • -u [user]: only print jobs of a particular user, or the current one if not specified

ccc_macct

Once a job is done, it no longer appears in ccc_mpp. To display information afterwards, the ccc_macct command is available. It also works for pending and running jobs, but the information is more complete once the job has finished. It needs a valid job ID as input.

$ ccc_macct <jobid>

Here is an example of the output of the ccc_macct command:

$ ccc_macct 1679879

Jobid     : 1679879
Jobname   : MyJob
User      : login01
Account   : project01@partition
Limits    : time = 01:00:00 , memory/task = Unknown
Date      : submit = 14/05/2014 09:23:50 , start = 14/05/2014 09:23:50 , end = 14/05/2014 09:30:00
Execution : partition = partition , QoS = normal , Comment = avgpowe+
Resources : ncpus = 32 , nnodes = 2
   Nodes=node[1802,1805]

Memory / step
--------------
                   Resident Size (Mo)                     Virtual Size (Go)
JobID          Max     (Node:Task)       AveTask    Max  (Node:Task)            AveTask
-----------    ------------------------  -------    --------------------------  -------
1679879.bat+     151 (node1802   :   0)       0      0.00 (node1802   :   0)    0.00
1679879.0        147 (node1805   :   2)       0      0.00 (node1805   :   2)    0.00
1679879.1        148 (node1805   :   8)       0      0.00 (node1805   :   8)    0.00
1679879.2          0 (node1805   :  16)       0      0.00 (node1805   :  16)    0.00

Accounting / step
------------------

JobID          JobName             Ntasks  Ncpus Nnodes     Layout       Elapsed   Ratio      CPusage    Eff  State
------------   ------------        ------  ----- ------     -------      -------   -----      -------    ---  -----
1679879       MyJob                     -     32      2           -     00:06:10     100            -      -  -
1679879.bat+  batch                     1      1      1     Unknown     00:06:10   100.0            -      -  COMPLETED
1679879.0     exe0                      4      4      2      BBlock     00:03:18    53.5     00:02:49   85.3  COMPLETED
1679879.1     exe1                     16     16      2      BBlock     00:02:42    43.7     00:00:37   22.8  COMPLETED
1679879.2     exe2                     32     32      2      BBlock     00:00:03      .8            -      -  FAILED

Energy usage / job (experimental)
---------------------------------

Nnodes  MaxPower  MaxPowerNode  AvgPower     Duration      Energy
------  --------  ------------  --------     --------      ------
    2       169W  node1802         164W     00:06:10        0kWh

Sometimes, several steps are described for the same job in ccc_macct. Here, they are called 1679879.0, 1679879.1 and 1679879.2. This happens when there are several calls to ccc_mprun in one submission script. In this case, 3 executables were run with ccc_mprun: exe0, exe1 and exe2. Every other command, such as cp, mkdir, etc., is counted in the step called 1679879.batch.

There are 4 specific sections in the ccc_macct output.

  • First section, Job summary: It gives all the basic information about submission parameters that you can also find in ccc_mpp, as well as the submission date and time, when the job started to run and when it finished.
  • Second section, Memory / step: For each step, this section gives an idea of the amount of memory used per process. It gives the memory consumption of the top process.
  • Third section, Accounting / step: It gives detailed information for each step.
    • Ntasks, Ncpus, Nnodes and Layout describe how the job is distributed among the allocated resources. Ntasks is the number of processes defined by the -n parameter. By default, Ncpus is the same as Ntasks, unless the number of cores per process was set with the -c parameter. The layout can be BBlock, CBlock, BCyclic or CCyclic depending on the process distribution (more information is given in the Advanced Usage documentation).
    • Elapsed is the user time spent in each step. Ratio is the percentage of time spent in this specific step compared to the total time of the job.
    • CPusage is the average CPU time of all the processes in one step and Eff is defined as CPusage/Elapsed*100, so an Eff close to 100 means all the processes are equally busy. For instance, for step 1679879.0 above, CPusage = 00:02:49 (169 s) and Elapsed = 00:03:18 (198 s), giving Eff = 169/198*100 ≈ 85.3, as shown in the table.
  • Fourth section, Energy usage: It gives an idea of the amount of electric energy used while running this job. This section is only relevant if the job used entire nodes.

ccc_mdel

ccc_mdel lets you kill your jobs, whether they are running or pending.

First, identify the batchid of the job you need to kill, for example with ccc_mpp. Then, kill the job with:

$ ccc_mdel <jobid>

ccc_malter

ccc_malter lets you decrease the time limit of your jobs. It only works while they are running or pending.

$ ccc_malter -T <new time limit> <jobid>

Here are available options:

  • -T time_limit: Decrease the time limit of a job
  • -L licenses_string: Change licenses of your job

ccc_affinity

The ccc_affinity command shows you the process and thread affinity for a given job ID. The usual format is ccc_affinity [options] JOBID.

Here are available options:

  • -l: Run on local node.
  • -t: Display processes and threads. Default is to display processes only.
  • -u: Specify a username.

This is an example of output:

$ ccc_affinity 900481
Host             Rank  PID        %CPU  State MEM_kB     CPU   AFFINITY             NAME
node1434:
 |               2     8117       310   Sl    34656      28    0,28                 life_par_step7
 |               3     8118       323   Rl    34576      42    14,42                life_par_step7
node1038:
 |               0     6518       323   Rl    34636      0     0,28                 life_par_step7
 |               1     6519       350   Rl    34732      42    14,42                life_par_step7

And this is with thread option -t:

$ ccc_affinity -t 900481
Host             Rank  PID        %CPU   State ThreadID   MEM_kB     CPU   AFFINITY             NAME
node1434:
 |               2     8117       20.3   Sl    8117       34660      28    0,28                 life_par_step7
 |               `--   --         0.0    Sl    8125       34660      29    0-13,28-41           life_par_step7
 |               `--   --         0.0    Sl    8126       34660      0     0-13,28-41           life_par_step7
 |               `--   --         0.0    Sl    8142       34660      29    0-13,28-41           life_par_step7
 |               `--   --         0.0    Sl    8149       34660      36    0-13,28-41           life_par_step7
 |               `--   --         99.6   Rl    8150       34660      4     4,32                 life_par_step7
 |               `--   --         99.6   Rl    8151       34660      7     7,35                 life_par_step7
 |               `--   --         99.6   Rl    8152       34660      11    11,39                life_par_step7
 |               3     8118       33.6   Rl    8118       34580      42    14,42                life_par_step7
 |               `--   --         0.0    Sl    8124       34580      43    14-27,42-55          life_par_step7
 |               `--   --         0.0    Sl    8127       34580      18    14-27,42-55          life_par_step7
 |               `--   --         0.0    Sl    8143       34580      50    14-27,42-55          life_par_step7
 |               `--   --         0.0    Sl    8145       34580      23    14-27,42-55          life_par_step7
 |               `--   --         99.6   Rl    8146       34580      18    18,46                life_par_step7
 |               `--   --         99.6   Rl    8147       34580      21    21,49                life_par_step7
 |               `--   --         99.6   Rl    8148       34580      25    25,53                life_par_step7
node1038:
 |               0     6518       44.1   Rl    6518       34636      28    0,28                 life_par_step7
 |               `--   --         0.0    Sl    6526       34636      29    0-13,28-41           life_par_step7
 |               `--   --         0.0    Sl    6531       34636      11    0-13,28-41           life_par_step7
 |               `--   --         0.0    Sl    6549       34636      36    0-13,28-41           life_par_step7
 |               `--   --         0.0    Sl    6553       34636      10    0-13,28-41           life_par_step7
 |               `--   --         99.8   Rl    6554       34636      4     4,32                 life_par_step7
 |               `--   --         99.8   Rl    6555       34636      7     7,35                 life_par_step7
 |               `--   --         99.8   Rl    6556       34636      11    11,39                life_par_step7
 |               1     6519       71.1   Sl    6519       34736      42    14,42                life_par_step7
 |               `--   --         0.0    Sl    6525       34736      43    14-27,42-55          life_par_step7
 |               `--   --         0.0    Sl    6527       34736      17    14-27,42-55          life_par_step7
 |               `--   --         0.0    Sl    6548       34736      43    14-27,42-55          life_par_step7
 |               `--   --         0.0    Sl    6557       34736      50    14-27,42-55          life_par_step7
 |               `--   --         99.6   Rl    6558       34736      18    18,46                life_par_step7
 |               `--   --         99.6   Rl    6559       34736      21    21,49                life_par_step7
 |               `--   --         99.6   Rl    6560       34736      25    25,53                life_par_step7

We can see which process runs on which node and which thread runs on which core/CPU. The State column shows whether a thread is running (R) or sleeping (S). The AFFINITY column shows the CPUs on which a thread is allowed to move, and the CPU column shows the CPU on which a thread is currently running.

On a single node, CPUs are numbered up to 55 even though a node has 28 cores. This is due to Intel Hyper-Threading, which allows two threads to run on each core.

Special jobs

Note

A step (or SLURM step) is a call to the ccc_mprun command inside an allocation or a job.

Note

In the following cases, some scripts use the #MSUB -F submission directive. This directive calls a Bridge plugin which uses Flux, a new framework for managing resources and jobs.

Warning

If you want to submit a chained job at the end of a job using the Flux plugin, prefix the submission command as follows:

BATCH_SYSTEM=slurm ccc_msub [...]

Multiple sequential short steps

In this case, a job contains many sequential and very short steps.

#!/bin/bash
#MSUB -r MyJob_SeqShort            # Job name
#MSUB -n 16                        # Number of tasks to use
#MSUB -T 3600                      # Elapsed time limit in seconds
#MSUB -o example_%I.o              # Standard output. %I is the job id
#MSUB -e example_%I.e              # Error output. %I is the job id
#MSUB -q <partition>               # Partition name
#MSUB -A <project>                 # Project ID
#MSUB -F                           # Use Flux plugin

for i in $(seq 1 10000)
do
   ccc_mprun -n 16 ./bin1
done

Embarrassingly parallel jobs

An embarrassingly parallel job is a job which launches independent processes in parallel. These processes need few or no communications. We call such an independent process a task.

Multiple concurrent sequential short steps

In this case, a job contains many concurrent, sequential and very short steps.

#!/bin/bash
#MSUB -r MyJob_MulSeqShort         # Job name
#MSUB -n 16                        # Number of tasks to use
#MSUB -T 3600                      # Elapsed time limit in seconds
#MSUB -o example_%I.o              # Standard output. %I is the job id
#MSUB -e example_%I.e              # Error output. %I is the job id
#MSUB -q <partition>               # Partition name
#MSUB -A <project>                 # Project ID
#MSUB -F                           # Use Flux plugin

for i in $(seq 1 10000)
do
   ccc_mprun -n 8 ./bin1 &
   ccc_mprun -n 8 ./bin2 &
   wait
done

Multiple concurrent steps

This case is the typical embarrassingly parallel job: a job allocates resources and launches multiple steps which are independent from each other. Flux launches them over the available resources. When a step ends, a new step is launched if there are enough resources.

The different tasks to launch must be listed in a simple text file. Each line of this taskfile contains:

  • the number of tasks
  • the number of cores per task
  • the command to launch

For example:

$ cat taskfile.txt
8-2 bin1.exe
1-1 bin.exe file1.dat
2-8 bin2.exe file2.dat
<...>
4-1 bin2.exe file2.dat

And the job script:

#!/bin/bash
#MSUB -r MyJob_Concurrent          # Job name
#MSUB -n 256                       # Number of tasks to use
#MSUB -T 3600                      # Elapsed time limit in seconds
#MSUB -o example_%I.o              # Standard output. %I is the job id
#MSUB -e example_%I.e              # Error output. %I is the job id
#MSUB -q <partition>               # Partition name
#MSUB -A <project>                 # Project ID
#MSUB -F                           # Use Flux plugin

ccc_mprun -B taskfile.txt

You can combine the previous actions into one:

#!/bin/bash
#MSUB -r MyJob_Concurrent          # Job name
#MSUB -n 256                       # Number of tasks to use
#MSUB -T 3600                      # Elapsed time limit in seconds
#MSUB -o example_%I.o              # Standard output. %I is the job id
#MSUB -e example_%I.e              # Error output. %I is the job id
#MSUB -q <partition>               # Partition name
#MSUB -A <project>                 # Project ID
#MSUB -F                           # Use Flux plugin

ccc_mprun -B <(cat << EOF
8-2 bin1.exe
1-1 bin.exe file1.dat
2-8 bin2.exe file2.dat
<...>
4-1 bin2.exe file2.dat
EOF
)

Multiple concurrent jobs

In this case, you can submit many jobs inside an allocation. Bridge provides the BRIDGE_MSUB_ARRAY_TASK_ID environment variable to differentiate the subjobs.

For example, here is a job script that needs to be submitted 10000 times:

$ cat subjob.sh
#!/bin/bash
<some preprocessing commands>
ccc_mprun bin.exe file_${BRIDGE_MSUB_ARRAY_TASK_ID}.dat
<some postprocessing commands>

And the job script:

#!/bin/bash
#MSUB -r MyJob_ConcurrentJobs      # Job name
#MSUB -n 256                       # Number of tasks to use
#MSUB -T 3600                      # Elapsed time limit in seconds
#MSUB -o example_%I.o              # Standard output. %I is the job id
#MSUB -e example_%I.e              # Error output. %I is the job id
#MSUB -q <partition>               # Partition name
#MSUB -A <project>                 # Project ID
#MSUB -F                           # Use Flux plugin

flux resource drain 0
ccc_msub -n 16 -y 1-10000 subjob.sh
flux resource undrain 0

In this example, the job allocates 256 tasks and submits 10000 subjobs inside this allocation.

GLoST: Greedy Launcher of Small Tasks

GLoST is a lightweight, highly scalable tool for launching independent non-MPI processes in parallel. It has been developed by the TGCC for handling huge to-do-list-like operations. The source code is available on the cea-hpc GitHub. GLoST manages the launch and scheduling of a list of tasks with the glost_launch command. It also allows error detection, continuation of undone operations and relaunch of failed operations thanks to the post-processing script glost_filter.sh.

The different tasks to launch must be listed in a simple text file. Commented and blank lines are supported. Comments added at the end of a line are printed to the job output, which can be used to tag the different tasks. Here is an example of a task file:

$ cat taskfile.list
./bin1 # Tag for task 1
./bin2 # Tag for task 2
./bin3 # Tag for task 3
./bin4 # Tag for task 4
./bin5 # Tag for task 5
./bin6 # Tag for task 6
./bin7 # Tag for task 7
./bin8 # Tag for task 8
./bin9 # Tag for task 9
./bin10 # Tag for task 10

Or with MPI binaries:

$ cat task_mpi.list
ccc_mprun -E"--jobid=${SLURM_JOBID}" -E"--exclusive" -n 3 ./mpi_init
ccc_mprun -E"--jobid=${SLURM_JOBID}" -E"--exclusive" -n 5 ./mpi_init

Here is an example of submission script:

#!/bin/bash
#MSUB -r MyJob_Para                # Job name
#MSUB -n 4                         # Number of tasks to use
#MSUB -T 3600                      # Elapsed time limit in seconds
#MSUB -o example_%I.o              # Standard output. %I is the job id
#MSUB -e example_%I.e              # Error output. %I is the job id
#MSUB -q <partition>               # Partition name
#MSUB -A <project>                 # Project ID

module load glost
ccc_mprun glost_launch taskfile.list

GLoST will automatically manage the different tasks and schedule them on the available nodes. Note that one MPI process is reserved for task management, so GLoST cannot run on a single process. For more information, please check out the glost_launch man page.

Once the job is submitted, information on the task scheduling is provided in the error output of the job. For each task, it shows which process launched it, its exit status and its duration. Here is a typical output for a 4-process job: process 0 is used for scheduling and the 10 tasks are handled by processes 1, 2 and 3.

#executed by process    3 in 7.01134s     with status     0 : ./bin3 # Tag for task 3
#executed by process    1 in 8.01076s     with status     0 : ./bin2 # Tag for task 2
#executed by process    2 in 9.01171s     with status     0 : ./bin1 # Tag for task 1
#executed by process    2 in 0.00851917s  with status     12 : ./bin6 # Tag for task 6
#executed by process    2 in 3.00956s     with status     0 : ./bin7 # Tag for task 7
#executed by process    1 in 5.01s        with status     0 : ./bin5 # Tag for task 5
#executed by process    3 in 6.01114s     with status     0 : ./bin4 # Tag for task 4

Some tasks may exit with errors, or may not be executed at all if the job reaches its time limit before launching all the tasks in the taskfile. To help analyse the executed, failed or not-executed tasks, we provide the glost_filter.sh script. By default, it lists all the executed tasks, and adding the -H option highlights the failed ones. Assuming the error output file is called example_1050311.e, here are some useful options:

  • List all failed tasks (with exit status other than 0)
$ glost_filter.sh -n taskfile example_1050311.e
./bin6 # Tag for task 6
  • List all tasks not executed (may be used to generate the next taskfile)
$ glost_filter.sh -R taskfile example_1050311.e
./bin8 # Tag for task 8
./bin9 # Tag for task 9
./bin10 # Tag for task 10

For more information and options, please check out glost_filter.sh -h.

GLoST also provides the glost_bcast.sh tool, which broadcasts a file from a shared file system to the local temporary directory (/tmp by default). Typical usage on the center is:

#!/bin/bash
#MSUB -r MyJob_Para                # Job name
#MSUB -n 32                        # Number of tasks to use
#MSUB -T 3600                      # Elapsed time limit in seconds
#MSUB -o example_%I.o              # Standard output. %I is the job id
#MSUB -e example_%I.e              # Error output. %I is the job id
#MSUB -q <partition>               # Partition name
#MSUB -A <project>                 # Project ID

module load glost

# Broadcast <file> on each node under /tmp/<file>
ccc_mprun -n ${BRIDGE_MSUB_NNODE} -N ${BRIDGE_MSUB_NNODE} glost_bcast <file>

# Alternatively, to broadcast <file> on each node under /dev/shm/<file>
TMPDIR=/dev/shm ccc_mprun -n ${BRIDGE_MSUB_NNODE} -N ${BRIDGE_MSUB_NNODE} glost_bcast <file>

MPMD jobs

An MPMD job (for Multiple Program Multiple Data) is a parallel job that launches different executables over its processes. The different codes still share the same MPI environment. This is done with the -f option of ccc_mprun and by creating an appfile, which specifies the different executables to launch and the number of processes for each.

Homogeneous

A homogeneous MPMD job is a parallel job where every process has the same number of cores. Here is an example of an appfile:

$ cat app.conf
1       ./bin1
5       ./bin2
26      ./bin3

# This script will launch the 3 executables
# respectively on 1, 5 and 26 processes

And the submission script:

#!/bin/bash
#MSUB -r MyJob_Para                # Job name
#MSUB -n 32                        # Total number of tasks/processes to use
#MSUB -T 1800                      # Elapsed time limit in seconds
#MSUB -o example_%I.o              # Standard output. %I is the job id
#MSUB -e example_%I.e              # Error output. %I is the job id
#MSUB -q <partition>               # Partition name
#MSUB -A <project>                 # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}

ccc_mprun -f app.conf

Note

The total number of processes specified in the appfile cannot be larger than the number of processes requested for the job in the submission script.

In order to have several commands per line in the appfile, it is necessary to execute the whole line through bash -c.

  • For example, if each binary has to be executed in its own directory:
1 bash -c "cd ./dir-1 && ./bin1"
4 bash -c "cd ./dir-2 && ./bin2"
  • Or if you need to export an environment variable that differs for each binary:
1 bash -c "export OMP_NUM_THREADS=3; ./bin1;"
4 bash -c "export OMP_NUM_THREADS=1; ./bin2;"

Heterogeneous

A heterogeneous MPMD job is a parallel job where each process can have a different number of threads. Heterogeneous MPMD is enabled by loading the feature/bridge/heterogenous_mpmd module. Here is an example of an appfile:

$ cat app.conf
1-2 bash -c "export OMP_NUM_THREADS=2; ./bin1"
5-4 bash -c "export OMP_NUM_THREADS=4; ./bin2"
2-5 bash -c "export OMP_NUM_THREADS=5; ./bin3"

# This script will launch the 3 executables
# respectively on 1, 5 and 2 processes with 2, 4 and 5 cores
# 1*2 + 5*4 + 2*5 = 32 (#MSUB -n 32)

On each line, the first number indicates how many processes run the command that follows, while the second one indicates the number of cores allocated to each process.

And the submission script:

#!/bin/bash
#MSUB -r MyJob_Para                # Job name
#MSUB -n 32                        # Total number of tasks and cores to use
#MSUB -T 1800                      # Elapsed time limit in seconds
#MSUB -o example_%I.o              # Standard output. %I is the job id
#MSUB -e example_%I.e              # Error output. %I is the job id
#MSUB -q <partition>               # Partition name
#MSUB -A <project>                 # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}
module load feature/bridge/heterogenous_mpmd

ccc_mprun -f app.conf