Parallel IO
MPI-IO
MPI-IO provides a low-level interface for parallel I/O. It was introduced as a standard in MPI-2.
More information on how to use MPI-IO can be found in the I/O section of the official MPI documentation.
Recommended data usage on parallel file systems
Technical considerations
At the computing center, parallel file systems are implemented with Lustre technology as follows:
- One MDS (Metadata Server) stores namespace metadata, such as file names, directories, access permissions, and file layout.
- Multiple OSSes (Object Storage Servers) store file data.
- Each supercomputer node is a client that accesses and uses the data.
When a file is opened or created:
- The supercomputer node sends a request to the MDS.
- The MDS performs a handshake between the supercomputer node and the OSSes.
- The node can then read and/or write by transferring data directly to and from the OSSes.
The important point is that all open, close and file-name-related calls from all the supercomputer nodes go to a single server: the MDS.
These calls are often referred to as metadata calls, and too many of them can severely degrade the overall performance of the file system.
The following sections provide data usage guidelines to ensure the proper operation of the parallel file systems.
Big files vs small files
The Lustre file system is optimized for bandwidth rather than latency.
- When manipulating small files, I/O performance is dominated by latency (the handshakes with the MDS).
- When manipulating big files, the impact of latency is considerably reduced.
Thus, prefer a few big files over many small files. To perform basic tasks on large files or file trees, such as copying, archiving and deletion, it is recommended to use MpiFileUtils as described in the MpiFileUtils section. These utilities are only effective on Lustre file systems (STORE, SCRATCH and WORK) and should not be used from or to NFS file systems (HOME).
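For instance, a run directory containing thousands of small result files can be aggregated into a single archive before any further copy or archiving step (the directory and file names below are only placeholders):
# Aggregate many small files into one big archive ("results_run42" is a placeholder name)
tar -cf results_run42.tar results_run42/
# A single large tar file is much cheaper to copy, archive or delete
# than the thousands of small files it contains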
Number of files per directory
Many metadata operations involve not only the MDS but also the OSSes.
Let us take a classic ls -l as an example and see what happens behind the scenes:
- The client asks the MDS to list all files in the current directory.
- The MDS lists the names of the files (sub-directories, regular files, etc.).
- The MDS knows everything about the sub-directories.
- But it does not know the sizes of non-directory files (nor their last modification dates)!
- So, for each file, the MDS synchronizes with every OSS involved (summing sizes, taking the latest modification date, etc.).
Hence, the more non-directory files a directory contains, the slower the operation will be.
A similar scheme applies to any file operation that issues a stat system call for each file, as the find or du commands do.
This is the main reason why many file operations become slower as the number of regular files within a directory increases.
- As a consequence, it is strongly advised to keep fewer than 10,000 files per directory.
- If you have more files, organize them into sub-directories (directories do not involve the OSSes), for instance as sketched below.
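Here is a minimal sketch (file and directory names are hypothetical) that spreads a large collection of flat files into sub-directories based on a prefix of their names, so that no single directory holds too many entries:
# Hypothetical example: move files named data_0000.dat, data_0001.dat, ...
# into sub-directories data_00/, data_01/, ... based on their name prefix
for f in data_????.dat; do
    bucket=${f:0:7}        # e.g. "data_00" for data_0042.dat
    mkdir -p "$bucket"
    mv "$f" "$bucket/"
done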
Files on STORE
STORE is a Lustre file system linked to a Hierarchical Storage Management (HSM) system that relies on magnetic tape storage.
Thus, it shares the constraints of Lustre file systems and those of magnetic tape storage.
Because tape storage has high latency and the number of tape drives is limited, having big files is nearly mandatory.
That is why, in order to reduce retrieval time:
- each file should be larger than 1GB (if needed, use tar to aggregate your small files)
- each file should be smaller than 1TB
When you can control the file size (e.g. by creating tarballs), aiming for around 100GB is recommended. Practical experience shows that a 100GB file is quick to retrieve with ccc_hsm get, and that subsequent manipulation or extraction is also fast.
You may use ccc_quota to check file size statistics on STORE (they are updated daily).
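As a hedged illustration (the directory names are placeholders, and the $CCCSTOREDIR variable is assumed to point to your storedir), small files can be aggregated into a tarball of roughly the recommended size before being written to STORE, then recalled and checked later:
# Aggregate small result files into a single large tarball on STORE
# ($CCCSTOREDIR is assumed to point to your storedir)
tar -cf $CCCSTOREDIR/simulation_2023.tar simulation_2023/
# Later, recall the file from tape to disk before working with it,
# then extract it, e.g. in a directory on the scratchdir
ccc_hsm get $CCCSTOREDIR/simulation_2023.tar
tar -xf $CCCSTOREDIR/simulation_2023.tar
# Check file size statistics on STORE (updated daily)
ccc_quota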
Example of good usage of the filesystems
You have compiled your own program and installed it in your home directory under ~/products. In your home, your program will be safe. You may also use user-defined modules that you would create in ~/products/modules/modulefiles, as described in Extend your environment with modulefiles; this may simplify the use of your code. To avoid reaching the quota limit on the home, do not use it by default for your work.
The best place to write submission files and to launch your jobs is the workdir. One of its advantages is that it is shared between the supercomputers. By launching your jobs from the workdir, you will be able to keep the submission scripts and the log files.
Let us say the code generates lots of files, for example restart files, so that a new job can start where the last one stopped. In that case, the best option is to run the code on the scratchdir: it is the fastest and largest filesystem. However, data located on the scratchdir may be purged if they are not used for a while. Therefore, once your job has run successfully and you have obtained the result files, move them to an appropriate location if you want to keep them. That is the main purpose of the storedir: once the job is finished, you can gather its result files in a big tar file and copy it to the storedir.
So, a typical job would be launched from the workdir, where the submission scripts are located. The job would go through the following steps:
- Create a temporary directory on the scratchdir and go to this directory
- Untar large input files from the storedir into this directory on the scratchdir
- Launch the code
- Once it has finished running, check the results
- If the results are not needed at the moment for other runs, make a tar of the useful files and copy it to the storedir
- Remove the temporary directory
Note that those steps are not compulsory and do not apply to every job. It is just one possible example of correct use of the different filesystems.
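As a hedged sketch of such a job (the partition, project, core count, executable name and result directory are placeholders, and the $CCCWORKDIR, $CCCSCRATCHDIR and $CCCSTOREDIR variables are assumed to point to your workdir, scratchdir and storedir), the steps above could translate into a submission script along these lines:
#!/bin/bash
#MSUB -r filesystem_workflow
#MSUB -q <partition>
#MSUB -n 32
#MSUB -T 7200
#MSUB -A <Project>
#MSUB -m work,scratch,store
set -e
# 1. Create a temporary directory on the scratchdir and go to it
#    (use a unique name, e.g. based on the job ID, to avoid collisions)
RUN_DIR=$CCCSCRATCHDIR/run_example
mkdir -p $RUN_DIR
cd $RUN_DIR
# 2. Untar the large input files from the storedir into this directory
tar -xf $CCCSTOREDIR/inputs.tar
# 3. Launch the code ("./my_code" is a placeholder)
ccc_mprun ./my_code
# 4. Check the results (application-specific), then gather the useful
#    files into a single tar copied to the storedir
tar -cf $CCCSTOREDIR/results_example.tar results/
# 5. Remove the temporary directory
cd $CCCWORKDIR
rm -rf $RUN_DIR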
Parallel compression and decompression with pigz
Instead of using classical compression tools like gzip (the compression tool used by default with tar), we recommend its parallel counterpart pigz for faster processing.
With this tool, we advise limiting the number of compression threads to between 4 and 8 for a good performance/resource ratio. Increasing the number of threads further should not dramatically improve performance and could even slow down your compression. To speed up the process, you may also lower the compression level at the cost of a reduced compression ratio (a variant is sketched after the example below). Decompression is always performed by a single thread, with three additional threads used for reading, writing and checking.
Please do not use this tool on login nodes, and prefer an interactive submission with ccc_mprun -s or with a batch script.
Compression and decompression example using 6 threads:
#!/bin/bash
#MSUB -n 1
#MSUB -c 6
#MSUB -q <partition>
#MSUB -A <Project>
module load pigz
# compression:
# we are forced to create a wrapper around pigz if we want to use
# specific options to change the default behaviour of pigz.
# Note that '$@' is important because tar can pass arguments to pigz
cat <<EOF > pigz.sh
#!/bin/bash
pigz -p 6 \$@
EOF
chmod +x ./pigz.sh
tar -I ./pigz.sh -cf folder.tar.gz folder
# decompression:
tar -I pigz -xf folder.tar.gz
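If speed matters more than compression ratio, the same kind of wrapper can also lower the compression level, for example using pigz's fastest level -1 (the default level is 6):
# Variant of the wrapper with a lower compression level for faster processing
cat <<EOF > pigz_fast.sh
#!/bin/bash
pigz -p 6 -1 \$@
EOF
chmod +x ./pigz_fast.sh
tar -I ./pigz_fast.sh -cf folder.tar.gz folder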
For additional information, please refer to the man-pages of the software:
$ man pigz
MpiFileUtils
MpiFileUtils is a suite of utilities for handling file trees and large files. It is optimized for HPC and uses MPI parallelization. It offers tools for basic tasks like copy, remove, and compare on such datasets, delivering better performance than their single-process counterparts.
dcp - Copy files. Using 64 processes with 16 cores each, dcp achieves a data rate more than 6 times higher than a regular cp running on a full node when copying 80GB spread over 1800 files on the scratch file system.
dtar - Create and extract tape archive files. With the same configuration, dtar achieves a data rate more than 6 times higher than a regular tar running on a full node when archiving 80GB spread over 1800 files on the scratch file system.
dbcast - Broadcast a file to each compute node.
dbz2 - Compress and decompress a file with bz2.
dchmod - Change owner, group, and permissions on files.
dcmp - Compare contents between directories or files.
ddup - Find duplicate files.
dfind - Filter files.
dreln - Update symlinks to point to a new path.
drm - Remove files.
dstripe - Restripe files (Lustre).
dsync - Synchronize source and destination directories or files.
dtar - Create and extract tape archive files.
dwalk - List, sort, and profile files.
Sample script for dcp, dtar and drm
#!/bin/bash
#MSUB -r dcp_dtar
#MSUB -q |default_CPU_partition|
#MSUB -T 3600
#MSUB -n 64
#MSUB -c 16
#MSUB -x
#MSUB -A <Project>
#MSUB -m work,scratch
ml pu
ml mpi
ml mpifileutils
ccc_mprun dtar -cf to_copy.tar to_tar/ # Create an archive
ccc_mprun dcp to_copy.tar copy_dcp.tar # Copy the archive
ccc_mprun drm copy_dcp.tar # Remove the copy of the archive
Sample script for dstripe
#!/bin/bash
#MSUB -r dstripe
#MSUB -q rome
#MSUB -T 1500
#MSUB -Q test
#MSUB -n 16
#MSUB -c 8
#MSUB -x
#MSUB -A <Project>
#MSUB -m work,scratch
ml gnu/8
ml mpifileutils
ccc_mprun dstripe -c <number of desired stripes> to_copy.tar
For more information, see: https://mpifileutils.readthedocs.io/en/v0.11.1/