
Aries is a 176-GPU cluster, containing AMD GPUs (courtesy of AMD).

Jobs are allocated by node, not by GPU. Since there are 8 GPUs per node, each submitted job must be able to use all 8 GPUs. Remember to set the --gres=gpu option in your script, as it is not set automatically. Below, we provide examples of how to use this resource effectively.

Since each job submission on Aries runs exclusively on an entire GPU node, you must launch 8 parallel GPU processes (or launch a single process that can use all 8 GPUs simultaneously). Ideally, all 8 parallel runs should have similar run times to maximize efficiency. Make sure the 8 parallel runs write to differently named output files.

All Aries users should have access to the data storage directory /work/cms16/. Create a sub-directory within /work/cms16/ named after your username:

create dir
mkdir /work/cms16/$USER

Write all data to your directory in /work/cms16/username, not to your /home. Your /home directory can store around 20 GB, while the /work partition has a total of 100 TB. Regularly remove your data from /work to ensure the stability of the resource. /work is not backed up!
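To check how much space you are using before cleaning up, the standard disk-usage tools are enough; a quick check with the paths above:

check usage
df -h /work                    # overall usage of the /work partition
du -sh /work/cms16/$USER       # size of your own directory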

Important note: set up ssh communication keys between nodes, as explained in the pdf at the end of this page (ARIES_Quick_Start_wl52_20220406.pdf).
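A minimal sketch of the usual key setup is shown below, assuming /home is shared across the nodes so that authorizing your own public key is sufficient for password-less logins between them; follow the PDF for the cluster-specific steps.

ssh key setup (sketch)
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa    # passphrase-less key pair
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys     # authorize the key for your own account
chmod 600 ~/.ssh/authorized_keys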

Partition commons

This partition contains 22 GPU nodes (gn01-gn22). Each node has 8 AMD MI50 (Vega20) GPUs, 96 CPUs, and 512 GB of memory, except for the last three nodes, which each have 128 CPUs and 8 AMD MI100 GPUs.


  • 19x MI50 Nodes (gn01-gn19): 1x AMD EPYC 7642 processor (96 CPUs), 512GB RAM, 2TB storage, HDR Infiniband, 8x AMD Radeon Instinct MI50 32GB GPUs.
  • 3x MI100 Nodes (gn20-gn22): 2x AMD EPYC 7V13 processors (128 CPUs), 512GB RAM, 2TB storage, HDR Infiniband, 8x AMD Radeon Instinct MI100 32GB GPUs.

Partition highmem

This partition contains 2 CPU nodes, each with 64 CPUs and about 4 TB of RAM. These nodes do not have GPUs.

  • 2x Large Memory Nodes (hm01-02): 2x AMD EPYC 7302 processors (64 CPUs), 4TB RAM, 4TB storage, HDR Infiniband.


Singularity Container

For submitting OpenMM jobs, a Singularity container with OpenMM pre-installed is available:

Singularity container for OpenMM
container=/home/pcw2/bin/openmm-ctbp.sif

This container does not include OpenSMOG or other CTBP-specific tools; these will need to be installed with pip3. For example, to install OpenSMOG:

$ pip3 install OpenSMOG
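Since the container image itself is read-only, one possible approach (an assumption, not a documented requirement) is a user-level install executed through the container's Python, which ends up in your home directory, mounted inside the container by default:

install OpenSMOG through the container (sketch)
container=/home/pcw2/bin/openmm-ctbp.sif
singularity exec $container pip3 install --user OpenSMOG   # installs into ~/.local, visible inside the container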

A usage example with a bash submission script, an OpenMM run Python script, and input files can be downloaded below.

aries_example.tar.gz

Job submission script
#!/bin/bash -l
#SBATCH --job-name=ctbpexample
#SBATCH --nodes=1
#SBATCH --cpus-per-task=96        #set to 96 if not using MPI (OpenMM does not use MPI)
#SBATCH --tasks-per-node=1
#SBATCH --export=ALL
#SBATCH --mem=0                   #each GPU assigned 32 GB by default
#SBATCH --gres=gpu:8 
#SBATCH --time=1-00:00:00         #max run time is 1 day
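
The header above only reserves the node; the script body still has to start the GPU processes. A minimal sketch of a body that launches 8 container runs, one per GPU, is shown below. The run_openmm.py script and the input/output names are placeholders, and the --rocm flag requires Singularity 3.5 or newer:

Example script body (sketch)
container=/home/pcw2/bin/openmm-ctbp.sif

# One run per GPU; HIP_VISIBLE_DEVICES restricts each run to a single AMD GPU.
# Singularity passes the host environment into the container by default.
for i in $(seq 0 7)
do
    HIP_VISIBLE_DEVICES=$i singularity exec --rocm $container \
        python3 run_openmm.py input_$((i+1)) output_$((i+1)).log &
done
wait    # do not return until all 8 runs have finished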

Launcher_GPU

Another possibility for submitting jobs is to use the Launcher_GPU module and bind each run/task to one GPU. For this, remember to set '--cpus-per-task' to the total number of CPUs divided by the number of '--tasks-per-node'. In the example below, as required on Aries, the job simultaneously uses all 8 GPUs on the node: we run 8 simulations, each on a single GPU. For this example, OpenMM is loaded as a module.


Job submission script Launcher_GPU
#!/bin/bash -l

#SBATCH --job-name=my_job
#SBATCH --account=commons
#SBATCH --partition=commons
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --threads-per-core=2
#SBATCH --mem-per-cpu=3G
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00
#SBATCH --export=ALL

echo "Submitting simulations..."

module purge

# Using Launcher_GPU on ARIES 
module load GCC/10.2.0 OpenMPI/4.0.5 OpenMM/7.5.0 foss/2020b Launcher_GPU

# This is for controlling Launcher
export LAUNCHER_WORKDIR=`pwd`
export LAUNCHER_JOB_FILE=$PWD/launcher_jobs_sim
export LAUNCHER_BIND=1

echo "Job started on " `date`
echo "Running on hostname" `hostname`
echo "Job $SLURM_JOB_ID is running on: $SLURM_NODELIST"
echo "Job SLURM_SUBMIT_DIR is $SLURM_SUBMIT_DIR"
echo "Running on $SLURM_NNODES nodes"
echo "Running on $SLURM_NPROCS processors"
echo "CPUS per task is $SLURM_CPUS_PER_TASK"
echo "LAUNCHER_WORKDIR: $LAUNCHER_WORKDIR"
echo "Number of replicas is $max_replicas"
df -h

# This will adjust the total number of runs to nodes*8
max_replicas=$((SLURM_NNODES*8))

rm $LAUNCHER_WORKDIR/launcher_jobs_sim &> /dev/null

# Create Launcher_job_file needed by $LAUNCHER_DIR/paramrun
for i in `seq 1 $max_replicas`
do
	echo "python run_code.py input_$i output_${i}.log" >> $LAUNCHER_WORKDIR/launcher_jobs_sim
done

# This line launches the jobs in the launcher_jobs_sim file
$LAUNCHER_DIR/paramrun

echo "My job finished at:" `date`

Additional Information

Acknowledgments

If you used the cluster, please acknowledge it in your publications:

This work was made possible by the donation of critical hardware and resources from AMD COVID-19 HPC Fund.

The following image can be used for posters and presentations.

AMD COVID-19 HPC Fund
