The Aries cluster, provided by AMD, is composed of 176 AMD GPUs distributed over 22 nodes, along with 2 additional nodes without GPUs, each equipped with 4 TB of RAM.

Jobs are allocated by nodes, not GPUs. Since there are 8 GPUs per node, each submitted job must be able to use all 8 GPUs.

Remember to set the --gres=gpu option properly in your script, as it is not automatic. Below, we provide examples of how to use this resource effectively.

Since each job submission on Aries runs exclusively on an entire GPU node, the user must launch 8 parallel GPU processes (or launch a single process that can use all 8 GPUs simultaneously).

Ideally, all 8 parallel runs should have similar run times to maximize efficiency. Make sure the 8 parallel runs write to differently named output files.
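As a minimal sketch of this pattern (run_code.py, the input naming, and the environment setup are placeholders, not files provided by the cluster), a job can launch 8 background runs, one per GPU, with distinct output names:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00

# Launch 8 background runs, one pinned to each AMD GPU on the node.
for i in $(seq 0 7); do
    # HIP_VISIBLE_DEVICES restricts each run to a single GPU;
    # each run writes to its own output file.
    HIP_VISIBLE_DEVICES=$i python run_code.py input_$i > output_${i}.log 2>&1 &
done
wait   # the job ends only after all 8 runs finish
```

The trailing `wait` is what keeps the allocation alive until the slowest of the 8 runs completes, which is why similar run times matter.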

...

Code Block (bash): create dir
mkdir /work/cms16/$USER

Write all generated data to your directory in /work/cms16/username, not in your /home. Your /home directory can store around 20 GB, while the /work partition has a total of 100 TB. Regularly remove your data from /work to ensure stability of the resource. /work is not backed up!
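A typical pattern, assuming the /work/cms16 path from this page (the run directory name is an arbitrary example):

```shell
# Create a per-user work directory and run from there,
# so all generated data lands on /work rather than /home.
WORKDIR=/work/cms16/$USER
mkdir -p $WORKDIR/my_run
cd $WORKDIR/my_run
```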


In consideration of your colleagues also using the Aries cluster, if you are going to have more than 10 jobs in the queue at a time (80 GPUs), please limit the wall time to 12 hours. This will help other users get jobs through, as well.

One way to regulate the number of jobs running simultaneously is to use job arrays with a percent sign at the end to limit concurrent jobs; for example, #SBATCH --array=0-39%10 runs 40 jobs, but only 10 at the same time.
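A sketch of an array submission script using that throttle (run_code.py and the input naming are placeholders):

```shell
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --partition=commons
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --array=0-39%10   # 40 array elements total, at most 10 running at once

# SLURM_ARRAY_TASK_ID selects the input for this array element,
# and also gives each element a distinct output file.
python run_code.py input_${SLURM_ARRAY_TASK_ID} > output_${SLURM_ARRAY_TASK_ID}.log
```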

Another way to regulate the number of jobs running simultaneously is to add dependencies to the jobs with the #SBATCH --dependency=afterany:previous_job_id directive. A sample script to submit multiple jobs with dependencies can be found here.
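Chaining can also be done at submission time from the command line. A sketch (job1.sh, job2.sh, job3.sh are placeholder batch scripts):

```shell
# Submit a chain of jobs where each waits for the previous one to end.
# --parsable makes sbatch print only the job ID, so it can be captured.
jid=$(sbatch --parsable job1.sh)
jid=$(sbatch --parsable --dependency=afterany:$jid job2.sh)
jid=$(sbatch --parsable --dependency=afterany:$jid job3.sh)
```

With afterany, each job starts once its predecessor terminates, regardless of exit status; afterok can be used instead to require success.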

Important note: set up ssh communication keys between nodes, as explained in the pdf at the end of this page (ARIES_Quick_Start_wl52_20220406.pdf).
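The PDF linked above is the authoritative reference; a typical one-time setup on the login node looks like the following (for your own account only):

```shell
# Create a passphrase-less key and authorize it for your own account,
# so node-to-node ssh works inside multi-node jobs.
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa   # skip if a key already exists
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```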

Partitions

Partition commons

The partition commons comprises 22 GPU nodes (gn01-gn22), with 19 MI50 nodes (gn01-gn19) and 3 MI100 nodes (gn20-gn22).

  • 19x MI50 Nodes (gn01-gn19):
    • Processor: 96 CPUs → 1x AMD EPYC 7642 (48 cores, 2 CPUs/core)
    • RAM: 512 GB
    • Storage: 2 TB 
    • Network: HDR Infiniband
    • GPUs: 8x AMD Radeon Instinct MI50 32GB
  • 3x MI100 Nodes (gn20-gn22):
    • Processor: 128 CPUs → 2x AMD EPYC 7V13 (64 cores, 1 CPU/core)
    • RAM: 512 GB
    • Storage: 2 TB 
    • Network: HDR Infiniband
    • GPUs: 8x AMD Radeon Instinct MI100 32GB

Partition highmem

The partition highmem contains 2 CPU nodes (hm01-02), each with 64 CPUs and about 4 TB of RAM. These nodes do not have GPUs.

  • 2x Large Memory Nodes (hm01-02):
    • Processor: 64 CPUs → 2x AMD EPYC 7302 (16 cores, 2 CPUs/core)
    • RAM: 4 TB
    • Storage: 4 TB 
    • Network: HDR Infiniband
    • GPUs: None
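The partition and node details above can be cross-checked with standard SLURM queries:

```shell
# Node counts, CPUs, memory, GRES, and state per partition
sinfo -p commons,highmem -o "%P %D %c %m %G %T"

# Detailed hardware and GRES information for a single node,
# e.g. one of the MI100 nodes
scontrol show node gn20
```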

Singularity Container

For submitting OpenMM jobs, a Singularity container with OpenMM pre-installed is available in:

Code Block (bash): Singularity container for OpenMM
container=/home/pcw2/bin/openmm-ctbp.sif

This container does not include OpenSMOG, which needs to be installed with pip3.

Code Block
pip3 install OpenSMOG

If you cannot find a container with suitable options available, you may contact Prof. Whitford, or you may be able to use pip3.
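As a hedged sketch of running through the container (run_code.py is a placeholder for your own script; the --rocm flag is Singularity's standard option for exposing AMD GPUs inside a container):

```shell
container=/home/pcw2/bin/openmm-ctbp.sif

# Install OpenSMOG into your home directory, which is bind-mounted
# into the container by default, so it is visible at run time.
singularity exec $container pip3 install --user OpenSMOG

# Run a script with AMD GPU access from inside the container.
singularity exec --rocm $container python3 run_code.py
```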

A usage example, including a bash submission script, an OpenMM Python run script, and input files, can be downloaded below.

...

For submitting jobs, another possibility is to use the Launcher_GPU module and bind each run/task to one GPU. For this, remember to set --cpus-per-task to the total number of CPUs divided by the number of --ntasks-per-node. The example below runs 8 jobs in parallel, as required when using Aries: the job simultaneously uses all 8 GPUs on the node by running 8 simulations, each on a single GPU. Here, OpenMM is loaded as a module.


Code Block (bash): Job submission script Launcher_GPU
#!/bin/bash -l

#SBATCH --job-name=my_job
#SBATCH --account=commons
#SBATCH --partition=commons
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --threads-per-core=2
#SBATCH --mem-per-cpu=3G
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00
#SBATCH --export=ALL

echo "Submitting simulations..."

module purge

# Using Launcher_GPU on ARIES; OpenMM is loaded as a module
module load GCC/10.2.0 OpenMPI/4.0.5 OpenMM/7.5.0 foss/2020b Launcher_GPU

# This is for controlling Launcher
export LAUNCHER_WORKDIR=`pwd`
export LAUNCHER_JOB_FILE=$PWD/launcher_jobs_sim
export LAUNCHER_BIND=1

echo "Job started on " `date`
echo "Running on hostname" `hostname`
echo "Job $SLURM_JOB_ID is running on: $SLURM_NODELIST"
echo "Job SLURM_SUBMIT_DIR is $SLURM_SUBMIT_DIR"
echo "Running on $SLURM_NNODES nodes"
echo "Running on $SLURM_NPROCS processors"
echo "CPUS per task is $SLURM_CPUS_PER_TASK"
echo "LAUNCHER_WORKDIR: $LAUNCHER_WORKDIR"
# This will adjust the total number of runs to nodes*8
max_replicas=$((SLURM_NNODES*8))
echo "Number of replicas is $max_replicas"
df -h

rm $LAUNCHER_WORKDIR/launcher_jobs_sim &> /dev/null

# Create Launcher_job_file needed by $LAUNCHER_DIR/paramrun
for i in `seq 1 $max_replicas`
do
	echo "python run_code.py input_$i output_${i}.log" >> $LAUNCHER_WORKDIR/launcher_jobs_sim
done

# This line launches the jobs in the launcher_jobs_sim file
$LAUNCHER_DIR/paramrun

echo "My job finished at:" `date`

...