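Aries is a GPU cluster built on AMD MI50 GPUs (courtesy of AMD). It contains 19 MI50 GPU nodes (gn01-gn19). Each of these nodes contains 8 AMD MI50 GPUs with 32 GB of memory each, for 152 MI50 GPUs in total; counting the three MI100 nodes described under Partition commons, the cluster has 176 GPUs.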
Jobs are allocated by node, not by GPU. Since there are 8 GPUs per node, each submitted job must be able to use 8 GPUs. Remember to set the --gres=gpu option properly in your script; it is not set automatically. Below, we provide examples of how to use this resource effectively.
Since each job submission on Aries runs exclusively on an entire GPU node, the user must launch 8 parallel GPU processes. Ideally, all 8 parallel runs should have similar run times to maximize efficiency. Make sure the 8 parallel runs write to differently named output files.
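The pattern described above can be sketched as a loop that starts one background process per GPU and then waits for all of them. This is a minimal sketch, not the cluster's prescribed method: `run_one` is a hypothetical stand-in for your real command, and pinning each process to one AMD GPU through `ROCR_VISIBLE_DEVICES` is an assumption about how your application selects its device.

```shell
#!/bin/bash
NUM_GPUS=8   # GPUs per Aries node

# Hypothetical stand-in for the real workload; replace the body with your command.
run_one() {
    local gpu_id="$1"
    echo "run ${gpu_id} uses GPU ${ROCR_VISIBLE_DEVICES}"
}

for gpu_id in $(seq 0 $((NUM_GPUS - 1))); do
    (
        # Assumption: the application honors ROCR_VISIBLE_DEVICES (ROCm runtime).
        export ROCR_VISIBLE_DEVICES="${gpu_id}"
        run_one "${gpu_id}"
    ) > "output_${gpu_id}.log" 2>&1 &   # one distinct output file per run
done

wait   # return only after all 8 background runs have finished
echo "all ${NUM_GPUS} runs finished"
```

Each run writes to its own `output_<gpu>.log`, so no two processes overwrite each other's results.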
All Aries users should have access to the data storage directory /work/cms16/. Create a sub-directory within /work/cms16/ named after your username:
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
mkdir /work/cms16/$USER |
We recommend storing all generated data in your directory under /work/cms16/ (/work/cms16/username) rather than in /home. Your /home directory can hold about 20 GB, while the /work partition has a total of 100 TB.
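A run directory under /work can be created at the start of a job so that generated data never lands in /home. The helper below is a sketch: the function name `make_run_dir` and the project name `my_project` are hypothetical, and the demonstration uses a temporary directory so it can be tried anywhere. Only the /work/cms16/ root comes from the text above.

```shell
# make_run_dir ROOT NAME
# Create ROOT/<username>/NAME if it does not exist and print the resulting path.
make_run_dir() {
    local root="$1"
    local name="$2"
    local user="${USER:-$(id -un)}"
    local dir="${root}/${user}/${name}"
    mkdir -p "${dir}" && echo "${dir}"
}

# Demonstration in a scratch location; on Aries, ROOT would be /work/cms16.
scratch=$(mktemp -d)
run_dir=$(make_run_dir "${scratch}" my_project)   # "my_project" is a placeholder
echo "writing output to ${run_dir}"
```

To see how much of the roughly 20 GB home allowance is already in use, `du -sh "$HOME"` prints the total size of your home directory.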
Partition commons
This partition contains 22 GPU nodes (gn01-gn22). Each node contains 8 AMD Vega20 GPUs, 96 CPUs and 512 GB of memory, except for the last three nodes (gn20-gn22), each of which has 128 CPUs and 8 AMD MI100 GPUs.
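Because the last three nodes differ from the rest, a job script that needs to adapt to the hardware can branch on the node name. The function below only encodes the node ranges stated above. Its name is hypothetical, and the MI50 label for gn01-gn19 follows the cluster description at the top of this page.

```shell
# node_hardware NODE: print "<gpu model> <cpu count>" for an Aries GPU node name.
node_hardware() {
    case "$1" in
        gn0[1-9]|gn1[0-9]) echo "MI50 96" ;;    # gn01-gn19
        gn2[0-2])          echo "MI100 128" ;;  # gn20-gn22
        *)                 echo "unknown node: $1" >&2; return 1 ;;
    esac
}

# Example lookup; inside a job you could pass "$(hostname -s)" instead.
node_hardware gn21
```

Names outside gn01-gn22, such as the highmem nodes, make the function return a non-zero status, so a script can fall back to a default.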
Partition highmem
This partition contains 2 CPU nodes (hm01-02) with each node having 64 CPUs and about 4 TB of RAM. These nodes don't have GPUs.
Singularity Container
For submitting OpenMM jobs, a Singularity container with OpenMM pre-installed is available.
...
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
#!/bin/bash -l
#SBATCH --job-name=ctbpexample
#SBATCH --nodes=1
#SBATCH --cpus-per-task=96     # set to 96 if not using MPI (OpenMM does not use MPI)
#SBATCH --tasks-per-node=1
#SBATCH --export=ALL
#SBATCH --mem=0                # each GPU assigned 32 GB by default
#SBATCH --gres=gpu:8
#SBATCH --time=1-00:00:00      # max run time is 1 day |
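The header above requests a whole node, so the body of the script still has to start one run per GPU inside the container. The sketch below prints the commands it would run instead of executing them, which makes it safe to try outside the cluster. The image path, `run_code.py`, and the input and output names are placeholders; `--rocm` is Singularity's option for exposing AMD GPUs to a container, and pinning with `ROCR_VISIBLE_DEVICES` is an assumption about the workload.

```shell
CONTAINER=/path/to/openmm.sif   # placeholder: use the image path provided for Aries
DRY_RUN=${DRY_RUN:-1}           # 1 = only print commands, 0 = execute them

# launch_on_gpu GPU CMD...: run CMD inside the container, pinned to one GPU.
launch_on_gpu() {
    local gpu="$1"; shift
    if [ "${DRY_RUN}" = "1" ]; then
        echo "ROCR_VISIBLE_DEVICES=${gpu} singularity exec --rocm ${CONTAINER} $* > output_${gpu}.log"
    else
        # Background the run so all 8 GPUs work at the same time.
        ROCR_VISIBLE_DEVICES="${gpu}" singularity exec --rocm "${CONTAINER}" "$@" \
            > "output_${gpu}.log" 2>&1 &
    fi
}

for i in 0 1 2 3 4 5 6 7; do
    launch_on_gpu "${i}" python run_code.py "input_${i}"
done
wait   # no-op in dry-run mode; otherwise waits for all 8 runs
```

Setting `DRY_RUN=0` in the job script switches from printing to launching the runs.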
Launcher_GPU
Another way to submit jobs is to use the Launcher_GPU module and bind each run/task to one GPU. For this, remember to set '--cpus-per-task' to the total number of CPUs divided by the value of '--ntasks-per-node'. The example below runs 8 jobs in parallel. In this example, OpenMM was installed locally in a conda environment.
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
#!/bin/bash -l
#SBATCH --job-name=my_job
#SBATCH --account=commons
#SBATCH --partition=commons
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --threads-per-core=2
#SBATCH --mem-per-cpu=3G
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00
#SBATCH --export=ALL
echo "Submitting simulations..."
module purge
# Using Launcher_GPU on ARIES
module load foss/2020b Launcher_GPU OpenMPI
# OpenMM loaded via conda
echo "Initiating conda environment:"
source $HOME/anaconda3/bin/activate
conda activate openmm
# These variables control Launcher
export LAUNCHER_WORKDIR=`pwd`
export LAUNCHER_JOB_FILE=$PWD/launcher_jobs_sim
export LAUNCHER_BIND=1
echo "Job started on " `date`
echo "Running on hostname" `hostname`
echo "Job $SLURM_JOB_ID is running on: $SLURM_NODELIST"
echo "Job SLURM_SUBMIT_DIR is $SLURM_SUBMIT_DIR"
echo "Running on $SLURM_NNODES nodes"
echo "Running on $SLURM_NPROCS processors"
echo "CPUS per task is $SLURM_CPUS_PER_TASK"
echo "LAUNCHER_WORKDIR: $LAUNCHER_WORKDIR"
df -h
# This will adjust the total number of runs to nodes*8
# (defined before it is printed, so the echo below shows the actual value)
max_replicas=$((SLURM_NNODES*8))
echo "Number of replicas is $max_replicas"
rm $LAUNCHER_WORKDIR/launcher_jobs_sim &> /dev/null
# Create Launcher_job_file needed by $LAUNCHER_DIR/paramrun
for i in `seq 1 $max_replicas`
do
echo "python run_code.py input_$i output_${i}.log" >> $LAUNCHER_WORKDIR/launcher_jobs_sim
done
# This line launches the jobs in the launcher_jobs_sim file
$LAUNCHER_DIR/paramrun
echo "My job finished at:" `date` |
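Before submitting, it can help to generate the job file by hand and confirm that it holds one line per GPU. The sketch below repeats the loop from the script as a function; `write_job_file` is a hypothetical name, and `run_code.py`, `input_$i` and `output_${i}.log` are the same placeholders used in the script.

```shell
# write_job_file NODES FILE: write 8 command lines per node into FILE.
write_job_file() {
    local nodes="$1"
    local job_file="$2"
    local max_replicas=$((nodes * 8))
    : > "${job_file}"                      # start from an empty file
    local i=1
    while [ "${i}" -le "${max_replicas}" ]; do
        echo "python run_code.py input_${i} output_${i}.log" >> "${job_file}"
        i=$((i + 1))
    done
}

write_job_file 1 launcher_jobs_sim
wc -l < launcher_jobs_sim      # number of tasks Launcher will run
head -n 2 launcher_jobs_sim    # first two task lines
```

With one node the file should contain 8 lines; with `--nodes=2` the same function would write 16.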
Acknowledgments
Please acknowledge the use of this cluster in any publication that relied on it:
...