...

Write all data to your directory in /work/cms16/username, not to your /home. Your /home directory can hold around 20 GB, while the /work partition has a total capacity of 100 TB. Regularly remove your data from /work to keep the resource stable: /work is not backed up!
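For example, you can set up and monitor your personal /work directory as follows (a minimal sketch; $USER expands to your username, and du/df are standard tools):

    mkdir -p /work/cms16/$USER     # create your personal directory on the work partition
    du -sh /work/cms16/$USER       # check how much space your data is using
    df -h /work                    # check the overall usage of the /work partition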

Out of consideration for your colleagues also using the Aries cluster, please limit the wall time to 12 hours if you are going to have more than 10 jobs (80 GPUs) in the queue at a time. This will help other users get their jobs through as well.
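A 12-hour wall time can be requested directly in the batch script (standard Slurm syntax):

    #SBATCH --time=12:00:00    # limit the job to 12 hours of wall time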

One way to regulate the number of jobs running simultaneously is to add dependencies between jobs, for example with #SBATCH --dependency=afterany:<previous_job_id>.
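A minimal sketch of such a chain, submitted from the command line (standard Slurm; job.sh is a hypothetical batch script name):

    jobid=$(sbatch --parsable job.sh)                               # submit the first job and capture its id
    jobid=$(sbatch --parsable --dependency=afterany:$jobid job.sh)  # this job starts only after the previous one ends
    jobid=$(sbatch --parsable --dependency=afterany:$jobid job.sh)  # repeat for further jobs in the chain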

Another way to regulate the number of jobs running simultaneously is to use a job array with a throttle: appending %N to the array range limits how many array tasks run at the same time. For example, #SBATCH --array=0-39%10 submits 40 jobs but runs only 10 at a time.
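A minimal job-array sketch (the array line matches the example above; the job name, program, and input files are hypothetical):

    #!/bin/bash
    #SBATCH --job-name=array_example
    #SBATCH --array=0-39%10          # 40 tasks, at most 10 running simultaneously
    #SBATCH --time=12:00:00

    # each task receives its own index in SLURM_ARRAY_TASK_ID
    echo "Running task ${SLURM_ARRAY_TASK_ID} on $(hostname)"
    ./my_program --input input_${SLURM_ARRAY_TASK_ID}.dat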

Important note: set up SSH keys for passwordless communication between the nodes, as explained in the PDF at the end of this page (ARIES_Quick_Start_wl52_20220406.pdf).
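A minimal sketch of what this typically involves, assuming a shared /home and standard OpenSSH (the attached PDF remains the authoritative reference):

    ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519     # create a key without a passphrase
    cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys  # authorize the key for logins between nodes
    chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys # ssh refuses keys with loose permissions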

...

  • 19x MI50 Nodes (gn01-gn19):
    • Processor: 96 CPUs → 1x AMD EPYC 7642 (48 cores, 2 CPUs/core)
    • RAM: 512 GB
    • Storage: 2 TB 
    • Network: HDR Infiniband
    • GPUs: 8x AMD Radeon Instinct MI50 32GB
  • 3x MI100 Nodes (gn20-gn22):
    • Processor: 128 CPUs → 2x AMD EPYC 7V13 (64 cores each, 1 CPU/core)
    • RAM: 512 GB
    • Storage: 2 TB 
    • Network: HDR Infiniband
    • GPUs: 8x AMD Radeon Instinct MI100 32GB

...

The highmem partition contains 2 CPU-only nodes, each with 64 CPUs and about 4 TB of RAM. These nodes have no GPUs.

  • 2x Large Memory Nodes (hm01-02):
    • Processor: 64 CPUs → 2x AMD EPYC 7302 (16 cores each, 2 CPUs/core)
    • RAM: 4 TB
    • Storage: 4 TB 
    • Network: HDR Infiniband
    • GPUs: None
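To run on these nodes, request the highmem partition explicitly in the batch script (a minimal sketch; the CPU and memory values are placeholders to adjust for your job):

    #SBATCH --partition=highmem    # run on the large-memory CPU nodes (hm01-hm02)
    #SBATCH --cpus-per-task=64     # example: use all 64 CPUs of a node
    #SBATCH --mem=1000G            # example memory request; adjust as needed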

...