Configurations for running CMAQ on a Single VM or ParallelCluster#
Note
The tutorials presented here require an AWS account, which requires a credit card. If you are diligent in terminating the resources created in this tutorial after you run the benchmark, the cost should be less than $15. Performance and Cost Optimization tables are provided in Chapter 5.
Sign up for an Amazon Web Services (AWS) Account#
From the Amazon Web Services website, http://aws.amazon.com, click “Create an AWS account” in the upper right corner. Once you have an account, the same button will read “Sign in to the Console”.
We highly recommend that users set up a spending alarm to help manage costs. You can configure an alarm to receive an email alert if you exceed a specific dollar amount, e.g., $100 per month.
See also
See the AWS Tutorial on setting up an alarm for AWS Free Tier: AWS Free Tier Budgets
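As a concrete sketch of such an alert, the AWS Budgets CLI can create a monthly cost budget with an email notification. The account ID, email address, dollar amount, and file names below are placeholders, and the console workflow linked above is the simpler route:

```bash
# Sketch: a $100/month cost budget that emails when actual spend passes 80% of the limit
# (account ID and email address are placeholders)
cat > budget.json <<'EOF'
{
  "BudgetName": "cmaq-monthly-budget",
  "BudgetLimit": { "Amount": "100", "Unit": "USD" },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}
EOF
cat > notifications.json <<'EOF'
[
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80
    },
    "Subscribers": [ { "SubscriptionType": "EMAIL", "Address": "you@example.com" } ]
  }
]
EOF
aws budgets create-budget \
    --account-id 111111111111 \
    --budget file://budget.json \
    --notifications-with-subscribers file://notifications.json
```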
Software Requirements for CMAQ on AWS Single VM or ParallelCluster#
Tier 1: Native OS and associated system libraries, compilers
Operating System: Ubuntu 20.04 (ubuntu2004)
Tcsh shell
Git
Compilers (C, C++, and Fortran) - GNU compilers version ≥ 8.3
MPI (Message Passing Interface) - OpenMPI ≥ 4.0
Slurm Scheduler
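On a standalone single VM, the Tier 1 packages can typically be installed with apt; this is a minimal sketch assuming the stock Ubuntu 20.04 repositories (ParallelCluster AMIs already ship with Slurm, compilers, and MPI, so this step is only needed for the single-VM path):

```bash
# Install Tier 1 system packages on Ubuntu 20.04 (package names assumed from the standard repos)
sudo apt-get update
sudo apt-get install -y tcsh git build-essential gfortran \
    openmpi-bin libopenmpi-dev
# Slurm is only needed on a standalone VM; ParallelCluster installs and configures it for you
sudo apt-get install -y slurm-wlm
# Verify versions (GNU compilers >= 8.3 and OpenMPI >= 4.0 are required)
gfortran --version
mpirun --version
```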
Tier 2: additional libraries required for installing CMAQ
NetCDF (with C, C++, and Fortran support)
I/O API
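As a rough sketch of the Tier 2 builds (the CMAS install scripts automate this; the versions, tarball names, and install paths below are illustrative assumptions), netCDF-C and netCDF-Fortran are built with the usual configure/make sequence and I/O API is then built against them:

```bash
# Illustrative Tier 2 build with gfortran; versions and paths are assumptions
export NCDIR=$HOME/cmaq_libs/netcdf
# netCDF-C (classic netCDF-3 is sufficient for CMAQ, so DAP and netCDF-4 can be disabled)
tar xzf netcdf-c-4.9.2.tar.gz && cd netcdf-c-4.9.2
./configure --prefix=$NCDIR --disable-dap --disable-netcdf-4
make -j 4 && make install && cd ..
# netCDF-Fortran, built against the C library installed above
tar xzf netcdf-fortran-4.6.1.tar.gz && cd netcdf-fortran-4.6.1
CPPFLAGS="-I$NCDIR/include" LDFLAGS="-L$NCDIR/lib" ./configure --prefix=$NCDIR
make -j 4 && make install && cd ..
# I/O API: clone and build per its README, pointing the Makeinclude at $NCDIR
git clone https://github.com/cjcoats/ioapi-3.2
```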
Tier 3: Software developed by EPA and distributed through the CMAS Center
CMAQv5.4+
CMAQv5.4+ Post Processors
R code for QAing model output
Software on Local Computer
AWS ParallelCluster CLI v3.0 installed in a virtual environment
Text editors for editing scripts and configuration files (e.g., vi, nedit)
Git
Mac - XQuartz for X11 Display
Windows - MobaXterm - to connect to ParallelCluster IP address
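The ParallelCluster CLI item above can be satisfied with a Python virtual environment on the local machine; a minimal sketch (the environment name is arbitrary, and Node.js is also required by pcluster v3 for AWS CDK):

```bash
# Create and activate a virtual environment for the ParallelCluster CLI
python3 -m venv ~/apc-ve
source ~/apc-ve/bin/activate
# Install the AWS ParallelCluster v3 CLI and confirm it runs
python3 -m pip install --upgrade pip
python3 -m pip install aws-parallelcluster
pcluster version
```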
Note
When working on the AWS Cloud you will need to select a Region for your workloads. (See AWS blog on What to consider when selecting a region). The scripts used in this tutorial use the us-east-1 region, but they can be modified to use any of the supported regions listed here: CLI v3 Supported Regions
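For example, assuming the AWS CLI is already configured with your credentials, the default region can be set and checked as follows (us-east-1 matches the tutorial scripts but is only an example):

```bash
# Set and verify the default region used by the AWS CLI and pcluster
aws configure set region us-east-1
aws configure get region
```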
Single VM Configuration for 12US1 Benchmark Domain#
c6a.48xlarge (96 CPUs/node with multithreading disabled) with 384 GiB memory, 50 Gbps network bandwidth, 40 Gbps EBS bandwidth, Elastic Fabric Adapter (EFA), and the Nitro Hypervisor (available in any region)
| Name | vCPUs | Cores | Memory (GiB) | Network Bandwidth (Gbps) | EBS Throughput (Gbps) |
| --- | --- | --- | --- | --- | --- |
| c6a.2xlarge | 8 | 4 | 16 | Up to 12.5 | Up to 6.6 |
| c6a.8xlarge | 32 | 16 | 64 | 12.5 | 6.6 |
| c6a.48xlarge | 192 | 96 | 384 | 50 | 40 |
or
hpc6a.48xlarge (96 cores/node) with 384 GiB memory (4 GiB per core), using two 48-core 3rd-generation AMD EPYC 7003 series processors built on a 7 nm process for increased efficiency, Elastic Fabric Adapter (EFA), and the Nitro Hypervisor. It is lower cost than c6a.48xlarge but is only available in the us-east-2 region.
| Name | Cores | Memory (GiB) | EFA Network Bandwidth (Gbps) | Network Bandwidth (Gbps) |
| --- | --- | --- | --- | --- |
| hpc6a.48xlarge | 96 | 384 | 100 | 25 |
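As an illustrative sketch only (the tutorial's own scripts handle instance creation, and every ID below is a placeholder that must be replaced), a single VM of either type can be launched with the EC2 CLI:

```bash
# Launch a c6a.48xlarge single VM; AMI, key pair, security group, and subnet IDs are placeholders
aws ec2 run-instances \
    --region us-east-1 \
    --image-id ami-0123456789abcdef0 \
    --instance-type c6a.48xlarge \
    --key-name your-keypair \
    --security-group-ids sg-0123456789abcdef0 \
    --subnet-id subnet-0123456789abcdef0 \
    --ebs-optimized \
    --count 1
```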
ParallelCluster Configuration for 12US1 Benchmark Domain#
Note
It is recommended to use a head node that is in the same family as the compute nodes so that the compiler options and executable are optimized for that processor type.
There are two recommended configurations of the ParallelCluster HPC head node and compute nodes for running the two-day CMAQ CONUS benchmark:
First configuration:
Head node:
c6a.xlarge
(the head node should match the processor family of the compute nodes)
Compute Node:
hpc6a.48xlarge (96 cores/node) with 384 GiB memory (4 GiB per core), using two 48-core 3rd-generation AMD EPYC 7003 series processors built on a 7 nm process for increased efficiency, Elastic Fabric Adapter (EFA), and the Nitro Hypervisor. It is lower cost than c6a.48xlarge but is only available in the us-east-2 region.
or (a more costly option, but available in all regions)
c6a.48xlarge (96 CPUs/node with multithreading disabled) with 384 GiB memory, 50 Gbps network bandwidth, 40 Gbps EBS bandwidth, Elastic Fabric Adapter (EFA), and the Nitro Hypervisor
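A minimal sketch of what the corresponding ParallelCluster v3 configuration might look like for this first configuration (subnet IDs, key name, queue name, and node counts are placeholders; use the tutorial's own YAML files for the actual benchmark):

```bash
# Write a minimal cluster configuration (placeholder subnet/key values) and create the cluster
cat > cmaq-cluster.yaml <<'EOF'
Region: us-east-2
Image:
  Os: ubuntu2004
HeadNode:
  InstanceType: c6a.xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0
  Ssh:
    KeyName: your-keypair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      ComputeResources:
        - Name: compute
          InstanceType: hpc6a.48xlarge
          MinCount: 0
          MaxCount: 2
          Efa:
            Enabled: true
      Networking:
        SubnetId: subnet-0123456789abcdef0
EOF
pcluster create-cluster --cluster-name cmaq --cluster-configuration cmaq-cluster.yaml
```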
Note
CMAQ is developed using OpenMPI and can take advantage of an increased number of CPUs and memory. ParallelCluster provides a ready-made auto-scaling solution.
Figure 1. AWS Recommended ParallelCluster Configuration (the number of compute nodes depends on the settings for NPCOL x NPROW and #SBATCH --nodes=XX, #SBATCH --ntasks-per-node=YY)
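The relationship in the figure caption can be made concrete: the total MPI rank count requested from Slurm must equal NPCOL x NPROW in the CMAQ run script. A hedged sketch of the Slurm header for two hpc6a.48xlarge compute nodes (values are illustrative):

```bash
#!/bin/bash
# Example Slurm header for 2 x hpc6a.48xlarge compute nodes
#SBATCH --nodes=2                 # XX: number of compute nodes
#SBATCH --ntasks-per-node=96      # YY: MPI ranks per node
#SBATCH --exclusive
#SBATCH --job-name=CMAQ_12US1
# In the CMAQ run script, NPCOL x NPROW must equal nodes x ntasks-per-node:
#   2 nodes x 96 ranks/node = 192 = NPCOL(12) x NPROW(16)
```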
Second configuration (least expensive; see the chapter on Cost and Performance Optimization):
Head node:
c7g.large
(the head node should match the processor family of the compute nodes)
Compute Node:
hpc7g.16xlarge (64 cores/node) with 128 GiB memory (2 GiB per core), using Arm-based custom Graviton3E processors with Double Data Rate 5 (DDR5) memory, which offers 50% more bandwidth than DDR4, Elastic Fabric Adapter (EFA), and the Nitro Hypervisor. It is only available in the us-east-1 region.
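The first-configuration YAML sketch carries over here with the region and instance types swapped; ParallelCluster should select an arm64 Ubuntu 20.04 image automatically for Graviton instance types (placeholder IDs as before):

```bash
# Second-configuration sketch: Graviton head and compute nodes in us-east-1 (placeholder IDs)
cat > cmaq-graviton-cluster.yaml <<'EOF'
Region: us-east-1
Image:
  Os: ubuntu2004
HeadNode:
  InstanceType: c7g.large
  Networking:
    SubnetId: subnet-0123456789abcdef0
  Ssh:
    KeyName: your-keypair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      ComputeResources:
        - Name: compute
          InstanceType: hpc7g.16xlarge
          MinCount: 0
          MaxCount: 2
          Efa:
            Enabled: true
      Networking:
        SubnetId: subnet-0123456789abcdef0
EOF
pcluster create-cluster --cluster-name cmaq-graviton --cluster-configuration cmaq-graviton-cluster.yaml
```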