CMAQv5.3.3 on AWS Tutorials (Single VM and ParallelCluster)#

Scripts and code to run CMAQ on a Single Virtual Machine or a ParallelCluster (multiple VMs).

To obtain this code use the following command.#

git clone -b CMAQv5.3.3 https://github.com/CMASCenter/pcluster-cmaq pcluster-cmaq-533

Warning

This documentation is under continuous development; the latest version is available here: CMAQ on AWS Tutorials Latest Version

Overview#

This document provides tutorials and information on how users can create High Performance Computing resources (a Single Virtual Machine (VM) or a ParallelCluster) on Amazon Web Services (AWS) using the AWS Command Line Interface. The tutorials are aimed at users with cloud computing experience who are already familiar with AWS. For those with no cloud computing experience, we recommend reviewing the Additional Resources listed in chapter 16 of this document.

Format of this documentation#

This document provides several hands-on tutorials that are designed to be read in order.


The Introductory Tutorial will walk you through creating a demo ParallelCluster. You will learn how to set up your AWS Identity and Access Management Roles, configure and create a demo cluster, and exit and delete the cluster.

Single VM Tutorials#

The Single VM Intermediate Tutorial will show you how to create a single virtual machine using an AMI that has the software and data pre-loaded, and gives instructions for creating the virtual machine using EC2 instances that have different numbers of cores and are matched to the benchmark domain. The Single VM Advanced Tutorial will show you how to install the CMAQv5.3.3 software and libraries, and how to create custom environment modules.

Parallel Cluster Tutorials#

The CMAQv5.3.3 Parallel Cluster Intermediate chapter will show you how to run CMAQv5.3.3 using the 12US2 benchmark. The CMAQv5.3.3 Advanced Tutorial explains how to scale the ParallelCluster for larger compute jobs and install CMAQv5.3.3 and required libraries from scratch on the cloud. The Chapter “Benchmark on HPC6a-48xlarge with EBS and Lustre” uses CMAQv5.3.3 on advanced HPC6a compute nodes that are only available in the us-east-2 region.


The remaining sections provide instructions on post-processing CMAQ output, comparing output and runtimes from multiple simulations, and copying output from ParallelCluster to an AWS Simple Storage Service (S3) bucket.

Why might I need to use ParallelCluster?#

The AWS ParallelCluster may be configured to be the equivalent of a High Performance Computing (HPC) environment, including using job schedulers such as Slurm, running on multiple nodes using code compiled with Message Passing Interface (MPI), and reading and writing output to a high performance, low latency shared disk. The advantage of using the AWS ParallelCluster command line interface is that the compute nodes can be easily scaled up or down to match the compute requirements of a given simulation. In addition, the user can reduce costs by using Spot instances rather than On-Demand for the compute nodes. ParallelCluster also supports submitting multiple jobs to the job submission queue.
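As a preview, a minimal sketch of the ParallelCluster v3 CLI lifecycle is shown below; the cluster name, key file, and configuration file are placeholders, and the full workflow is covered in the Introductory Tutorial.

pcluster create-cluster --cluster-name cmaq-demo --cluster-configuration cluster-config.yaml
pcluster describe-cluster --cluster-name cmaq-demo        # wait for CREATE_COMPLETE
pcluster ssh --cluster-name cmaq-demo -i ~/your-pem.pem   # log in to the head node
pcluster delete-cluster --cluster-name cmaq-demo          # delete when done to stop charges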

Our goal is to make this user guide to running CMAQ on a ParallelCluster as helpful and user-friendly as possible. Any feedback is both welcome and appreciated.

Additional information on AWS ParallelCluster:

AWS ParallelCluster documentation

AWS ParallelCluster training video

System Requirements#

Description of the head node and compute nodes used for the ParallelCluster

Configurations for running CMAQv5.3.3 on AWS ParallelCluster#

We recommend that users set up a spending alarm using AWS#

Configure an alarm to receive an email alert if you exceed $100 per month (or whatever monthly spending limit you need).

See also

See the AWS Tutorial on setting up an alarm for AWS Free Tier. AWS Free Tier Budgets
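If you prefer the command line, a hedged sketch of creating a $100/month cost budget with an email alert is shown below; the account ID and email address are placeholders, and the linked tutorial uses the AWS console instead.

aws budgets create-budget \
    --account-id 123456789012 \
    --budget '{"BudgetName":"monthly-100","BudgetLimit":{"Amount":"100","Unit":"USD"},"TimeUnit":"MONTHLY","BudgetType":"COST"}' \
    --notifications-with-subscribers '[{"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80,"ThresholdType":"PERCENTAGE"},"Subscribers":[{"SubscriptionType":"EMAIL","Address":"you@example.com"}]}]'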

Software Requirements for CMAQ on AWS ParallelCluster#

Tier 1: Native OS and associated system libraries, compilers

  • Operating System: Ubuntu2004

  • Tcsh shell

  • Git

  • Compilers (C, C++, and Fortran) - GNU compilers version ≥ 8.3

  • MPI (Message Passing Interface) - OpenMPI ≥ 4.0

  • Slurm Scheduler

Tier 2: additional libraries required for installing CMAQ

  • NetCDF (with C, C++, and Fortran support)

  • I/O API

  • R Software and packages

Tier 3: Software distributed through the CMAS Center

  • CMAQv533

  • CMAQv533 Post Processors

Tier 4: R packages and Scripts

  • R QA Scripts

Software on Local Computer

  • AWS CLI and AWS ParallelCluster CLI v3.0 installed in a virtual environment (see the installation sketch after this list)

  • pcluster is the primary AWS ParallelCluster CLI command. You use pcluster to launch and manage HPC clusters in the AWS Cloud and to create and manage custom AMI images

  • run-instances is another AWS Command Line method, described in chapter 6, for creating a single virtual machine to run CMAQ.

  • Edit YAML configuration files using vi, nedit, or another editor (YAML does not accept tabs for indentation)

  • Git

  • Mac - XQuartz for X11 Display

  • Windows - MobaXterm - to connect to ParallelCluster IP address
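A minimal sketch of installing the CLI tools into a Python virtual environment on your local computer (the virtual environment path is a placeholder):

python3 -m venv ~/apc-ve
source ~/apc-ve/bin/activate
pip install --upgrade aws-parallelcluster awscli
pcluster version    # confirm the ParallelCluster CLI is on your PATH
aws --version       # confirm the AWS CLI is available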

AWS ParallelCluster CLI v3.0 Region Availability#

Note

The scripts in this tutorial use the us-east-1 region, but the scripts can be modified to use any of the supported regions listed in the url below. CLI v3 Supported Regions

CONUS 12US2 Domain Description#

GRIDDESC
'12US2'
'12CONUS'     -2412000.0 -1620000.0 12000.0 12000.0 396 246 1

CMAQ 12US2 Domain

Single VM Configuration for CMAQv5.3.2_Benchmark_2Day_Input.tar.gz Benchmark#

  • c6a.2xlarge

ParallelCluster Configuration for 12US2 Benchmark Domain#

Note

It is recommended to use a head node that is in the same family as the compute nodes so that the compiler options and executable are optimized for that processor type.

Recommended configuration of the ParallelCluster HPC head node and compute nodes to run the CMAQ CONUS benchmark for two days:

Head node:

  • c5n.large

or

  • c6a.xlarge

(note that the head node should match the processor family of the compute nodes)

Compute Node:

  • c5n.9xlarge (18 cpus/node with Multithreading disabled) with 96 GiB memory, 50 Gbps Network Bandwidth, 9,500 EBS Bandwidth (Mbps) and Elastic Fabric Adapter (EFA)

or

  • c5n.18xlarge (36 cpus/node with Multithreading disabled) with 192 GiB memory, 100 Gbps Network Bandwidth, 19,000 EBS Bandwidth (Mbps) and Elastic Fabric Adapter (EFA)

or

  • c6a.48xlarge (96 cpus/node with Multithreading disabled) with 384 GiB memory, 50 Gigabit Network Bandwidth, 40 EBS Bandwidth (Gbps), Elastic Fabric Adapter (EFA) and Nitro Hypervisor

or

  • hpc6a.48xlarge (96 cpus/node), only available in the us-east-2 region, with 384 GiB memory, using two 48-core 3rd generation AMD EPYC 7003 series processors built on a 7nm process node for increased efficiency, for a total of 96 cores (4 GiB of memory per core), Elastic Fabric Adapter (EFA) and Nitro Hypervisor (lower cost than c6a.48xlarge)

HPC6a EC2 Instance

Note

CMAQ is parallelized with MPI (built here with OpenMPI) and can take advantage of additional CPUs and memory. ParallelCluster provides a ready-made auto scaling solution.

Note

An additional best practice is to allow ParallelCluster to create a placement group. Network Performance Placement Groups

This is specified in the yaml file in the slurm queue’s network settings.

Networking:
  PlacementGroup:
    Enabled: true
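For context, a hedged sketch of where this setting sits within a ParallelCluster v3 cluster configuration is shown below; the queue name, subnet ID, instance type, and node counts are placeholders rather than values from this tutorial.

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0   # placeholder: your private subnet ID
        PlacementGroup:
          Enabled: true
      ComputeResources:
        - Name: compute
          InstanceType: c5n.18xlarge
          MinCount: 0
          MaxCount: 10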

Note

To provide the lowest latency and the highest packet-per-second network performance for your placement group, choose an instance type that supports enhanced networking. For more information, see Enhanced Networking. Enhanced Networking (ENA)

You can use iPerf to measure the network bandwidth between instances.

Iperf
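For example (a sketch; the private IP address is a placeholder), run iperf3 in server mode on one instance and in client mode on another:

iperf3 -s                        # on the first instance (server)
iperf3 -c 10.0.0.11 -P 4 -t 30   # on the second instance: 4 parallel streams for 30 seconds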

Note

Elastic Fabric Adapter(EFA) “EFA provides lower and more consistent latency and higher throughput than the TCP transport traditionally used in cloud-based HPC systems. It enhances the performance of inter-instance communication that is critical for scaling HPC and machine learning applications. It is optimized to work on the existing AWS network infrastructure and it can scale depending on application requirements.” “An EFA is an Elastic Network Adapter (ENA) with added capabilities. It provides all of the functionality of an ENA, with an additional OS-bypass functionality. OS-bypass is an access model that allows HPC and machine learning applications to communicate directly with the network interface hardware to provide low-latency, reliable transport functionality.” Elastic Fabric Adapter(EFA)

Note

Nitro Hypervisor “AWS Nitro System is composed of three main components: Nitro cards, the Nitro security chip, and the Nitro hypervisor. Nitro cards provide controllers for the VPC data plane (network access), Amazon Elastic Block Store (Amazon EBS) access, instance storage (local NVMe), as well as overall coordination for the host. By offloading these capabilities to the Nitro cards, this removes the need to use host processor resources to implement these functions, as well as offering security benefits. “ Bare metal performance with the Nitro Hypervisor

EC2 Nitro Instances Available

Importing data from S3 Bucket to Lustre

Justification for importing data from an S3 bucket into the Lustre file system, rather than using an Elastic Block Store (EBS) volume and copying the data from the S3 bucket, for the input and output data storage volume on the cluster (a configuration sketch follows the list below):

  1. Saves storage cost

  2. Removes need to copy data from S3 bucket to Lustre file system. FSx for Lustre integrates natively with Amazon S3, making it easy for you to process HPC data sets stored in Amazon S3

  3. Simplifies running HPC workloads on AWS

  4. Amazon FSx for Lustre uses parallel data transfer techniques to transfer data to and from S3 at up to hundreds of GB/s.
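A hedged sketch of the corresponding SharedStorage section in a ParallelCluster v3 configuration; the mount directory, storage capacity, and S3 bucket path are placeholders.

SharedStorage:
  - Name: LustreData
    StorageType: FsxLustre
    MountDir: /fsx
    FsxLustreSettings:
      StorageCapacity: 1200
      ImportPath: s3://your-input-bucket/CMAQ_inputs   # placeholder: your input data bucket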

Note

To find the default settings for Lustre see: Lustre Settings for ParallelCluster

Figure 1. AWS Recommended ParallelCluster Configuration (the number of compute nodes depends on the settings for NPCOL x NPROW and #SBATCH --nodes=XX, #SBATCH --ntasks-per-node=YY)

AWS ParallelCluster Configuration
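As a hypothetical illustration (these values are not from the scripts in this repository), the CMAQ domain decomposition must match the Slurm allocation: NPCOL x NPROW MPI tasks must equal nodes x ntasks-per-node.

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=36   # e.g. two c5n.18xlarge compute nodes
# and in the CMAQ run script (csh):  @ NPCOL = 8; @ NPROW = 9   # 8 x 9 = 72 = 2 x 36 MPI tasks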

Create Single VM and run CMAQv5.3.3 (software pre-installed)#

Creating an EC2 instance from either the AWS Web Interface or the Command Line is easy to do. In this tutorial we give examples of how to create and run CMAQ using EC2 instances that vary in size depending on the size of the CMAQ benchmark.

Use AWS Management Console to Create Single VM and run CMAQv5.3.3 (software pre-installed)#

Creating an EC2 instance from the AWS Management Console is easy to do. In this tutorial we give examples of how to create and run CMAQ using EC2 instances that vary in size depending on the size of the CMAQ benchmark.

Launch an EC2 Instance using the AWS Management Console SPOT Pricing

| Benchmark Name | Grid Domain | EC2 Instance | vCPU | Cores | Memory | Network Performance | Storage (EBS Only) | On Demand Hourly Cost | Spot Hourly Cost |
|----------------|-------------|--------------|------|-------|--------|----------------------|--------------------|-----------------------|------------------|
| 2016_12SE1     | (100x80x35) | c6a.2xlarge  | 8    | 4     | 16 GiB | Up to 12,500 Megabit | gp3                | 0.306                 | 0.2879           |

Data in the table above is from the following: Sizing and Price Calculator from AWS

Run CMAQv5.3.3 on a single Virtual Machine (VM) using an AMI with software pre-loaded, on a c6a.2xlarge instance with a gp3 file system.

Learn how to Use the AWS Management Console to launch EC2 instance using Public AMI#

Public AMI contains the software and data to run the 2016_12SE1 benchmark using CMAQv5.3.3#

Software was pre-installed and saved to a public ami.

The input data was also transferred from the AWS Open Data Program and installed on the EBS volume.

This chapter describes the process used in the AWS Web interface to configure and create a c6a.2xlarge EC2 instance using a public AMI, with additional instructions to use ssh to log in and run CMAQ for the 2016_12SE1 domain.

Login to the AWS Console and select EC2#

Login to AWS and then select EC2

Click on the orange “Launch Instance” button#

Click on Launch Instance

Enter the AMI name ami-019eb54acc4924d3f in the Search box and press Enter.#

Search for AMI

Click on the Community AMI tab and then click on the orange "Select" button

Choose Public AMI with CMAQ pre-installed

Note: this AMI was built for the following architecture and can be used by the c6a and hpc6a instance families#

Canonical, Ubuntu, 22.04 LTS, amd64 jammy image build on 2023-05-16

Search for c6a.2xlarge Instance Type and select#

Select c6a.2xlarge instance type

Select key pair name or create a new key pair#

Select key pair name or create new key pair

Use the default Network Settings#

Use default network settings

Configure Storage#

The AMI is preconfigured to use 500 GiB of gp3 as the root volume (Not encrypted)

Configure Storage

Select the Pull-down options for Advanced details#

Scroll down until you see option to Specify CPU cores

Click the checkbox for “Specify CPU cores”

Then select 4 Cores, and 1 thread per core

Select Advanced Details

Advanced Details turn off hyperthreading

In the Summary Menu, select Launch Instance#

Launch instance

Wait until the Status check has been completed and the Instance State is running#

Instance State running

Use the ssh command to login to the c6a.2xlarge instance#
ssh -v -Y -i ~/downloads/your-pem.pem ubuntu@xx.xxx.xxx.xxx

Run CMAQv5.3.3 on c6a.2xlarge#

Login to the ec2 instance#

Note, the following command must be modified to specify your key and the instance's public IP address (shown on the EC2 Dashboard):

ssh -v -Y -i ~/downloads/your-pem.pem ubuntu@ip.address

Login to the ec2 instance again, so that you have two windows logged into the machine.#

ssh -Y -i ~/downloads/your-pem.pem ubuntu@your-ip-address

Load the environment modules#

module avail

module load ioapi-3.2/gcc-11.3.0-netcdf  mpi/openmpi-4.1.2  netcdf-4.8.1/gcc-11.3

Update the pcluster-cmaq repo using git#

cd /shared/pcluster-cmaq

git pull

Verify that the input data is available#

Input Data for the 2016_12SE1 Benchmark

ls -lrt /shared/data/CMAQv5.3.2_Benchmark_2Day_Input/2016_12SE1/*

Run CMAQv5.3.3 for 2016_12SE1 1 Day benchmark Case on 4 pe#

' '
'LamCon_40N_97W'
  2        33.000        45.000       -97.000       -97.000        40.000
' '
'SE52BENCH'
'LamCon_40N_97W'    792000.000  -1080000.000     12000.000     12000.000 100  80   1
'2016_12SE1'
'LamCon_40N_97W'    792000.000  -1080000.000     12000.000     12000.000 100  80   1

Use command line to submit the job. This single virtual machine does not have a job scheduler such as slurm installed.#
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts
./run_cctm_Bench_2016_12SE1.csh |& tee run_cctm_Bench_2016_12SE1.log
Use HTOP to view performance.#

htop

output

If the ec2 instance was created without specifying 1 thread per core in the Advanced Settings, then it will have 8 vcpus.

Screenshot of HTOP with hyperthreading on

Successful output using the gp3 volume with hyperthreading on (8vcpus)#
==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2016-07-01
End Day:   2016-07-01
Number of Simulation Days: 1
Domain Name:               2016_12SE1
Number of Grid Cells:      280000  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       4
   All times are in seconds.

Num  Day        Wall Time
01   2016-07-01   2083.32
     Total Time = 2083.32
      Avg. Time = 2083.32

Use lscpu to examine the CPU layout of the c6a.2xlarge EC2 instance; the output below, with 8 CPUs and 2 threads per core, is from an instance created with hyperthreading on, while an instance created with 1 thread per core reports 4 CPUs instead.#

lscpu

Output:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7R13 Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            1
    BogoMIPS:            5300.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm con
                         stant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt a
                         es xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmm
                         call fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nr
                         ip_save vaes vpclmulqdq rdpid
Virtualization features: 
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    2 MiB (4 instances)
  L3:                    16 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Save output data and run script logs#

Copy the log files and the output data to an s3 bucket.

cd /shared/pcluster-cmaq/s3_scripts
cat s3_upload_cmaqv533.c6a.2xlarge.csh 

Output

#!/bin/csh -f
# Script to upload output data to S3 bucket
# need to set up your AWS credentials prior to running this script
# aws configure
# NOTE: need permission to create a bucket and write to an s3 bucket. 
# 

mkdir -p /shared/data/output_CCTM_v533_gcc_Bench_2016_12SE1/logs
mkdir -p /shared/data/output_CCTM_v533_gcc_Bench_2016_12SE1/scripts

cp /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/*.log /shared/data/output_CCTM_v533_gcc_Bench_2016_12SE1/logs/
cp  /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctm_Bench_2016_12SE1.csh /shared/data/output_CCTM_v533_gcc_Bench_2016_12SE1/scripts/

setenv BUCKET c6a.2xlarge.cmaqv533
aws s3 mb s3://$BUCKET
aws s3 cp --recursive /shared/data/output_CCTM_v533_gcc_Bench_2016_12SE1 s3://$BUCKET

Set your aws credentials by running the command

aws configure

Edit the script to create a unique bucket name

Run the script

./s3_upload_cmaqv533.c6a.2xlarge.csh

or

Save the full input data, run scripts, output data and logs to an AMI that is owned by your account.#

Go to the EC2 Dashboard#

EC2 Resources on AWS Web Console

Click on Instances Running#

Select the checkbox next to the c6a.2xlarge instance name

Select Instance on EC2 Dashboard

Select Actions Pulldown menu and select Images and templates and Create Image.#

Note, this will log you out of the EC2 instance, and should be done after all runs have been completed and you are ready to save the image.

Create Image on EC2 Dashboard

Fill out the name of the image#

Name the image to help identify the EC2 instance type, the CMAQ version installed, and perhaps the input/output data available

Confirm Save Image on EC2 Dashboard

Click Save Image#

Wait until the image status is available before terminating the EC2 instance

Click on AMIs in the left menu, then search for the image name and confirm that the status shows a green checkmark and is available#

Stop Instance#

Go to the EC2 Dashboard#

EC2 Resources on AWS Web Console

Click on Instances Running#

Select the checkbox next to the c6a.2xlarge instance name

Select Instance on EC2 Dashboard

Select Instance State Pulldown menu and select stop instance#

This will stop charges from being incurred by the ec2 instance, but you will still be charged for the gp3 volume until the ec2 instance is terminated. Typically, you would choose to stop, and then restart the instance if you plan to do additional work on it within a few hours. Otherwise, to avoid incurring costs, it is better to terminate the instance, and then recreate later from either the public AMI or your newly saved AMI.

Select Instance State Pulldown menu and select terminate instance.#

Terminate Instance on EC2 Dashboard

When the pop-up menu asks if you are sure you want to terminate the instance, click on the orange Terminate button.#

Confirm Terminate Instance on EC2 Dashboard

CMAQv5.3.3 on Single Virtual Machine Intermediate (software pre-installed)#

Creating an EC2 instance from the Command Line is easy to do. In this tutorial we give examples of how to create and run CMAQ using EC2 instances that vary in size depending on the size of the CMAQ benchmark.

Using Amazon EC2 with the AWS CLI SPOT Pricing

| Benchmark Name | Grid Domain | EC2 Instance | vCPU | Cores | Memory | Network Performance | Storage  | On Demand Hourly Cost | Spot Hourly Cost |
|----------------|-------------|--------------|------|-------|--------|----------------------|----------|-----------------------|------------------|
| 2016_12SE1     | (100x80x35) | c6a.2xlarge  | 8    | 4     | 16 GiB | Up to 12,500 Megabit | EBS Only | 0.306                 | 0.2879           |

Data in the table above is from the following: Sizing and Price Calculator from AWS

Run CMAQv5.3.3 on a single Virtual Machine (VM) using an AMI with software pre-loaded, on either a c6a.2xlarge, c6a.8xlarge, or c6a.48xlarge instance with a gp3 file system.

Learn how to Use AWS CLI to launch c6a.2xlarge EC2 instance using Public AMI#

Public AMI contains the software and data to run the 2016_12SE1 benchmark using CMAQv5.3.3#

Software was pre-installed and saved to a public ami.

The input data was also transferred from the AWS Open Data Program and installed on the EBS volume.

This chapter describes the process that was used to test and configure the c6a.2xlarge ec2 instance to run CMAQv5.3.3 for the 2016_12SE1 domain.

Todo: Need to create command line options to copy a public AMI to a different region; a possible approach is sketched below.
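A hedged sketch using aws ec2 copy-image; the destination region and new AMI name are placeholders, and copying may require that the AMI owner permits it.

aws ec2 copy-image \
    --source-region us-east-1 \
    --source-image-id ami-065049c5c78e6c6a5 \
    --region us-east-2 \
    --name cmaq-single-vm-copy   # placeholder name for the copied AMI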

Verify that you can see the public AMI on the us-east-1 region.#

aws ec2 describe-images --region us-east-1 --image-id ami-065049c5c78e6c6a5

Output:

{
    "Images": [
        {
            "Architecture": "x86_64",
            "CreationDate": "2023-06-24T00:17:02.000Z",
            "ImageId": "ami-065049c5c78e6c6a5",
            "ImageLocation": "440858712842/cmaqv5.4_c6a.48xlarge.io2.iops.100000",
            "ImageType": "machine",
            "Public": true,
            "OwnerId": "440858712842",
            "PlatformDetails": "Linux/UNIX",
            "UsageOperation": "RunInstances",
            "State": "available",
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/sda1",
                    "Ebs": {
                        "DeleteOnTermination": true,
                        "Iops": 100000,
                        "SnapshotId": "snap-08b8608dca836ef2e",
                        "VolumeSize": 500,
                        "VolumeType": "io2",
                        "Encrypted": false
                    }
                },
                {
                    "DeviceName": "/dev/sdb",
                    "VirtualName": "ephemeral0"
                },
                {
                    "DeviceName": "/dev/sdc",
                    "VirtualName": "ephemeral1"
                }
            ],
            "EnaSupport": true,
            "Hypervisor": "xen",
            "Name": "cmaqv5.4_c6a.48xlarge.io2.iops.100000",
            "RootDeviceName": "/dev/sda1",
            "RootDeviceType": "ebs",
            "SriovNetSupport": "simple",
            "VirtualizationType": "hvm",
            "DeprecationTime": "2025-06-24T00:17:02.000Z"
        }
    ]
}

Use q to exit the pager and return to the command line

Note, the AMI uses the maximum value available on io2 for Iops of 100000.

AWS Resources for the aws cli method to launch ec2 instances.#

aws cli examples

aws cli run instances command

Tutorial Launch Spot Instances

(note, it discourages the use of run-instances for launching spot instances, but they do provide an example method)

Launching EC2 Spot Instances using Run Instances API

Additional resources for spot instance provisioning.

Spot Instance Requests

To launch a Spot Instance with RunInstances API you create the configuration file as described below:

cat <<EoF > ./runinstances-config.json
{
    "DryRun": false,
    "MaxCount": 1,
    "MinCount": 1,
    "InstanceType": "c6a.2xlarge",
    "ImageId": "ami-065049c5c78e6c6a5",
    "InstanceMarketOptions": {
        "MarketType": "spot"
    },
    "TagSpecifications": [
        {
            "ResourceType": "instance",
            "Tags": [
                {
                    "Key": "Name",
                    "Value": "EC2SpotCMAQv54"
                }
            ]
        }
    ]
}
EoF
Use the publicly available AMI to launch an on-demand c6a.2xlarge EC2 instance using an io2 volume with 100,000 IOPS and hyperthreading disabled#

Note, we will be using a json file that has been preconfigured to specify the ImageId

Obtain the code using git#

git clone -b main https://github.com/CMASCenter/pcluster-cmaq

cd pcluster-cmaq/json

Note, you will need to obtain a security group id from your IT administrator that allows ssh login access. If this is enabled by default, then you can remove the --security-group-ids launch-wizard-with-tcp-access

Example command: note launch-wizard-with-tcp-access needs to be replaced by your security group ID, and your-pem key needs to be replaced by the name of your-pem.pem key.

aws ec2 run-instances --debug --key-name your-pem --security-group-ids launch-wizard-with-tcp-access --dry-run --region us-east-1 --cli-input-json file://runinstances-config.json

Command that works for UNC’s security group and pem key:

aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --dry-run --ebs-optimized --cpu-options CoreCount=4,ThreadsPerCore=1 --cli-input-json file://runinstances-config.io2.c6a.2xlarge.json

Once you have verified that the command above works with the --dry-run option, rerun it without that option as follows.

aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --ebs-optimized --cpu-options CoreCount=4,ThreadsPerCore=1 --cli-input-json file://runinstances-config.io2.c6a.2xlarge.json

Example of security group inbound and outbound rules required to connect to EC2 instance via ssh.

Inbound Rule

Outbound Rule

Additional resources

CLI commands to create Security Group

Use the following command to obtain the public IP address of the machine.#

This command cannot be run until the instance has been created; the instructions are kept here for documentation purposes.

aws ec2 describe-instances --region=us-east-1 --filters "Name=image-id,Values=ami-065049c5c78e6c6a5" | grep PublicIpAddress

Login to the ec2 instance#

Note, the following command must be modified to specify your key, and ip address (obtained from the previous command):

ssh -v -Y -i ~/downloads/your-pem.pem ubuntu@ip.address

Login to the ec2 instance again, so that you have two windows logged into the machine.#

ssh -Y -i ~/downloads/your-pem.pem ubuntu@your-ip-address

Load the environment modules#

module avail

module load ioapi-3.2/gcc-11.3.0-netcdf  mpi/openmpi-4.1.2  netcdf-4.8.1/gcc-11.3

Update the pcluster-cmaq repo using git#

cd /shared/pcluster-cmaq

git pull

Run CMAQv5.4 for 12US1 Listos Training 3 Day benchmark Case on 4 pe#

Input data is available for a subdomain of the 12km 12US1 case.

GRIDDESC

'2018_12Listos'
'LamCon_40N_97W'   1812000.000    240000.000     12000.000     12000.000   25   25    1
Use command line to submit the job. This single virtual machine does not have a job scheduler such as slurm installed.#
cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts
./run_cctm_2018_12US1_listos.csh |& tee ./run_cctm_2018_12US1_listos.c6a.2xlarge.log
Use HTOP to view performance.#

htop

output

Screenshot of HTOP

Successful output#
==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-08-05
End Day:   2018-08-07
Number of Simulation Days: 3
Domain Name:               2018_12Listos
Number of Grid Cells:      21875  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       4
   All times are in seconds.

Num  Day        Wall Time
01   2018-08-05   166.7
02   2018-08-06   167.0
03   2018-08-07   171.3
     Total Time = 505.00
      Avg. Time = 168.33

Note, this took longer than the run done using c6a.48xlarge, where 32 cores were used. The c6a.2xlarge also has smaller cache sizes than the c6a.48xlarge, which you can see when you compare output of the lscpu command.

Change to the scripts directory#

cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts/

Use lscpu to confirm that there are 4 cores on the c6a.2xlarge ec2 instance that was created with hyperthreading turned off.#

lscpu

Output:

lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7R13 Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  1
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            1
    BogoMIPS:            5299.98
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdt
                         scp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x
                         2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_s
                         ingle ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clze
                         ro xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Virtualization features: 
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    2 MiB (4 instances)
  L3:                    16 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-3
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Edit the 12NE3 Benchmark run script to use the gcc compiler and to output all species to the CONC output file.#

cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts/

vi run_cctm_Bench_2018_12NE3.c6a48xlarge.csh

change

   setenv compiler intel

to

   setenv compiler gcc

Comment out the CONC_SPCS setting that limits them to only 12 species

   # setenv CONC_SPCS "O3 NO ANO3I ANO3J NO2 FORM ISOP NH3 ANH4I ANH4J ASO4I ASO4J" 

Change the NPCOL, NPROW to run on 4 cores

   @ NPCOL  =  2; @ NPROW =  2
Run the 12NE3 Benchmark case#
./run_cctm_Bench_2018_12NE3.c6a.2xlarge.csh |& tee ./run_cctm_Bench_2018_12NE3.c6a.2xlarge.4pe.log
Use HTOP to view performance.#

htop

output

Screenshot of HTOP

Note, this 12NE3 Domain uses more memory, and takes longer than the 12LISTOS-Training Domain. It also takes longer to run using 4 cores on c6a.2xlarge instance than on 32 cores on c6a.48xlarge instance.

Successful output for 12 species output in the 3-D CONC file took 56 minutes to run 1 day#
==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-07-01
End Day:   2018-07-01
Number of Simulation Days: 1
Domain Name:               2018_12NE3
Number of Grid Cells:      367500  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       4
   All times are in seconds.

Num  Day        Wall Time
01   2018-07-01   3410.99
     Total Time = 3410.99
      Avg. Time = 3410.99

Compared to the timing for running on 32 processors, which took 444.34 seconds, this is a factor of 7.67 (3410.99 / 444.34), close to the ideal factor of 8 expected when using 8x as many cores.

Find the InstanceID using the following command on your local machine.#

aws ec2 describe-instances --region=us-east-1 | grep InstanceId

Output

i-xxxx

Stop the instance#

aws ec2 stop-instances --region=us-east-1 --instance-ids i-xxxx

You may get the following error message if the instance was launched as a one-time Spot Instance.

aws ec2 stop-instances --region=us-east-1 --instance-ids i-041a702cc9f7f7b5d

An error occurred (UnsupportedOperation) when calling the StopInstances operation: You can’t stop the Spot Instance ‘i-041a702cc9f7f7b5d’ because it is associated with a one-time Spot Instance request. You can only stop Spot Instances associated with persistent Spot Instance requests.

Not sure how to make a persistent Spot Instance request.

Terminate Instance#

aws ec2 terminate-instances --region=us-east-1 --instance-ids i-xxxx

Verify that the instance is being shut down.#

aws ec2 describe-instances --region=us-east-1

Learn how to Use AWS CLI to launch c6a.8xlarge EC2 instance using Public AMI#

Public AMI contains the software and data to run 2016_12SE1 using CMAQv5.3.3#

Software was pre-installed and saved to a public ami.

The input data was also transferred from the AWS Open Data Program and installed on the EBS volume.

This chapter describes the process that was used to test and configure the c6a.8xlarge ec2 instance to run CMAQv5.3.3 for the 12SE1 domain.

Todo: Need to create command line options to copy a public ami to a different region.

Verify that you can see the public AMI on the us-east-1 region.#

aws ec2 describe-images --region us-east-1 --image-id ami-088f82f334dde0c9f

Output:

{
    "Images": [
        {
            "Architecture": "x86_64",
            "CreationDate": "2023-06-26T18:17:08.000Z",
            "ImageId": "ami-088f82f334dde0c9f",
            "ImageLocation": "440858712842/EC2CMAQv54io2_12LISTOS-training_12NE3_12US1",
            "ImageType": "machine",
            "Public": true,
            "OwnerId": "440858712842",
            "PlatformDetails": "Linux/UNIX",
            "UsageOperation": "RunInstances",
            "State": "available",
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/sda1",
                    "Ebs": {
                        "DeleteOnTermination": true,
                        "Iops": 100000,
                        "SnapshotId": "snap-042b05034228ec830",
                        "VolumeSize": 500,
                        "VolumeType": "io2",
                        "Encrypted": false
                    }
                },
                {
                    "DeviceName": "/dev/sdb",
                    "VirtualName": "ephemeral0"
                },
                {
                    "DeviceName": "/dev/sdc",
                    "VirtualName": "ephemeral1"
                }
            ],
            "EnaSupport": true,
            "Hypervisor": "xen",
            "Name": "EC2CMAQv54io2_12LISTOS-training_12NE3_12US1",
            "RootDeviceName": "/dev/sda1",
            "RootDeviceType": "ebs",
            "SriovNetSupport": "simple",
            "VirtualizationType": "hvm",
            "DeprecationTime": "2025-06-26T18:17:08.000Z"
        }
    ]
}


Use q to exit the pager and return to the command line

Note, the AMI uses the maximum value available on io2 for Iops of 100000.

AWS Resources for the aws cli method to launch ec2 instances.#

aws cli examples

aws cli run instances command

Tutorial Launch Spot Instances

(note, it discourages the use of run-instances for launching spot instances, but they do provide an example method)

Launching EC2 Spot Instances using Run Instances API

Additional resources for spot instance provisioning.

Spot Instance Requests

To launch a Spot Instance with RunInstances API you create the configuration file as described below:

cat <<EoF > ./runinstances-config.json
{
    "DryRun": false,
    "MaxCount": 1,
    "MinCount": 1,
    "InstanceType": "c6a.8xlarge",
    "ImageId": "ami-088f82f334dde0c9f",
    "InstanceMarketOptions": {
        "MarketType": "spot"
    },
    "TagSpecifications": [
        {
            "ResourceType": "instance",
            "Tags": [
                {
                    "Key": "Name",
                    "Value": "EC2SpotCMAQv54"
                }
            ]
        }
    ]
}
EoF
Use the publicly available AMI to launch an on-demand c6a.8xlarge EC2 instance using an io2 volume with 100,000 IOPS and hyperthreading disabled#

Note, we will be using a json file that has been preconfigured to specify the ImageId

Obtain the code using git#

git clone -b main https://github.com/CMASCenter/pcluster-cmaq

cd pcluster-cmaq/json

Note, you will need to obtain a security group id from your IT administrator that allows ssh login access. If this is enabled by default, then you can remove the --security-group-ids launch-wizard-with-tcp-access

Example command: note launch-wizard-with-tcp-access needs to be replaced by your security group ID, and your-pem key needs to be replaced by the name of your-pem.pem key.

aws ec2 run-instances --debug --key-name your-pem --security-group-ids launch-wizard-with-tcp-access --dry-run --region us-east-1 --cli-input-json file://runinstances-config.json

Command that works for UNC’s security group and pem key:

aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --dry-run --ebs-optimized --cpu-options CoreCount=16,ThreadsPerCore=1 --cli-input-json file://runinstances-config.io2.c6a.8xlarge.json

Once you have verified that the command above works with the --dry-run option, rerun it without that option as follows.

aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --ebs-optimized --cpu-options CoreCount=16,ThreadsPerCore=1 --cli-input-json file://runinstances-config.io2.c6a.8xlarge.json

Use q to quit to return to the command prompt.

Example of security group inbound and outbound rules required to connect to EC2 instance via ssh.

Inbound Rule

Outbound Rule

Additional resources

CLI commands to create Security Group

Use the following command to obtain the public IP address of the machine.#

This command cannot be run until the instance has been created; the instructions are kept here for documentation purposes.

aws ec2 describe-instances --region=us-east-1 --filters "Name=image-id,Values=ami-088f82f334dde0c9f" | grep PublicIpAddress

Login to the ec2 instance (may need to wait 5 minutes for the ec2 instance to initialize and be ready for login)#

Note, the following command must be modified to specify your key, and ip address (obtained from the previous command):

ssh -v -Y -i ~/downloads/your-pem.pem ubuntu@ip.address

Login to the ec2 instance again, so that you have two windows logged into the machine.#

ssh -Y -i ~/your-pem.pem ubuntu@your-ip-address

Load the environment modules#

module avail

module load ioapi-3.2/gcc-11.3.0-netcdf  mpi/openmpi-4.1.2  netcdf-4.8.1/gcc-11.3

Update the pcluster-cmaq repo using git#

cd /shared/pcluster-cmaq

git pull

Run CMAQv5.3.3 for 2016_12SE1 1 Day benchmark Case#
GRIDDESC

' '
'LamCon_40N_97W'
  2        33.000        45.000       -97.000       -97.000        40.000
' '
'SE53BENCH'
'LamCon_40N_97W'    792000.000  -1080000.000     12000.000     12000.000 100  80   1
'2016_12SE1'
'LamCon_40N_97W'    792000.000  -1080000.000     12000.000     12000.000 100  80   1

Edit the run script to run on 16 cores#
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/
cp run_cctm_Bench_2016_12SE1.csh run_cctm_Bench_2016_12SE1.16pe.csh

Change NPCOL x NPROW to 4 x 4, as shown below.
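Assuming the copied script uses the same decomposition syntax as the 4 pe run script shown earlier in this tutorial, the edited line would read:

   @ NPCOL  =  4; @ NPROW =  4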

Use command line to submit the job. This single virtual machine does not have a job scheduler such as slurm installed.#
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/
./run_cctm_Bench_2016_12SE1.16pe.csh |& tee ./run_cctm_Bench_2016_12SE1.16pe.log
Use HTOP to view performance.#

htop

output

Screenshot of HTOP

Successful output#

Find the InstanceID using the following command on your local machine.#

aws ec2 describe-instances --region=us-east-1 | grep InstanceId

Output

i-xxxx

Stop the instance#

aws ec2 stop-instances --region=us-east-1 --instance-ids i-xxxx

You may get the following error message if the instance was launched as a one-time Spot Instance.

aws ec2 stop-instances --region=us-east-1 --instance-ids i-041a702cc9f7f7b5d

An error occurred (UnsupportedOperation) when calling the StopInstances operation: You can’t stop the Spot Instance ‘i-041a702cc9f7f7b5d’ because it is associated with a one-time Spot Instance request. You can only stop Spot Instances associated with persistent Spot Instance requests.

Not sure how to make a persistent Spot Instance request.

Terminate Instance#

aws ec2 terminate-instances --region=us-east-1 --instance-ids i-xxxx

Verify that the instance is being shut down.#

aws ec2 describe-instances --region=us-east-1

Learn how to Use AWS CLI to launch c6a.48xlarge EC2 instance using Public AMI#

Public AMI contains the software and data to run 2016_12SE1 using CMAQv5.3.3#

Software was pre-installed and saved to a public ami.

The input data was also transferred from the AWS Open Data Program and installed on the EBS volume.

This chapter describes the process that was used to test and configure the c6a.48xlarge EC2 instance to run CMAQv5.3.3 for the 2016_12SE1 domain.

Todo: Need to create command line options to copy a public ami to a different region.

Verify that you can see the public AMI on the us-east-1 region.#

aws ec2 describe-images --region us-east-1 --image-id ami-051ba52c157e4070c

Output:

{
    "Images": [
        {
            "Architecture": "x86_64",
            "CreationDate": "2023-06-26T18:17:08.000Z",
            "ImageId": "ami-088f82f334dde0c9f",
            "ImageLocation": "440858712842/EC2CMAQv54io2_12LISTOS-training_12NE3_12US1",
            "ImageType": "machine",
            "Public": true,
            "OwnerId": "440858712842",
            "PlatformDetails": "Linux/UNIX",
            "UsageOperation": "RunInstances",
            "State": "available",
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/sda1",
                    "Ebs": {
                        "DeleteOnTermination": true,
                        "Iops": 100000,
                        "SnapshotId": "snap-042b05034228ec830",
                        "VolumeSize": 500,
                        "VolumeType": "io2",
                        "Encrypted": false
                    }
                },
                {
                    "DeviceName": "/dev/sdb",
                    "VirtualName": "ephemeral0"
                },
                {
                    "DeviceName": "/dev/sdc",
                    "VirtualName": "ephemeral1"
                }
            ],
            "EnaSupport": true,
            "Hypervisor": "xen",
            "Name": "EC2CMAQv54io2_12LISTOS-training_12NE3_12US1",
            "RootDeviceName": "/dev/sda1",
            "RootDeviceType": "ebs",
            "SriovNetSupport": "simple",
            "VirtualizationType": "hvm",
            "DeprecationTime": "2025-06-26T18:17:08.000Z"
        }
    ]
}

Use q to exit the pager and return to the command line

Note, the AMI uses the maximum value available on io2 for Iops of 100000.

AWS Resources for the aws cli method to launch ec2 instances.#

aws cli examples

aws cli run instances command

Tutorial Launch Spot Instances

(note, it discourages the use of run-instances for launching spot instances, but they do provide an example method)

Launching EC2 Spot Instances using Run Instances API

Additional resources for spot instance provisioning.

Spot Instance Requests

To launch a Spot Instance with RunInstances API you create the configuration file as described below:

cat <<EoF > ./runinstances-config.json
{
    "DryRun": false,
    "MaxCount": 1,
    "MinCount": 1,
    "InstanceType": "c6a.48xlarge",
    "ImageId": "ami-088f82f334dde0c9f",
    "InstanceMarketOptions": {
        "MarketType": "spot"
    },
    "TagSpecifications": [
        {
            "ResourceType": "instance",
            "Tags": [
                {
                    "Key": "Name",
                    "Value": "EC2SpotCMAQv54"
                }
            ]
        }
    ]
}
EoF
Use the publicly available AMI to launch an on-demand c6a.48xlarge EC2 instance using a gp3 volume with 16,000 IOPS and hyperthreading disabled#

Note, we will be using a json file that has been preconfigured to specify the ImageId

Obtain the code using git#

git clone -b main https://github.com/CMASCenter/pcluster-cmaq

cd pcluster-cmaq/json

Note, you will need to obtain a security group id from your IT administrator that allows ssh login access. If this is enabled by default, then you can remove the --security-group-ids launch-wizard-with-tcp-access

Example command: note launch-wizard-with-tcp-access needs to be replaced by your security group ID, and your-pem key needs to be replaced by the name of your-pem.pem key.

aws ec2 run-instances --debug --key-name your-pem --security-group-ids launch-wizard-with-tcp-access --dry-run --region us-east-1 --cli-input-json file://runinstances-config.json

Command that works for UNC’s security group and pem key:

aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --dry-run --ebs-optimized --cpu-options CoreCount=96,ThreadsPerCore=1 --cli-input-json file://runinstances-config.io2.json

Once you have verified that the command above works with the --dry-run option, rerun it without that option as follows.

aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --ebs-optimized --cpu-options CoreCount=96,ThreadsPerCore=1 --cli-input-json file://runinstances-config.io2.json

Example of security group inbound and outbound rules required to connect to EC2 instance via ssh.

Inbound Rule

Outbound Rule

Additional resources

CLI commands to create Security Group

Use the following command to obtain the public IP address of the machine.#

This command cannot be run until the instance has been created; the instructions are kept here for documentation purposes.

aws ec2 describe-instances --region=us-east-1 --filters "Name=image-id,Values=ami-088f82f334dde0c9f" | grep PublicIpAddress

Login to the ec2 instance#

Note, the following command must be modified to specify your key and the IP address obtained from the previous command. You will get a connection refused error if you try to log in before the EC2 instance is ready (initialization takes ~5 minutes).

ssh -v -Y -i ~/downloads/your-pem.pem ubuntu@ip.address

Login to the ec2 instance again, so that you have two windows logged into the machine.#

ssh -Y -i ~/downloads/your-pem.pem ubuntu@your-ip-address

Load the environment modules#

module avail

module load ioapi-3.2/gcc-11.3.0-netcdf  mpi/openmpi-4.1.2  netcdf-4.8.1/gcc-11.3

Update the pcluster-cmaq repo using git#

cd /shared/pcluster-cmaq

git pull

Run CMAQv5.3.3 for 2016_12SE1 1 Day benchmark Case on 4 pe#
' '
'LamCon_40N_97W'
  2        33.000        45.000       -97.000       -97.000        40.000
' '
'SE53BENCH'
'LamCon_40N_97W'    792000.000  -1080000.000     12000.000     12000.000 100  80   1
'2016_12SE1'
'LamCon_40N_97W'    792000.000  -1080000.000     12000.000     12000.000 100  80   1

Use command line to submit the job. This single virtual machine does not have a job scheduler such as slurm installed.#
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/
./run_cctm_Bench_2016_12SE1.csh |& tee run_cctm_Bench_2016_12SE1.log
Use HTOP to view performance.#

htop

output

Screenshot of HTOP

Successful output#
==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2016-07-01
End Day:   2016-07-01
Number of Simulation Days: 1
Domain Name:               2016_12SE1
Number of Grid Cells:      280000  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       4
   All times are in seconds.

Num  Day        Wall Time
01   2016-07-01   2083.32
     Total Time = 2083.32
      Avg. Time = 2083.32

Use lscpu to confirm that there are 8 processors on the c6a.2xlarge ec2 instance that was created with hyperthreading turned on.#

lscpu

Output:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7R13 Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            1
    BogoMIPS:            5300.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm con
                         stant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt a
                         es xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmm
                         call fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nr
                         ip_save vaes vpclmulqdq rdpid
Virtualization features: 
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    2 MiB (4 instances)
  L3:                    16 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Run 12US2 benchmark again using gp3 volume#

Stop the instance#

aws ec2 stop-instances --region=us-east-1 --instance-ids i-xxxx

You may get the following error message if the instance was launched as a one-time Spot Instance.

aws ec2 stop-instances --region=us-east-1 --instance-ids i-041a702cc9f7f7b5d

An error occurred (UnsupportedOperation) when calling the StopInstances operation: You can’t stop the Spot Instance ‘i-041a702cc9f7f7b5d’ because it is associated with a one-time Spot Instance request. You can only stop Spot Instances associated with persistent Spot Instance requests.

Not sure how to make a persistent Spot Instance request.

Terminate Instance#

aws ec2 terminate-instances --region=us-east-1 --instance-ids i-xxxx

Verify that the instance is being shut down.#

aws ec2 describe-instances --region=us-east-1

Documentation of Troubleshooting effort for CMAQv5.4+ on 12US1#

Public AMI contains the software and data to run 12US1 using CMAQv5.4+#

Software was pre-installed and saved to a public ami.

The input data was also transferred from the AWS Open Data Program and installed on the EBS volume.

This chapter describes the process that was used to test and configure the c6a.48xlarge ec2 instance to run CMAQv5.4 for the 12US1 domain.

Todo: Need to create command line options to copy a public ami to a different region.

Verify that you can see the public AMI on the us-east-1 region.#

aws ec2 describe-images --region us-east-1 --image-id ami-0aaa0cfeb5ed5763c

Output:

{
    "Images": [
        {
            "Architecture": "x86_64",
            "CreationDate": "2023-06-07T02:52:26.000Z",
            "ImageId": "ami-0aaa0cfeb5ed5763c",
            "ImageLocation": "440858712842/cmaqv5.4_c6a.48xlarge",
            "ImageType": "machine",
            "Public": true,
            "OwnerId": "440858712842",
            "PlatformDetails": "Linux/UNIX",
            "UsageOperation": "RunInstances",
            "State": "available",
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/sda1",
                    "Ebs": {
                        "DeleteOnTermination": true,
                        "Iops": 4000,
                        "SnapshotId": "snap-0c2f11a82e76aac9b",
                        "VolumeSize": 500,
                        "VolumeType": "gp3",
                        "Throughput": 1000,
                        "Encrypted": false
                    }
                },
                {
                    "DeviceName": "/dev/sdb",
                    "VirtualName": "ephemeral0"
                },
                {
                    "DeviceName": "/dev/sdc",
                    "VirtualName": "ephemeral1"
                }
            ],
            "EnaSupport": true,
            "Hypervisor": "xen",
            "Name": "cmaqv5.4_c6a.48xlarge",
            "RootDeviceName": "/dev/sda1",
            "RootDeviceType": "ebs",
            "SriovNetSupport": "simple",
            "VirtualizationType": "hvm",
            "DeprecationTime": "2025-06-07T02:52:26.000Z"
        }
    ]
}

Note that the above AMI has the maximum gp3 throughput limit of 1,000 MB/s, but an IOPS limit of 4,000, which caused the I/O issues documented below.

The solution is to update the volume to use the maximum gp3 IOPS value of 16,000, and then save the EC2 instance as a new AMI that will have the highest IOPS and throughput for the gp3 VolumeType. The following is a screenshot of the option to do this within the AWS Web Interface. A command line method is sketched below; full documentation will be saved for the advanced tutorial.

EC2 Modify Volume
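A minimal sketch of the equivalent change from the AWS CLI; the volume ID is a placeholder (it can be found with aws ec2 describe-volumes), and this is an assumption about one possible approach rather than the documented procedure.

aws ec2 modify-volume \
    --volume-id vol-0123456789abcdef0 \
    --volume-type gp3 \
    --iops 16000 \
    --throughput 1000   # raise both IOPS and throughput to the gp3 maximums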

AWS Resources for the aws cli method to launch ec2 instances.#

aws cli examples

aws cli run instances command

Tutorial Launch Spot Instances

(note, it discourages the use of run-instances for launching spot instances, but they do provide an example method)

Launching EC2 Spot Instances using Run Instances API

Additional resources for spot instance provisioning.

Spot Instance Requests

To launch a Spot Instance with RunInstances API you create the configuration file as described below:

cat <<EoF > ./runinstances-config.json
{
    "DryRun": false,
    "MaxCount": 1,
    "MinCount": 1,
    "InstanceType": "c6a.48xlarge",
    "ImageId": "ami-0aaa0cfeb5ed5763c",
    "InstanceMarketOptions": {
        "MarketType": "spot"
    },
    "TagSpecifications": [
        {
            "ResourceType": "instance",
            "Tags": [
                {
                    "Key": "Name",
                    "Value": "EC2SpotCMAQv54"
                }
            ]
        }
    ]
}
EoF


Use a publicly available AMI to launch a c6a.48xlarge EC2 instance using a gp3 volume with 16,000 IOPS#

Launch a new instance using the AMI with the software loaded and request a Spot Instance for the c6a.48xlarge EC2 instance

Note, we will be using a json file that has been preconfigured to specify the ImageId

cd /shared/pcluster-cmaq

Note, you will need to obtain a security group id from your IT administrator that allows ssh login access. If this is enabled by default, then you can remove the --security-group-ids launch-wizard-with-tcp-access

Example command: note that launch-wizard-with-tcp-access needs to be replaced by your security group ID, and your-pem needs to be replaced by the name of your .pem key.

aws ec2 run-instances --debug --key-name your-pem --security-group-ids launch-wizard-with-tcp-access --dry-run --region us-east-1 --cli-input-json file://runinstances-config.json

Command that works for UNC’s security group and pem key:

aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --dry-run --ebs-optimized --cpu-options CoreCount=96,ThreadsPerCore=1 --cli-input-json file://runinstances-config.hyperthread-off.16000IOPS.json

Once you have verified that the command above works with the --dry-run option, rerun it without that option as follows.

aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --ebs-optimized --cpu-options CoreCount=96,ThreadsPerCore=1 --cli-input-json file://runinstances-config.hyperthread-off.16000IOPS.json

Example of security group inbound and outbound rules required to connect to EC2 instance via ssh.

Inbound Rule

Outbound Rule

(Security groups and their rules can also be created from the aws command line; see the resources below.)

Additional resources

CLI commands to create Security Group
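
For reference, a minimal sketch of creating such a security group from the aws cli (vpc-xxxxxxxx is a placeholder for your VPC ID, and sg-xxxxxxxx is a placeholder for the GroupId returned by the first command; restrict the CIDR range to the addresses you ssh from):

# create a security group in your VPC (placeholder IDs and example group name)
aws ec2 create-security-group --region us-east-1 --group-name cmaq-ssh-access --description "ssh access to CMAQ EC2 instance" --vpc-id vpc-xxxxxxxx

# allow inbound ssh (tcp port 22) from a restricted address range
aws ec2 authorize-security-group-ingress --region us-east-1 --group-id sg-xxxxxxxx --protocol tcp --port 22 --cidr 203.0.113.0/24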

Use the following command to obtain the public IP address of the machine.#

This command is commented out, as the instance hasn't been created yet; the instructions are kept for documentation purposes.

aws ec2 describe-instances --region=us-east-1 --filters "Name=image-id,Values=ami-0aaa0cfeb5ed5763c" | grep PublicIpAddress
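
If you prefer not to grep the JSON output, the same information can be pulled out with the --query option; this variant also filters to running instances:

aws ec2 describe-instances --region=us-east-1 --filters "Name=image-id,Values=ami-0aaa0cfeb5ed5763c" "Name=instance-state-name,Values=running" --query "Reservations[].Instances[].PublicIpAddress" --output text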

Login to the ec2 instance#

Note, the following command must be modified to specify your key, and ip address (obtained from the previous command):

ssh -v -Y -i ~/downloads/your-pem.pem ubuntu@ip.address

Load the environment modules#

module avail

module load ioapi-3.2/gcc-11.3.0-netcdf  mpi/openmpi-4.1.2  netcdf-4.8.1/gcc-11.3

Run CMAQv5.4 for the 12km Listos Training Case#

Input data is available for a subdomain of the 12km 12US1 case.

GRIDDESC

'2018_12Listos'
'LamCon_40N_97W'   1812000.000    240000.000     12000.000     12000.000   25   25    1
cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts
./run_cctm_2018_12US1_listos_32pe.csh |& tee ./run_cctm_2018_12US1_listos_32pe.log

Successful output:

==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-08-05
End Day:   2018-08-07
Number of Simulation Days: 3
Domain Name:               2018_12Listos
Number of Grid Cells:      21875  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       32
   All times are in seconds.

Num  Day        Wall Time
01   2018-08-05   69.9
02   2018-08-06   64.7
03   2018-08-07   66.5
     Total Time = 201.10
      Avg. Time = 67.03

Run CMAQv5.4 for the full 12US1 Domain on c6a.48xlarge with 192 vcpus#
GRIDDESC
' '  !  end coords.  grids:  name; xorig yorig xcell ycell ncols nrows nthik
'12US1'
'LAM_40N97W'  -2556000.   -1728000.   12000.  12000.  459  299    1

Input data for the 12US1 domain is available for a 2 day benchmark, in both netCDF4 compressed (.nc4) and classic netCDF-3 (.nc) formats. The 96 pe run on the c6a.48xlarge instance will take approximately 120 minutes for 1 day, or 240 minutes for the full 2 day benchmark.

Options that were used to disable multithreading:

--cpu-options (structure)

    The CPU options for the instance. For more information, see Optimize CPU options in the Amazon EC2 User Guide.

    CoreCount -> (integer)

        The number of CPU cores for the instance.

    ThreadsPerCore -> (integer)

        The number of threads per CPU core. To disable multithreading for the instance, specify a value of 1. Otherwise, specify the default value of 2.

--cpu-options CoreCount=integer,ThreadsPerCore=integer,AmdSevSnp=string

JSON Syntax:

{
  "CoreCount": integer,
  "ThreadsPerCore": integer,
  "AmdSevSnp": "enabled"|"disabled"
}
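
If you prefer to keep these settings in the JSON file rather than on the command line, the run-instances input JSON also accepts a CpuOptions block. The fragment below is a sketch using the same c6a.48xlarge values as this tutorial; verify the field names against the skeleton produced by aws ec2 run-instances --generate-cli-skeleton:

{
    "InstanceType": "c6a.48xlarge",
    "CpuOptions": {
        "CoreCount": 96,
        "ThreadsPerCore": 1
    }
}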


Use the command line to submit the job. This single virtual machine does not have a job scheduler such as Slurm installed.#
cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts

./run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.8x12.ncclassic.csh |& tee ./run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.8x12.ncclassic.2nd.log

Spot Pricing cost for Linux in US East Region

c6a.48xlarge $5.88 per Hour

Rerunning the 12US1 case on 8x12 processors, for a total of 96 processors.

It took about 39 minutes of initial I/O before the model started when using this gp3 AMI. Fahim was not able to reproduce this performance issue, and it is unclear how to diagnose it. When the AMI was upgraded to use an io2 disk, this poor I/O issue was resolved.

Once the model starts running (look for "Processing completed ..." in the log file), use htop to view the CPU usage.#

Login to the virtual machine and then run the following command.

htop

Screenshot of HTOP for CMAQv5.4 on c6a.48xlarge

Using Cloudwatch to see the CPU utilization.#

Note that we are using 96 pes of the 192 virtual cpus, so the maximum cpu utilization reported would be 50%.

Screenshot of Cloudwatch for CMAQv5.4 on c6a.48xlarge using spot pricing

Successful run output, but it is taking too long (twice as long as on the Parallel Cluster).

==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2017-12-22
End Day:   2017-12-23
Number of Simulation Days: 2
Domain Name:               12US1
Number of Grid Cells:      4803435  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       96
   All times are in seconds.

Num  Day        Wall Time
01   2017-12-22   6320.8
02   2017-12-23   5409.6
     Total Time = 11730.40
      Avg. Time = 5865.20

Perhaps the instance is being I/O throttled?

ebs-volume-io-queue-latency-issues

Trying this CloudWatch Report

EBS Volume Throughput Limits

This report indicates that the maximum throughput for this gp3 volume is 1,000 MiB/s, and the baseline throughput limit is 125 MiB/s. The same report needs to be run for the io2 volume to see what its values are.

EBS Volume Throughput

Volume ID: vol-050662148aef41b8f
Instance ID: i-0c2615494c0a89ea9

You can use the AWS Web Interface to get an estimate of the savings of using a Spot versus an On-Demand instance.

Save volume as a snapshot#

Save the volume as a snapshot so that a copy of the log files showing the poor performance of the Spot Instance is retained. After the snapshot is created, the instance will be deleted. The snapshot name is c6a.48xlarge.cmaqv54.spot, snap-0cc3df82ba5bf5da8
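
A sketch of the same step with the aws cli, using the volume and snapshot identifiers noted in this section (the second command simply polls the snapshot state):

# create a snapshot of the instance volume and tag it with the name used above
aws ec2 create-snapshot --region us-east-1 --volume-id vol-050662148aef41b8f --description "c6a.48xlarge.cmaqv54.spot" --tag-specifications 'ResourceType=snapshot,Tags=[{Key=Name,Value=c6a.48xlarge.cmaqv54.spot}]'

# check whether the snapshot has finished (state should become "completed")
aws ec2 describe-snapshots --region us-east-1 --snapshot-ids snap-0cc3df82ba5bf5da8 --query "Snapshots[].State"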

Clean up Virtual Machine#
Find the InstanceID using the following command on your local machine.#

## aws ec2 describe-instances --region=us-east-1 | grep InstanceId

Output

i-xxxx

Terminate the instance#

## aws ec2 terminate-instances --region=us-east-1 --instance-ids i-xxxx

Commands for terminating EC2 instance from CLI

Create c6a.48xlarge with hyperthreading disabled#

## aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --ebs-optimized --dry-run --cpu-options CoreCount=96,ThreadsPerCore=1 --cli-input-json file://runinstances-config.json

(note, take out the --dry-run option after you have verified that it works)

Obtain the public IP address for the virtual machine

## aws ec2 describe-instances --region=us-east-1 --filters "Name=image-id,Values=ami-0aaa0cfeb5ed5763c" | grep PublicIpAddress

Login to the machine

## ssh -v -Y -i ~/your-pem.pem ubuntu@your-ip-address

Retry the Listos run script.#
## cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts
## ./run_cctm_2018_12US1_listos_32pe.csh |& tee ./run_cctm_2018_12US1_listos_32pe.log

Use HTOP to view performance.#

htop

output

Screenshot of HTOP

Successful output#
==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-08-05
End Day:   2018-08-07
Number of Simulation Days: 3
Domain Name:               2018_12Listos
Number of Grid Cells:      21875  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       32
   All times are in seconds.

Num  Day        Wall Time
01   2018-08-05   87.6
02   2018-08-06   77.9
03   2018-08-07   77.2
     Total Time = 242.70
      Avg. Time = 80.90

Retried the 12US1 benchmark case, but the I/O was still too slow.

Used the AWS Web Interface to upgrade to an io1 system#

Choosing EBS Storage Type

After upgrading to the io1 volume, the performance was much improved.

Now, we need to examine the cost, and whether it would cost less for an io2 volume.

Screenshot of AWS Web Interface after Storage Upgrade to io1

HTOP after upgrade storage

Additional information about how to calculate storage pricing.

EBS Pricing

Good comparison of EBS vs EFS, and discussion of using Cloud Volumes ONTAP for data tiering between S3 Buckets and EBS volumes.

Comparison between EBS and EFS

The aws cli can also be used to modify the volume as per these instructions.

aws cli modify volume

Output

==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2017-12-22
End Day:   2017-12-23
Number of Simulation Days: 2
Domain Name:               12US1
Number of Grid Cells:      4803435  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       96
   All times are in seconds.

Num  Day        Wall Time
01   2017-12-22   3045.2
02   2017-12-23   3351.8
     Total Time = 6397.00
      Avg. Time = 3198.50

Saved the EC2 instance as an AMI and made that AMI public.
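
For reference, the equivalent steps can be done with the aws cli. The sketch below uses the instance ID reported earlier and the resulting AMI ID listed below; the image name is illustrative only, and create-image reboots the instance unless --no-reboot is added:

# create a new AMI from the running instance (illustrative image name)
aws ec2 create-image --region us-east-1 --instance-id i-0c2615494c0a89ea9 --name "cmaqv5.4_c6a.48xlarge_io1"

# make the resulting AMI public
aws ec2 modify-image-attribute --region us-east-1 --image-id ami-031a6e4499abffdb6 --launch-permission "Add=[{Group=all}]"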

Use new ami instance with faster storage (io1) to create c6a.48xlarge ec2 instance#

Note: these commands should work, using a runinstances-config.json file that is in the /shared/pcluster-cmaq directory. (It has already been edited to specify the AMI listed below.)

The your-key.pem key and the runinstances-config.json file should be copied to the same directory before using the aws cli instructions below.

New AMI instance name to use for CMAQv5.4 on c6a.48xlarge using 500 GB io1 Storage.

ami-031a6e4499abffdb6

Edit runinstances-config.json to use the new AMI.

Update the ImageId line to the following:

    "ImageId": "ami-031a6e4499abffdb6",
Create new instance#

Note, you will need to obtain a security group id from your IT administrator that allows ssh login access. If this is enabled by default, then you can remove the –security-group-ids your-security-group-with-ssh-access-to-Instance option.

Note, you will need to create or have a keypair that will be used to login to the ec2 instance that you create.

Replacing Key Pair

Create c6a.48xlarge instance:

aws ec2 run-instances --debug --key-name your-pem --security-group-ids your-security-group-with-ssh-access-to-Instance --region us-east-1 --ebs-optimized --dry-run --cpu-options CoreCount=96,ThreadsPerCore=1 --cli-input-json file://runinstances-config.json

Take out the --dry-run option after you see the following message:

botocore.exceptions.ClientError: An error occurred (DryRunOperation) when calling the RunInstances operation: Request would have succeeded, but DryRun flag is set.

Re-try creating the c6a.48xlarge instance without the --dry-run option:

aws ec2 run-instances --debug --key-name your-pem --security-group-ids your-security-group-with-ssh-access-to-Instance --region us-east-1 --ebs-optimized --cpu-options CoreCount=96,ThreadsPerCore=1 --cli-input-json file://runinstances-config.json

Check that the ec2 instance is running using the following command.#

aws ec2 describe-instances --region=us-east-1

Use the following command to obtain the IP address#

aws ec2 describe-instances --region=us-east-1  | grep PublicIpAddress

Login#

ssh -v -Y -i ~/your-pem.pem ubuntu@your-publicIpAddress

Load environment modules#

module avail

module load ioapi-3.2/gcc-11.3.0-netcdf  mpi/openmpi-4.1.2  netcdf-4.8.1/gcc-11.3

Change to the scripts directory#

cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts/

Use lscpu to confirm that there are 96 processors on the c6a.48xlarge ec2 instance that was created with hyperthreading turned off.#

lscpu

Output:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  96
  On-line CPU(s) list:   0-95
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7R13 Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  1
    Core(s) per socket:  48
    Socket(s):           2
    Stepping:            1
    BogoMIPS:            5299.98
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxs
                         r_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq m
                         onitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_l
                         egacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 a
                         vx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat
                          npt nrip_save vaes vpclmulqdq rdpid
Virtualization features: 
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   3 MiB (96 instances)
  L1i:                   3 MiB (96 instances)
  L2:                    48 MiB (96 instances)
  L3:                    384 MiB (12 instances)
NUMA:                    
  NUMA node(s):          4
  NUMA node0 CPU(s):     0-23
  NUMA node1 CPU(s):     24-47
  NUMA node2 CPU(s):     48-71
  NUMA node3 CPU(s):     72-95
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
Login to the ec2 instance again, so that you have two windows logged into the machine.#

ssh -Y -i ~/your-pem.pem ubuntu@your-ip-address

Run 12US1 Listos Training 3 Day benchmark Case on 32 pe (this will take less than 2 minutes)#

./run_cctm_2018_12US1_listos_32pe.csh |& tee ./run_cctm_2018_12US1_listos_32pe.2nd.log

Successful output#
==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-08-05
End Day:   2018-08-07
Number of Simulation Days: 3
Domain Name:               2018_12Listos
Number of Grid Cells:      21875  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       32
   All times are in seconds.

Num  Day        Wall Time
01   2018-08-05   35.7
02   2018-08-06   35.2
03   2018-08-07   36.1
     Total Time = 107.00
      Avg. Time = 35.66
Download input data for 12NE3 1 day Benchmark case#

Instructions to copy data from the s3 bucket to the ec2 instance and run this benchmark.

cd /shared/pcluster-cmaq/

Examine the command line options that are used to download the data. Note that we can use the --no-sign-request option, as the data is available from the CMAS Open Data Warehouse on AWS.

cat s3_copy_12NE3_Bench.csh

Output

#!/bin/csh -f
#Script to download enough data to run START_DATE 201522 and END_DATE 201523 for 12km Northeast Domain
#Requires installing aws command line interface
#https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html#cliv2-linux-install
#Total storage required is 56 G

setenv AWS_REGION "us-east-1"

aws s3 cp --no-sign-request --recursive s3://cmas-cmaq/CMAQv5.4_2018_12NE3_Benchmark_2Day_Input /shared/data/
Use the aws s3 copy command to copy data from the CMAS Data Warehouse Open Data S3 bucket.#

./s3_copy_12NE3_Bench.csh
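
Optionally, you can list the bucket contents anonymously to confirm what will be downloaded; this uses the same open-data bucket path as the script above:

aws s3 ls --no-sign-request s3://cmas-cmaq/CMAQv5.4_2018_12NE3_Benchmark_2Day_Input/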

Edit the 12NE3 Benchmark run script to use the gcc compiler and to output all species to the CONC output file.#

vi run_cctm_Bench_2018_12NE3.c6a48xlarge.csh

change

   setenv compiler intel

to

   setenv compiler gcc

Comment out the CONC_SPCS setting that limits the CONC output to only 12 species:

   # setenv CONC_SPCS "O3 NO ANO3I ANO3J NO2 FORM ISOP NH3 ANH4I ANH4J ASO4I ASO4J" 
Run the 12NE3 Benchmark case#
./run_cctm_Bench_2018_12NE3.c6a48xlarge.csh |& tee ./run_cctm_Bench_2018_12NE3.c6a48xlarge.32pe.log
Successful output for 12 species output in the 3-D CONC file took 7.4 minutes to run 1 day#
==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-07-01
End Day:   2018-07-01
Number of Simulation Days: 1
Domain Name:               2018_12NE3
Number of Grid Cells:      367500  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       32
   All times are in seconds.

Num  Day        Wall Time
01   2018-07-01   445.19
     Total Time = 445.19
      Avg. Time = 445.19


Successful output for all species output in the 3-D CONC File (222 variables)#
==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-07-01
End Day:   2018-07-01
Number of Simulation Days: 1
Domain Name:               2018_12NE3
Number of Grid Cells:      367500  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       32
   All times are in seconds.

Num  Day        Wall Time
01   2018-07-01   444.34
     Total Time = 444.34
      Avg. Time = 444.34

To do: look into process pinning. (Will it make a difference on a single VM when using fewer than 96 cores?)

Compare to timings available in Table 3-1 Example of job scenarios at EPA for a single day simulation.

Domain                Domain size       Species Tracked    Input files size    Output files size    Run time (# cores)
2018 North East US    100 X 105 X 35    225                26GB                2GB                  15 min/day (32)
Run 12US1 2 day benchmark case on 96 processors#
./run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.12x8.ncclassic.csh |& tee ./run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.12x8.ncclassic.log
Verify that it is using 99% of each of the 96 cores using htop#

htop

Successful run timing#
==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2017-12-22
End Day:   2017-12-23
Number of Simulation Days: 2
Domain Name:               12US1
Number of Grid Cells:      4803435  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       96
   All times are in seconds.

Num  Day        Wall Time
01   2017-12-22   3070.4
02   2017-12-23   3386.7
     Total Time = 6457.10
      Avg. Time = 3228.55

Compare the timing to the output available in the CMAQ User Guide: Running CMAQ

Find the InstanceID using the following command on your local machine.#

aws ec2 describe-instances --region=us-east-1 | grep InstanceId

Output

i-xxxx

Stop the instance#

aws ec2 stop-instances --region=us-east-1 --instance-ids i-xxxx

Get the following error message.

aws ec2 stop-instances --region=us-east-1 --instance-ids i-041a702cc9f7f7b5d

An error occurred (UnsupportedOperation) when calling the StopInstances operation: You can’t stop the Spot Instance ‘i-041a702cc9f7f7b5d’ because it is associated with a one-time Spot Instance request. You can only stop Spot Instances associated with persistent Spot Instance requests.

Not sure how to make a persistent Spot Instance request.

Terminate Instance#

aws ec2 terminate-instances --region=us-east-1 --instance-ids i-xxxx
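
Terminating the instance is sufficient for a one-time Spot request. If you want to confirm that no Spot request remains open, the request can be inspected and, if necessary, cancelled (sir-xxxx is a placeholder for the request ID returned by the describe command):

# find the Spot request associated with the instance
aws ec2 describe-spot-instance-requests --region=us-east-1 --filters "Name=instance-id,Values=i-xxxx"

# cancel the Spot request if one is still open
aws ec2 cancel-spot-instance-requests --region=us-east-1 --spot-instance-request-ids sir-xxxx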

Try creating the gp3 version of the ami using the Nitro Hypervisor, and see if that improves the performance without the cost of the io1 volume.#

No - the Nitro hypervisor is already being used.

"Hypervisor": "xen" - according to the AWS documentation, this value is reported for both the Xen and Nitro hypervisors.

Try creating the gp3 AMI from the web interface and see whether you can reproduce the performance issues. If it performs well, then use the describe-instances command to see what is different between the AMI created from the web interface and the one created from the command line.

Create a Parallel Cluster and run CMAQv5.3.3#

Why might I need to use ParallelCluster?#

The AWS ParallelCluster may be configured to be the equivalent of a High Performance Computing (HPC) environment, including using job schedulers such as Slurm, running on multiple nodes using code compiled with Message Passing Interface (MPI), and reading and writing output to a high performance, low latency shared disk. The advantage of using the AWS ParallelCluster command line interface is that the compute nodes can be easily scaled up or down to match the compute requirements of a given simulation. In addition, the user can reduce costs by using Spot instances rather than On-Demand for the compute nodes. ParallelCluster also supports submitting multiple jobs to the job submission queue.

Our goal is to make this user guide to running CMAQ on a ParallelCluster as helpful and user-friendly as possible. Any feedback is both welcome and appreciated.

Additional information on AWS ParallelCluster:

AWS ParallelCluster documentation

AWS ParallelCluster training video

Introductory Tutorial#

Create a Demo cluster to configure your aws credentials and set up your identity and access management roles.

Introductory Tutorial

Step by Step Instructions to Build a Demo ParallelCluster.#

The goal is for users to get started and make sure they can spin up a node, launch the pcluster and terminate it.

Establish Identity and Permissions#
AWS Identity and Access Management Roles#

Requires the user to have AWS Identity and Access Management roles in AWS ParallelCluster

AWS ParallelCluster uses multiple AWS services to deploy and operate a cluster. See the complete list in the AWS Services used in AWS ParallelCluster section. It appears you can create the demo cluster, and even the intermediate or advanced cluster, but you can’t submit a slurm job and have it provision compute nodes until you have the IAM Policies set for your account. This likely requires the system administrator who has permissions to access the AWS Web Interface with root access to add these policies and then to attach them to each user account.

Use the AWS Web Interface to add a policy called AWSEC2SpotServiceRolePolicy to the account prior to running a job that uses spot pricing on the ParallelCluster.

AWS ParallelCluster CLI 3.0#

Use the AWS ParallelCluster Command Line Interface (CLI) v3.0 to configure and launch a demo cluster

Requires the user to have an EC2 key pair

Install AWS ParallelCluster Command Line Interface on your local machine#

Create a virtual environment on a linux machine to install aws-parallel cluster

python3 -m virtualenv ~/apc-ve
source ~/apc-ve/bin/activate
python --version
python3 -m pip install --upgrade aws-parallelcluster
pcluster version
Install node.js#
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.38.0/install.sh | bash
chmod ug+x ~/.nvm/nvm.sh
source ~/.nvm/nvm.sh
nvm install node
node --version
Verify that AWS ParallelCluster is installed on local machine#

Run pcluster version.

pcluster version

Output:

{
"version": "3.1.2"
}

Note

If you start a new terminal window, you need to re-activate the virtual environment using the following commands:

source ~/apc-ve/bin/activate
source ~/.nvm/nvm.sh

Verify that the parallel cluster is working using:

pcluster version

Configure AWS Command line credentials on your local machine#

aws configure
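
aws configure prompts for four values; a typical session looks roughly like the following (the access key values shown are placeholders for your own credentials):

AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Default region name [None]: us-east-1
Default output format [None]: json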

Configure a Demo Cluster#
To create a parallel cluster, a yaml file needs to be created that is unique to your account.#

An example of the yaml file contents is described in the following Diagram:

Figure 1. Diagram of YAML file used to configure a ParallelCluster with a t2.micro head node and t2.micro compute nodes

t2.micro yaml configuration

Create a yaml configuration file for the cluster following these instructions#

pcluster configure --config new-hello-world.yaml

Input the following answers at each prompt:

  1. Allowed values for AWS Region ID: us-east-1

  2. Allowed values for EC2 Key Pair Name: choose your key pair

  3. Allowed values for Scheduler: slurm

  4. Allowed values for Operating System: ubuntu2004

  5. Head node instance type: t2.micro

  6. Number of queues: 1

  7. Name of queue 1: queue1

  8. Number of compute resources for queue1 [1]: 1

  9. Compute instance type for compute resource 1 in queue1: t2.micro

  10. Maximum instance count [10]: 10

  11. Automate VPC creation?: y

  12. Allowed values for Availability Zone: 1

  13. Allowed values for Network Configuration: 2. Head node and compute fleet in the same public subnet

Beginning VPC creation. Please do not leave the terminal until the creation is finalized

Note

The choice of operating system (specified during the yaml creation, or in an existing yaml file) determines what modules and gcc compiler versions are available.

  1. Centos7 has an older gcc version 4

  2. Ubuntu2004 has gcc version 9+

  3. Alinux or Amazon Linux/Red Hat Linux (haven’t tried)

Examine the yaml file#

cat new-hello-world.yaml

Region: us-east-1
Image:
  Os: ubuntu2004
HeadNode:
  InstanceType: t2.micro
  Networking:
    SubnetId: subnet-xx-xx-xx                  <<< unique to your account
  Ssh:
    KeyName: your-key                          <<< unique to your account
Scheduling:
  Scheduler: slurm
  SlurmQueues:
  - Name: queue1
    ComputeResources:
    - Name: t2micro
      InstanceType: t2.micro
      MinCount: 0
      MaxCount: 10
    Networking:
      SubnetIds:
      - subnet-xx-xx-xx                        <<< unique to your account

Note

The above yaml file is the very simplest form available. If you upgrade the compute node to using a faster compute instance, then you will need to add additional configuration options (networking, elastic fabric adapter) to the yaml file. These modifications will be highlighted in the yaml figures provided in the tutorial.

The key pair and Subnetid in the yaml file are unique to your account. To create the AWS Intermediate ParallelCluster, the key pair and subnet ID from the new-hello-world.yaml file that you created using your account will need to be transferred to the Yaml files that will be used to create the Intermediate ParallelCluster in the next section of the tutorial. You will need to edit these yaml files to use the key pair and your Subnetid that are valid for your AWS Account.

Create a Demo Cluster#

pcluster create-cluster --cluster-configuration new-hello-world.yaml --cluster-name hello-pcluster --region us-east-1

Check on the status of the cluster#

pcluster describe-cluster --region=us-east-1 --cluster-name hello-pcluster

List available clusters#

pcluster list-clusters --region=us-east-1

Check on status of cluster again#

pcluster describe-cluster --region=us-east-1 --cluster-name hello-pcluster

After 5-10 minutes, you see the following status: “clusterStatus”: “CREATE_COMPLETE”

While the cluster has been created, only the t2.micro head node is running. Before any jobs can be submitted to the slurm queue, the compute nodes need to be started.

Note

The compute nodes are not “provisioned” or “created” at this time (so they do not begin to incur costs). The compute nodes are only provisioned when a slurm job is scheduled. After a slurm job is completed, then the compute nodes will be terminated after 5 minutes of idletime.

Login and Examine Cluster#
SSH into the cluster#

Note

Replace your-key.pem with your own key pair. You will need to change the permissions on your key pair so that it is readable only by the owner.

cd ~
chmod 400 your-key.pem

Example: pcluster ssh -v -Y -i ~/your-key.pem --cluster-name hello-pcluster

pcluster ssh -v -Y -i ~/[your-key-pair] --cluster-name hello-pcluster

login prompt should look something like (this will depend on what OS was chosen in the yaml file).

[ip-xx-x-xx-xxx pcluster-cmaq]

Check what modules are available on the ParallelCluster#

module avail

Check what version of the compiler is available#

gcc --version

Need a minimum of gcc 8+ for CMAQ

Check what version of openmpi is available#

mpirun --version

Need a minimum openmpi version 4.0.1 for CMAQ

Verify that Slurm is available (if slurm is not available, then you may need to try a different OS)#

which sbatch

Do not install software on this demo cluster#

The t2.micro head node is too small.

Save the key pair and SubnetId from this new-hello-world.yaml to use in the yaml for the Intermediate Tutorial

Exit the cluster#

exit

Delete the Demo Cluster#

pcluster delete-cluster --cluster-name hello-pcluster --region us-east-1

See also

pcluster --help

CMAQv5.3.3 Intermediate Tutorial#

Run CMAQ on a ParallelCluster using pre-loaded software and input data.

Intermediate Tutorial

Use ParallelCluster pre-installed with software and data.#

Step by step instructions for running the CMAQ 12US2 Benchmark for 2 days on a ParallelCluster.

Obtain YAML file pre-loaded with input data and software#
Choose a directory on your local machine to obtain a copy of the github repo.#

cd /your/local/machine/install/path/

Use a configuration file from the github by cloning the repo to your local machine#

git clone -b main https://github.com/CMASCenter/pcluster-cmaq.git pcluster-cmaq

cd pcluster-cmaq/yaml

Note

To find the default settings for Lustre see: Lustre Settings for ParallelCluster

Examine Diagram of the YAML file to build pre-installed software and input data.#

Includes Snapshot ID of volume pre-installed with CMAQ software stack and name of S3 Bucket to import data to the Lustre Filesystem

Figure 1. Diagram of YAML file used to configure a ParallelCluster with a c5n.large head node and c5n.18xlarge compute nodes with Software and Data Pre-installed (linked on lustre filesystem)

c5n-18xlarge Software+Data Pre-installed yaml configuration

Edit Yaml file#

This Yaml file specifies the /shared directory that contains the CMAQv5.3.3 software and libraries, and the input data that will be imported from an S3 bucket to the /fsx lustre file system. Note, the following yaml file uses a c5n.9xlarge compute node and SPOT pricing.

Note

Edit the c5n-9xlarge.ebs_unencrypted_installed_public_ubuntu2004.fsx_import_opendata.yaml file to specify your subnet-id and your keypair prior to creating the cluster

vi c5n-9xlarge.ebs_unencrypted_installed_public_ubuntu2004.fsx_import_opendata.yaml

Output:

Region: us-east-1
Image:
  Os: ubuntu2004
HeadNode:
  InstanceType: c5n.large
  Networking:
    SubnetId: subnet-xx-xx-xx                           <<< replace subnetID
  DisableSimultaneousMultithreading: true
  Ssh:
    KeyName: your-key                                   <<< replace keyname
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 5
  SlurmQueues:
    - Name: queue1
      CapacityType: SPOT
      Networking:
        SubnetIds:
          - subnet-xx-xx-xxx                            <<< replace subnetID
        PlacementGroup:
          Enabled: true
      ComputeResources:
        - Name: compute-resource-1
          InstanceType: c5n.9xlarge
          MinCount: 0
          MaxCount: 10
          DisableSimultaneousMultithreading: true
          Efa:
            Enabled: true
            GdrSupport: false
SharedStorage:
  - MountDir: /shared
    Name: ebs-shared
    StorageType: Ebs
    EbsSettings:
      SnapshotId: snap-017568d24a4cedc83
  - MountDir: /fsx
    Name: name2
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      ImportPath: s3://cmas-cmaq-conus2-benchmark/data/CMAQ_Modeling_Platform_2016/CONUS
Create CMAQ ParallelCluster with software/data pre-installed#

pcluster create-cluster --cluster-configuration c5n-9xlarge.ebs_unencrypted_installed_public_ubuntu2004.fsx_import_opendata.yaml --cluster-name cmaq --region us-east-1

Output:

{
  "cluster": {
    "clusterName": "cmaq",
    "cloudformationStackStatus": "CREATE_IN_PROGRESS",
    "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:440858712842:stack/cmaq/6cfb1a50-6e99-11ec-8af1-0ea2256597e5",
    "region": "us-east-1",
    "version": "3.0.2",
    "clusterStatus": "CREATE_IN_PROGRESS"
  }
}

Check status again

pcluster describe-cluster --region=us-east-1 --cluster-name cmaq

Output:

{
  "creationTime": "2022-01-06T02:36:18.119Z",
  "version": "3.0.2",
  "clusterConfiguration": {
    "url": "
  },
  "tags": [
    {
      "value": "3.0.2",
      "key": "parallelcluster:version"
    }
  ],
  "cloudFormationStackStatus": "CREATE_IN_PROGRESS",
  "clusterName": "cmaq",
  "computeFleetStatus": "UNKNOWN",
  "cloudformationStackArn": 
  "lastUpdatedTime": "2022-01-06T02:36:18.119Z",
  "region": "us-east-1",
  "clusterStatus": "CREATE_IN_PROGRESS"
}

After 5-10 minutes, check the status again and recheck until you see the following status: “clusterStatus”: “CREATE_COMPLETE”

Check status again

pcluster describe-cluster --region=us-east-1 --cluster-name cmaq

Output:

  "cloudFormationStackStatus": "CREATE_COMPLETE",
  "clusterName": "cmaq",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:440858712842:stack/cmaq/3cd2ba10-c18f-11ec-9f57-0e9b4dd12971",
  "lastUpdatedTime": "2022-04-21T16:22:28.879Z",
  "region": "us-east-1",
  "clusterStatus": "CREATE_COMPLETE"

Start the compute nodes, if the computeFleetStatus is not set to RUNNING

pcluster update-compute-fleet --region us-east-1 --cluster-name cmaq --status START_REQUESTED

Log into the new cluster#

Note

replace your-key.pem with your Key Name

pcluster ssh -v -Y -i ~/your-key.pem --cluster-name cmaq

Change shell to use tcsh#
sudo usermod -s /bin/tcsh ubuntu

Log out and then log back in to have the shell take effect.

Verify Software#

The software is pre-loaded on the /shared volume of the ParallelCluster. The software was previously loaded and saved to the snapshot.

ls /shared/build

Create a .cshrc file by copying it from the git repo that is on /shared/pcluster-cmaq

cp /shared/pcluster-cmaq/install/dot.cshrc.pcluster ~/.cshrc

Source shell

csh

Load the modules

module avail

Output:

------------------------------------------------------------ /usr/share/modules/modulefiles -------------------------------------------------------------
dot  libfabric-aws/1.13.2amzn1.0  module-git  module-info  modules  null  openmpi/4.1.1  use.own

Load the modules openmpi and libfabric

module load openmpi/4.1.1

module load libfabric-aws/1.13.2amzn1.0

Verify Input Data#

The input data was imported from the S3 bucket to the lustre file system (/fsx).

cd /fsx/data/CMAQ_Modeling_Platform_2016/CONUS/12US2/

Notice that the data doesn't take up much space; it is linked to the S3 objects rather than copied.

du -h

Output:

27K     ./land
33K     ./MCIP
28K     ./emissions/ptegu
55K     ./emissions/ptagfire
27K     ./emissions/ptnonipm
55K     ./emissions/ptfire_othna
27K     ./emissions/pt_oilgas
26K     ./emissions/inln_point/stack_groups
51K     ./emissions/inln_point
28K     ./emissions/cmv_c1c2_12
28K     ./emissions/cmv_c3_12
28K     ./emissions/othpt
55K     ./emissions/ptfire
407K    ./emissions
27K     ./icbc
518K    .

Change the group and ownership permissions on the /fsx/data directory

sudo chown ubuntu /fsx/data

sudo chgrp ubuntu /fsx/data

Create the output directory

mkdir -p /fsx/data/output

Examine CMAQ Run Scripts#

The run scripts are available in two locations, one in the CMAQ scripts directory.

Another copy is available in the pcluster-cmaq repo. Do a git pull to obtain the latest scripts in the pcluster-cmaq repo.

cd /shared/pcluster-cmaq

git pull

Verify that the run scripts are updated and pre-configured for the parallel cluster by comparing with what is available in the github repo

cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts

Example:

diff /shared/pcluster-cmaq/run_scripts/cmaq533/c5n.9xlarge/run_cctm_2016_12US2.108pe.6x18.pcluster.csh .

If a run script is missing or outdated, copy the run scripts from the repo. Note, there are different run scripts depending on what compute node is used. This tutorial assumes c5n.9xlarge is the compute node.

cp /shared/pcluster-cmaq/run_scripts/cmaq533/c5n.9xlarge/run*pcluster.csh /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/

Note

The time that it takes the 2 day CONUS benchmark to run will vary based on the number of CPUs used, and the compute node that is being used. See Figure 3 Benchmark Scaling Plot for c5n.18xlarge and c5n.9xlarge in chapter 11 for reference.

Examine how the run script is configured

head -n 30 /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctm_2016_12US2.108pe.6x18.pcluster.csh

#!/bin/csh -f
## For c5n.9xlarge (36 vcpu - 18 cpu)
## works with cluster-ubuntu.yaml
## data on /fsx directory
#SBATCH --nodes=6
#SBATCH --ntasks-per-node=18
#SBATCH --exclusive
#SBATCH -J CMAQ
#SBATCH -o /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctmv5.3.3_Bench_2016_12US2.108.9x12pe.2day.sharedvol.log
#SBATCH -e /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctmv5.3.3_Bench_2016_12US2.108.9x12pe.2day.sharedvol.log

Note

In this run script, slurm or SBATCH requests 6 nodes, each node with 18 pes, or 6x18 = 108 pes

Verify that the NPCOL and NPROW settings in the script are configured to match what is being requested in the SBATCH commands that tell slurm how many compute nodes to provision. In this case, to run CMAQ on 108 cpus (SBATCH --nodes=6 and --ntasks-per-node=18), use NPCOL=9 and NPROW=12.

grep NPCOL /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctm_2016_12US2.108pe.6x18.pcluster.csh

Output:

   setenv NPCOL_NPROW "1 1"; set NPROCS   = 1 # single processor setting
   @ NPCOL  =  9; @ NPROW = 12
   @ NPROCS = $NPCOL * $NPROW
   setenv NPCOL_NPROW "$NPCOL $NPROW"; 
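
In other words, NPCOL x NPROW must equal the total number of Slurm tasks (nodes x ntasks-per-node). A quick interactive check in csh, using the values from the run script above (this snippet is illustrative and not part of the repo):

@ SLURM_TASKS = 6 * 18
@ CMAQ_PES = 9 * 12
echo "Slurm tasks: $SLURM_TASKS   NPCOL x NPROW: $CMAQ_PES"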

Submit Job to Slurm Queue#

cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/

sbatch run_cctm_2016_12US2.108pe.6x18.pcluster.csh

Check status of run#

squeue

Output:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 1    queue1     CMAQ   ubuntu PD       0:00      6 (BeginTime)
Successfully started run#

squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 5    queue1     CMAQ   ubuntu  R      22:39      6 queue1-dy-compute-resource-1-[1-6]
Once the job is successfully running#

Check on the log file status

grep -i 'Processing completed.' CTM_LOG_001*_gcc_2016*

Output:

            Processing completed...    6.5 seconds
            Processing completed...    6.5 seconds
            Processing completed...    6.5 seconds
            Processing completed...    6.5 seconds
            Processing completed...    6.4 seconds

Once the job has completed running the two day benchmark check the log file for the timings.

tail -n 30 run_cctmv5.3.3_Bench_2016_12US2.108.9x12pe.2day.fsx_copied.log

Output:

==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2015-12-22
End Day:   2015-12-23
Number of Simulation Days: 2
Domain Name:               12US2
Number of Grid Cells:      3409560  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       108
   All times are in seconds.

Num  Day        Wall Time
01   2015-12-22   2421.19
02   2015-12-23   2144.16
     Total Time = 4565.35
      Avg. Time = 2282.67

Note

If you see the following message, you may want to submit a job that requires fewer PEs.

ip-10-0-5-165:/shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts% squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 1    queue1     CMAQ   ubuntu PD       0:00      6 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
If you repeatedly see that the job is not successfully provisioned, cancel the job.#

To cancel the job use the following command

scancel 1

Try submitting a smaller job to the queue.#

sbatch run_cctm_2016_12US2.90pe.5x18.pcluster.csh

Check status of run#

squeue

Check to view any errors in the log on the parallel cluster#

vi /var/log/parallelcluster/slurm_resume.log

An error occurred (MaxSpotInstanceCountExceeded) when calling the RunInstances operation: Max spot instance count exceeded

Note

If you encounter this error, you will need to submit a request to increase this spot instance limit using the AWS Website.

If the job will not run using SPOT pricing, then update the compute nodes to use ONDEMAND pricing#

To do this, exit the cluster, stop the compute nodes, then edit the yaml file to modify SPOT to ONDEMAND.

exit

On your local computer use the following command to stop the compute nodes

pcluster update-compute-fleet --region us-east-1 --cluster-name cmaq --status STOP_REQUESTED

Edit the yaml file to modify SPOT to ONDEMAND, then update the cluster using the following command:

pcluster update-cluster --region us-east-1 --cluster-name cmaq --cluster-configuration  c5n-18xlarge.ebs_unencrypted_installed_public_ubuntu2004.fsx_import_opendata.yaml

Output:

{
  "cluster": {
    "clusterName": "cmaq",
    "cloudformationStackStatus": "UPDATE_IN_PROGRESS",
    "cloudformationStackArn": "xx-xxx-xx",
    "region": "us-east-1",
    "version": "3.1.1",
    "clusterStatus": "UPDATE_IN_PROGRESS"
  },
  "changeSet": [
    {
      "parameter": "Scheduling.SlurmQueues[queue1].CapacityType",
      "requestedValue": "ONDEMAND",
      "currentValue": "SPOT"                                      <<<  Modify to use ONDEMAND
    }
  ]
}

Check status of updated cluster

pcluster describe-cluster --region=us-east-1 --cluster-name cmaq

Output:

"clusterStatus": "UPDATE_IN_PROGRESS"

once you see

  "clusterStatus": "UPDATE_COMPLETE"

Restart the compute nodes

pcluster update-compute-fleet --region us-east-1 --cluster-name cmaq --status START_REQUESTED

Verify that compute nodes have started

pcluster describe-cluster --region=us-east-1 --cluster-name cmaq

Output:

 "computeFleetStatus": "RUNNING",

Re-login to the cluster

pcluster ssh -v -Y -i ~/your-key.pem --cluster-name cmaq

Submit a new job using the updated ondemand compute nodes#

sbatch run_cctm_2016_12US2.180pe.5x36.pcluster.csh

Note

If you still have difficulty running a job in the slurm queue, there may be other issues that need to be resolved.

Verify that your IAM Policy has been created for your account.

Someone with administrative permissions should enable the Spot Instances IAM Policy: AWSEC2SpotServiceRolePolicy

An alternative way to enable this policy is to log in to the EC2 website and launch a Spot Instance. The service policy will be created automatically and can then be used by ParallelCluster.

Submit a 72 pe job 2 nodes x 36 cpus#

sbatch run_cctm_2016_12US2.72pe.2x36.pcluster.csh

grep -i 'Processing completed.' CTM_LOG_036.v533_gcc_2016_CONUS_6x12pe_20151223

Output:

 Processing completed...    9.0 seconds
            Processing completed...   12.0 seconds
            Processing completed...   11.2 seconds
            Processing completed...    9.0 seconds
            Processing completed...    9.1 seconds

tail -n 20 run_cctmv5.3.3_Bench_2016_12US2.72.6x12pe.2day.pcluster.log

Output:

==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2015-12-22
End Day:   2015-12-23
Number of Simulation Days: 2
Domain Name:               12US2
Number of Grid Cells:      3409560  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       72
   All times are in seconds.

Num  Day        Wall Time
01   2015-12-22   3562.50
02   2015-12-23   3151.21
     Total Time = 6713.71
      Avg. Time = 3356.85
Submit a minimum of 2 benchmark runs#

Ideally, two CMAQ runs should be submitted to the slurm queue, using two different NPCOLxNPROW configurations, to create output needed for the QA and Post Processing Sections in Chapter 10.
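
For example, assuming both run scripts referenced in this chapter have been copied into the scripts directory, the two configurations could be submitted back to back:

cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/
sbatch run_cctm_2016_12US2.108pe.6x18.pcluster.csh
sbatch run_cctm_2016_12US2.72pe.2x36.pcluster.csh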

CMAQv5.3.3 Parallel Cluster Benchmark on HPC6a-48xlarge with EBS and Lustre (optional)#

Run CMAQv5.3.3 on a ParallelCluster using pre-loaded software and input data on EBS and Lustre using HPC6a-48xlarge Parallel Cluster.

CMAQv5.3.3 CONUS 2 Benchmark Tutorial using 12US2 Domain

Use ParallelCluster pre-installed with CMAQv5.3.3 software and 12US2 Benchmark#

Step by step instructions for running the CMAQ 12US2 Benchmark for 2 days on a ParallelCluster.

Obtain YAML file pre-loaded with input data and software#
Choose a directory on your local machine to obtain a copy of the github repo.#

cd /your/local/machine/install/path/

Use a configuration file from the github by cloning the repo to your local machine#

git clone -b main https://github.com/CMASCenter/pcluster-cmaq.git pcluster-cmaq

cd pcluster-cmaq/yaml

Note

To find the default settings for Lustre see: Lustre Settings for ParallelCluster

Examine Diagram of the YAML file to build pre-installed software and input data.#

Includes Snapshot ID of volume pre-installed with CMAQ software stack and name of S3 Bucket to import data to the Lustre Filesystem

Figure 1. Diagram of YAML file used to configure a ParallelCluster with a c5n.large head node and c5n.18xlarge compute nodes with Software and Data Pre-installed (linked on lustre filesystem)

hpc6a-48xlarge Software+Data Pre-installed yaml configuration

Edit Yaml file#

This Yaml file specifies the /shared directory that contains the CMAQv5.3.3 software and libraries, and the input data that will be imported from an S3 bucket to the /fsx lustre file system. Note, the following yaml file uses an hpc6a.48xlarge compute node and ONDEMAND pricing.

Note

Edit the hpc6a.48xlarge.ebs_unencrypted_installed_public_ubuntu2004.ebs_200.fsx_import_east-2b.yaml file to specify your subnet-id and your keypair prior to creating the cluster. To obtain the subnet ID, you will need to run pcluster configure.

pcluster configure -r us-east-2 --config hpc6a.48xlarge.ebs.fsx.us-east-2.yaml

Example of the answers that were used to create the yaml for this benchmark:

Allowed values for EC2 Key Pair Name:
1. xxx-xxx
2. xxx-xxx-xxx
EC2 Key Pair Name [xxx-xxx]: 1
Allowed values for Scheduler:
1. slurm
2. awsbatch
Scheduler [slurm]: 1
Allowed values for Operating System:
1. alinux2
2. centos7
3. ubuntu1804
4. ubuntu2004
Operating System [alinux2]: 4
Head node instance type [t2.micro]: c6a.xlarge
Number of queues [1]: 
Name of queue 1 [queue1]: 
Number of compute resources for queue1 [1]: 1
Compute instance type for compute resource 1 in queue1 [t2.micro]: hpc6a.48xlarge
The EC2 instance selected supports enhanced networking capabilities using Elastic Fabric Adapter (EFA). EFA enables you to run applications requiring high levels of inter-node communications at scale on AWS at no additional charge (https://docs.aws.amazon.com/parallelcluster/latest/ug/efa-v3.html).
Enable EFA on hpc6a.48xlarge (y/n) [y]: y
Maximum instance count [10]: 
Enabling EFA requires compute instances to be placed within a Placement Group. Please specify an existing Placement Group name or leave it blank for ParallelCluster to create one.
Placement Group name []: 
Automate VPC creation? (y/n) [n]: y
Allowed values for Availability Zone:
1. us-east-2b
Availability Zone [us-east-2b]: 
Allowed values for Network Configuration:
1. Head node in a public subnet and compute fleet in a private subnet
2. Head node and compute fleet in the same public subnet
Network Configuration [Head node in a public subnet and compute fleet in a private subnet]: 2
Beginning VPC creation. Please do not leave the terminal until the creation is finalized
Creating CloudFormation stack...
Do not leave the terminal until the process has finished.
Status: parallelclusternetworking-pub-20230123170628 - CREATE_COMPLETE          
The stack has been created.
Configuration file written to hpc6a.48xlarge.ebs.fsx.us-east-2.yaml
You can edit your configuration file or simply run 'pcluster create-cluster --cluster-configuration hpc6a.48xlarge.ebs.fsx.us-east-2.yaml --cluster-name cluster-name --region us-east-2' to create your cluster.

vi hpc6a.48xlarge.ebs_unencrypted_installed_public_ubuntu2004.ebs_200.fsx_import_east-2b.yaml

Output:

Region: us-east-2
Image:
  Os: ubuntu2004
HeadNode:
  InstanceType: c6a.xlarge
  Networking:
    SubnetId: subnet-xx-xx-xx                           <<< replace subnetID
  DisableSimultaneousMultithreading: true
  Ssh:
    KeyName: your-key                                   <<< replace keyname
  LocalStorage:
    RootVolume:
      Encrypted: false
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 5
  SlurmQueues:
    - Name: queue1
      CapacityType: ONDEMAND 
      Networking:
        SubnetIds:
          - subnet-xx-xx-xxx                            <<< replace subnetID
        PlacementGroup:
          Enabled: true
      ComputeResources:
        - Name: compute-resource-1
          InstanceType: hpc6a.48xlarge
          MinCount: 0
          MaxCount: 10
          DisableSimultaneousMultithreading: true
          Efa:
            Enabled: true
            GdrSupport: false
SharedStorage:
  - MountDir: /shared
    Name: ebs-shared
    StorageType: Ebs
    EbsSettings:
      VolumeType: gp3
      Size: 500
      Encrypted: false
      SnapshotId: snap-0f9592e0ea1749b5b
  - MountDir: /fsx
    Name: name2
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      ImportPath: s3://cmas-cmaq-conus2-benchmark
Create CMAQ ParallelCluster with software/data pre-installed#

pcluster create-cluster --cluster-configuration hpc6a.48xlarge.ebs_unencrypted_installed_public_ubuntu2004.ebs_200.fsx_import_east-2b.yaml --cluster-name cmaq --region us-east-2

Output:

{
  "cluster": {
    "clusterName": "cmaq",
    "cloudformationStackStatus": "CREATE_IN_PROGRESS",
    "cloudformationStackArn": "arn:aws:cloudformation:us-east-2:440858712842:stack/cmaq/6cfb1a50-6e99-11ec-8af1-0ea2256597e5",
    "region": "us-east-2",
    "version": "3.0.2",
    "clusterStatus": "CREATE_IN_PROGRESS"
  }
}

Check status again

pcluster describe-cluster --region=us-east-2 --cluster-name cmaq

Output:

{
  "creationTime": "2022-01-06T02:36:18.119Z",
  "version": "3.0.2",
  "clusterConfiguration": {
    "url": "
  },
  "tags": [
    {
      "value": "3.0.2",
      "key": "parallelcluster:version"
    }
  ],
  "cloudFormationStackStatus": "CREATE_IN_PROGRESS",
  "clusterName": "cmaq",
  "computeFleetStatus": "UNKNOWN",
  "cloudformationStackArn": 
  "lastUpdatedTime": "2022-01-06T02:36:18.119Z",
  "region": "us-east-2",
  "clusterStatus": "CREATE_IN_PROGRESS"
}

Note, the snapshot used is smaller than the EBS volume requested in the yaml file; therefore you will get a warning from ParallelCluster:

pcluster create-cluster --cluster-configuration hpc6a.48xlarge.ebs_unencrypted_installed_public_ubuntu2004.ebs_200.fsx_import_east-2b.yaml --cluster-name cmaq --region us-east-2
{
  "cluster": {
    "clusterName": "cmaq",
    "cloudformationStackStatus": "CREATE_IN_PROGRESS",
    "cloudformationStackArn": "arn:aws:cloudformation:us-east-2:440858712842:stack/cmaq/276abf10-94fc-11ed-885c-02032a236214",
    "region": "us-east-2",
    "version": "3.1.2",
    "clusterStatus": "CREATE_IN_PROGRESS"
  },
  "validationMessages": [
    {
      "level": "WARNING",
      "type": "EbsVolumeSizeSnapshotValidator",
      "message": "The specified volume size is larger than snapshot size. In order to use the full capacity of the volume, you'll need to manually resize the partition according to this doc: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html"
    }
  ]
}

After 5-10 minutes, check the status again and recheck until you see the following status: “clusterStatus”: “CREATE_COMPLETE”

Check status again

pcluster describe-cluster --region=us-east-2 --cluster-name cmaq

Output:

  "cloudFormationStackStatus": "CREATE_COMPLETE",
  "clusterName": "cmaq",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:440858712842:stack/cmaq/3cd2ba10-c18f-11ec-9f57-0e9b4dd12971",
  "lastUpdatedTime": "2022-04-21T16:22:28.879Z",
  "region": "us-east-2",
  "clusterStatus": "CREATE_COMPLETE"

Start the compute nodes, if the computeFleetStatus is not set to RUNNING

pcluster update-compute-fleet --region us-east-2 --cluster-name cmaq --status START_REQUESTED

Log into the new cluster#

Note

replace your-key.pem with your Key Name

pcluster ssh -v -Y -i ~/your-key.pem --region=us-east-2 --cluster-name cmaq

Resize the EBS Volume#

To resize the EBS volume, you will need to login to the cluster and then run the following command:

sudo resize2fs /dev/nvme1n1

output:

resize2fs 1.45.5 (07-Jan-2020)
Filesystem at /dev/nvme1n1 is mounted on /shared; on-line resizing required
old_desc_blocks = 5, new_desc_blocks = 63
The filesystem on /dev/nvme1n1 is now 131072000 (4k) blocks long.
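
To confirm that /shared now reflects the full size of the EBS volume requested in the yaml file, check the mounted filesystem:

df -h /shared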
Change shell to use tcsh#
sudo usermod -s /bin/tcsh ubuntu

Log out and then log back in to have the shell take effect.

Verify Software#

The software is pre-loaded on the /shared volume of the ParallelCluster. The software was previously loaded and saved to the snapshot.

ls /shared/build

Create a .cshrc file by copying it from the git repo that is on /shared/pcluster-cmaq

cp /shared/pcluster-cmaq/install/dot.cshrc.pcluster ~/.cshrc

Source shell

csh

Load the modules

module avail

Output:

------------------------------------------------------------ /usr/share/modules/modulefiles ------------------------------------------------------------
dot  libfabric-aws/1.16.1amzn1.0  module-git  module-info  modules  null  openmpi/4.1.4  use.own  

--------------------------------------------------------- /opt/intel/mpi/2021.6.0/modulefiles ----------------------------------------------------------
intelmpi  

Load the modules openmpi and libfabric

module load openmpi/4.1.4

module load libfabric-aws/1.16.1amzn1.0

Verify Input Data#

The input data was imported from the S3 bucket to the lustre file system (/fsx).

cd /fsx/data/CMAQ_Modeling_Platform_2016/CONUS/12US2/

Notice that the data doesn't take up much space: only the file objects are listed, and the datasets will not be loaded to the /fsx volume until they are accessed, either by the run scripts or by using the touch command.

Note

More information about enhanced s3 integration for Lustre see: Enhanced S3 integration with lustre

du -h

Output:

27K     ./land
33K     ./MCIP
28K     ./emissions/ptegu
55K     ./emissions/ptagfire
27K     ./emissions/ptnonipm
55K     ./emissions/ptfire_othna
27K     ./emissions/pt_oilgas
26K     ./emissions/inln_point/stack_groups
51K     ./emissions/inln_point
28K     ./emissions/cmv_c1c2_12
28K     ./emissions/cmv_c3_12
28K     ./emissions/othpt
55K     ./emissions/ptfire
407K    ./emissions
27K     ./icbc
518K    .

Change the group and ownership permissions on the /fsx/data directory

sudo chown ubuntu /fsx/data

sudo chgrp ubuntu /fsx/data

Create the output directory

mkdir -p /fsx/data/output

Examine CMAQ Run Scripts#

The run scripts are available in two locations, one in the CMAQ scripts directory.

Another copy is available in the pcluster-cmaq repo. Do a git pull to obtain the latest scripts in the pcluster-cmaq repo.

cd /shared/pcluster-cmaq

git pull

Copy the run scripts from the repo. Note, there are different run scripts depending on what compute node is used. This tutorial assumes hpc6a-48xlarge is the compute node.

cp /shared/pcluster-cmaq/run_scripts/hpc6a_shared/*.pin.codemod.csh /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/

Note

The time that it takes the 2 day CONUS benchmark to run will vary based on the number of CPUs used, the compute node that is being used, and what disks are used for the I/O (EBS or Lustre). See the Benchmark Scaling Plot for hpc6a-48xlarge on /fsx and /shared.

Examine how the run script is configured

head -n 30 /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctm_2016_12US2.576pe.6x96.24x24.pcluster.hpc6a.48xlarge.fsx.pin.codemod.csh

#!/bin/csh -f
## For hpc6a.48xlarge (96 cpu)
## works with cluster-ubuntu.yaml
## data on /fsx directory
#SBATCH --nodes=6
#SBATCH --ntasks-per-node=96
#SBATCH --exclusive
#SBATCH -J CMAQ
#SBATCH -o /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctmv5.3.3_Bench_2016_12US2.hpc6a.48xlarge.576.6x96.24x24pe.2day.pcluster.fsx.pin.codemod.log
#SBATCH -e /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctmv5.3.3_Bench_2016_12US2.hpc6a.48xlarge.576.6x96.24x24pe.2day.pcluster.fsx.pin.codemod.log

Note

In this run script, Slurm (via the SBATCH directives) requests 6 nodes with 96 processors each, or 6 x 96 = 576 processors.

Verify that the NPCOL and NPROW settings in the script match what is being requested in the SBATCH directives that tell Slurm how many compute nodes to provision. In this case, to run CMAQ on 576 cpus (SBATCH --nodes=6 and --ntasks-per-node=96), use NPCOL=24 and NPROW=24.

grep NPCOL /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctm_2016_12US2.576pe.6x96.24x24.pcluster.hpc6a.48xlarge.fsx.pin.codemod.csh

Output:

   setenv NPCOL_NPROW "1 1"; set NPROCS   = 1 # single processor setting
   @ NPCOL  =  24; @ NPROW = 24
   @ NPROCS = $NPCOL * $NPROW
   setenv NPCOL_NPROW "$NPCOL $NPROW"; 
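As a sanity check, the decomposition arithmetic can be scripted. A minimal csh sketch (the node and task counts below are the ones used in this run script; this snippet is not part of the CMAQ scripts themselves):

```csh
#!/bin/csh -f
# Check that NPCOL x NPROW matches the number of MPI tasks Slurm will provide
set NODES = 6
set TASKS_PER_NODE = 96
@ NPCOL = 24
@ NPROW = 24
@ NPROCS = $NPCOL * $NPROW
@ REQUESTED = $NODES * $TASKS_PER_NODE
if ( $NPROCS != $REQUESTED ) then
   echo "Mismatch: NPCOL x NPROW = $NPROCS but Slurm will provide $REQUESTED tasks"
else
   echo "OK: $NPROCS tasks"
endif
```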

To run on the EBS Volume a code modification is required.#

Note, we will use this modification when running on both lustre and EBS.

Copy the BLD directory with a code modification to wr_conc.F and wr_aconc.F to your directory.

cp -rp /shared/pcluster-cmaq/run_scripts/BLD_CCTM_v533_gcc_codemod /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/

Build the code by running the makefile#

cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/BLD_CCTM_v533_gcc_codemod

Check to see you have the modules loaded

module list

1) openmpi/4.1.4   2) libfabric-aws/1.16.1amzn1.0

Run the Make command

make

Verify that the executable has been created

ls -lrt CCTM_v533.exe

Submit Job to Slurm Queue to run CMAQ on Lustre#

cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/

sbatch run_cctm_2016_12US2.576pe.6x96.24x24.pcluster.hpc6a.48xlarge.fsx.pin.codemod.csh

Check status of run#

squeue

Output:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 1    queue1     CMAQ   ubuntu PD       0:00      6 (BeginTime)
Successfully started run#

squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 5    queue1     CMAQ   ubuntu  R      22:39      6 queue1-dy-compute-resource-1-[1-6]
Once the job is successfully running#

Check on the log file status

grep -i 'Processing completed.' CTM_LOG_001*_gcc_2016*

Output:

            Processing completed...    6.5 seconds
            Processing completed...    6.5 seconds
            Processing completed...    6.5 seconds
            Processing completed...    6.5 seconds
            Processing completed...    6.4 seconds
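To confirm that the processors finished cleanly, you can also count the completion lines per log file (a quick sketch; the exact log file names depend on your run):

```csh
# grep -c prints a per-file count of 'Processing completed' lines
grep -c 'Processing completed' CTM_LOG_001*_gcc_2016*
```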

Once the job has completed running the two day benchmark, check the log file for the timings.

tail -n 5 run_cctmv5.3.3_Bench_2016_12US2.hpc6a.48xlarge.576.6x96.24x24pe.2day.pcluster.fsx.pin.codemod.2.log

Output:

Num  Day        Wall Time
01   2015-12-22   1028.33
02   2015-12-23   916.31
     Total Time = 1944.64
      Avg. Time = 972.32
Submit a run script to run on the EBS volume#

To run on the EBS volume, copy the input data from the S3 bucket to the /shared volume. Don’t copy directly from the /fsx volume: the S3 copy script below copies only the two days’ worth of data needed for the benchmark, whereas copying from the /fsx directory would pull in every file imported from the S3 bucket.

cd /shared/pcluster-cmaq/s3_scripts
./s3_copy_nosign_conus_cmas_opendata_to_shared.csh
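For reference, the script wraps anonymous (no-sign-request) AWS CLI copies. A hedged sketch of the kind of call it makes is shown below; the bucket name comes from the FSx import path used elsewhere in this tutorial, while the key prefix and destination directory are assumptions, so check the script itself for the real paths:

```csh
# Copy one input subdirectory from the public S3 bucket to the EBS /shared volume (sketch)
aws s3 cp --no-sign-request --recursive \
  s3://cmas-cmaq-conus2-benchmark/data/CMAQ_Modeling_Platform_2016/CONUS/12US2/MCIP \
  /shared/data/CMAQ_Modeling_Platform_2016/CONUS/12US2/MCIP
```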
Modify the YAML and then Update the ParallelCluster#

Note: not all settings in the YAML file can be updated; for some settings, such as using a different snapshot, you will need to terminate this cluster and create a new one.

If you want to edit the YAML file to update a setting such as the maximum number of compute nodes available, use the following command to stop the compute fleet:

pcluster update-compute-fleet --region us-east-2 --cluster-name cmaq --status STOP_REQUESTED

Edit the yaml file to modify MaxCount under ComputeResources, then update the cluster using the following command:

pcluster update-cluster --region us-east-2 --cluster-name cmaq --cluster-configuration  hpc6a.48xlarge.ebs_unencrypted_installed_public_ubuntu2004.ebs_200.fsx_import_east-2b.yaml

Output:

{
  "cluster": {
    "clusterName": "cmaq",
    "cloudformationStackStatus": "UPDATE_IN_PROGRESS",
    "cloudformationStackArn": "xx-xxx-xx",
    "region": "us-east-2",
    "version": "3.1.1",
    "clusterStatus": "UPDATE_IN_PROGRESS"
  },
    "changeSet": [
    {
      "parameter": "Scheduling.SlurmQueues[queue1].ComputeResources[compute-resource-1].MaxCount",
      "requestedValue": 15,
      "currentValue": 10
    }
  ]
}

Check status of updated cluster

pcluster describe-cluster --region=us-east-2 --cluster-name cmaq

Output:

"clusterStatus": "UPDATE_IN_PROGRESS"

Once you see output similar to the following, the update is complete:

{
  "clusterName": "cmaq",
  "computeFleetStatus": "STOPPED",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-2:440858712842:stack/cmaq2/d68e5180-9698-11ed-b06c-06cfae76125a",
  "lastUpdatedTime": "2023-01-23T14:39:44.670Z",
  "region": "us-east-2",
  "clusterStatus": "UPDATE_COMPLETE"
}

Restart the compute nodes

pcluster update-compute-fleet --region us-east-2 --cluster-name cmaq --status START_REQUESTED

Verify that compute nodes have started

pcluster describe-cluster --region=us-east-2 --cluster-name cmaq

Output:

 "computeFleetStatus": "RUNNING",

Re-login to the cluster

pcluster ssh -v -Y -i ~/your-key.pem --region=us-east-2 --cluster-name cmaq

Submit a new job using the updated compute nodes#

sbatch run_cctm_2016_12US2.576pe.6x96.24x24.pcluster.hpc6a.48xlarge.fsx.pin.codemod.csh

Note

If you still have difficulty running a job in the slurm queue, there may be other issues that need to be resolved.

Submit a 576 pe job 6 nodes x 96 cpus on the EBS volume /shared#

sbatch run_cctm_2016_12US2.576pe.6x96.24x24.pcluster.hpc6a.48xlarge.shared.pin.csh

grep -i 'Processing completed.' CTM_LOG_036.v533_gcc_2016_CONUS_6x12pe_20151223

Output:

            Processing completed...    5.1 seconds
            Processing completed...    2.0 seconds
            Processing completed...    2.0 seconds
            Processing completed...    1.9 seconds
            Processing completed...    1.9 seconds
            Processing completed...    2.0 seconds
            Processing completed...    2.0 seconds
            Processing completed...    1.9 seconds

Check the timing report at the end of the log file for the run on the /shared EBS volume:

tail -n 18

Output:

==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2015-12-22
End Day:   2015-12-23
Number of Simulation Days: 2
Domain Name:               12US2
Number of Grid Cells:      3409560  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       576
   All times are in seconds.

Num  Day        Wall Time
01   2015-12-22   1043.09
02   2015-12-23   932.98
     Total Time = 1976.07
      Avg. Time = 988.03

Submit a minimum of 2 benchmark runs#

Ideally, two CMAQ runs should be submitted to the slurm queue, using two different NPCOLxNPROW configurations, to create output needed for the QA and Post Processing Sections in Chapter 6.

Upgrade the pcluster version to try the Persistent 2 Lustre file system#

Within the Python virtual environment used for the pcluster CLI, upgrade pip and the aws-parallelcluster package:

python3 -m pip install --upgrade pip
python3 -m pip install --upgrade "aws-parallelcluster"

Create a new configuration file

pcluster configure -r us-east-2 --config hpc6a.48xlarge.ebs.fsx.us-east-2.yaml

If cluster creation fails, you will get a CREATE_FAILED error message.

Query the stack formation log messages#
pcluster get-cluster-stack-events --cluster-name cmaq2 --region us-east-2  --query 'events[?resourceStatus==`CREATE_FAILED`]'

Output

    "eventId": "FSX39ea84acf1fef629-CREATE_FAILED-2023-01-23T17:14:19.869Z",
    "physicalResourceId": "",
    "resourceStatus": "CREATE_FAILED",
    "resourceStatusReason": "Linking a Persistent 2 file system to an S3 bucket using the LustreConfiguraton is not supported. Create a file system and then create a data repository association to link S3 buckets to the file system. For more details, visit https://docs.aws.amazon.com/fsx/latest/LustreGuide/create-dra-linked-data-repo.html (Service: AmazonFSx; Status Code: 400; Error Code: BadRequest; Request ID: dd4df24a-0eed-4e94-8205-a9d5a9605aae; Proxy: null)",
    "resourceProperties": "{\"FileSystemTypeVersion\":\"2.12\",\"StorageCapacity\":\"1200\",\"FileSystemType\":\"LUSTRE\",\"LustreConfiguration\":{\"ImportPath\":\"s3://cmas-cmaq-conus2-benchmark\",\"DeploymentType\":\"PERSISTENT_2\",\"PerUnitStorageThroughput\":\"1000\"},\"SecurityGroupIds\":[\"sg-00ab9ad20ea71b395\"],\"SubnetIds\":[\"subnet-02800a67052ad340a\"],\"Tags\":[{\"Value\":\"name2\",\"Key\":\"Name\"}]}",
    "stackId": "arn:aws:cloudformation:us-east-2:440858712842:stack/cmaq2/561cc920-9b41-11ed-a8d2-0a9db28fc6a2",
    "stackName": "cmaq2",
    "logicalResourceId": "FSX39ea84acf1fef629",
    "resourceType": "AWS::FSx::FileSystem",
    "timestamp": "2023-01-23T17:14:19.869Z"

It is not yet clear what the best way is to set the VPC and security groups: whether to match the ParallelCluster settings, or, since the ParallelCluster failed to build with the Persistent 2 Lustre settings, to create a new VPC and modify the YAML so that the ParallelCluster uses the VPC settings established when creating the Lustre file system.
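Following the guidance in the error message above, a Persistent 2 file system has to be created first and then linked to the S3 bucket with a data repository association. A hedged sketch of that CLI call is shown below; the file system ID and file system path are placeholders:

```csh
# Link an existing Persistent 2 FSx for Lustre file system to the benchmark bucket (sketch)
aws fsx create-data-repository-association \
  --region us-east-2 \
  --file-system-id fs-0123456789abcdef0 \
  --file-system-path /12US2 \
  --data-repository-path s3://cmas-cmaq-conus2-benchmark
```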

Performance and Cost Optimization#

This section provides timing information and scaling plots to assist users in optimizing the performance of their ParallelCluster.

Performance Optimization

Right-sizing Compute Nodes for the ParallelCluster Configuration#

Selection of the compute nodes depends on the domain size and resolution of the CMAQ case and on your model run time requirements. Larger hardware and memory configurations may also be required for instrumented versions of CMAQ, including CMAQ-ISAM and CMAQ-DDM3D. The ParallelCluster allows you to run the compute nodes only as long as the job requires, and you can also update the compute nodes as needed for your domain.

An explanation of why a scaling analysis is required for Multinode or Parallel MPI Codes#

Quote from the following link.

“IMPORTANT: The optimal value of –nodes and –ntasks for a parallel code must be determined empirically by conducting a scaling analysis. As these quantities increase, the parallel efficiency tends to decrease. The parallel efficiency is the serial execution time divided by the product of the parallel execution time and the number of tasks. If multiple nodes are used then in most cases one should try to use all of the CPU-cores on each node.”

Note

For the scaling analysis that was performed with CMAQ, the parallel efficiency was computed as the runtime for the smallest number of CPUs divided by the product of the parallel execution time and the ratio of CPUs used. For example, if the smallest NPCOLxNPROW configuration used 18 cpus, the runtime for that case is the reference, and the parallel efficiency for the case using 36 cpus is: parallel efficiency = runtime_18cpu / (runtime_36cpu * 2) * 100
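A worked example using the Table 2 timings below: the 36-cpu (1 node) c5n.18xlarge run took 12548.19 seconds and the 180-cpu (5 node) run took 2980.19 seconds, so:

```csh
# parallel efficiency (%) = runtime_small / (runtime_large * cpu_ratio) * 100
echo "12548.19 / (2980.19 * 5) * 100" | bc -l
# prints ~84.2, consistent with the ~84% scaling efficiency quoted in the scaling plot discussion
```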

Slurm Compute Node Provisioning#

AWS ParallelCluster relies on Slurm to make the job allocation and scaling decisions. Jobs are launched and terminated, and resources are maintained, according to the Slurm instructions in the CMAQ run script. The YAML file for the ParallelCluster sets the instance type of the head node and the compute nodes, and the maximum number of compute nodes that can be provisioned for the queue. The head node can’t be updated after a cluster is created; the compute nodes and the maximum number of compute nodes can be updated after a cluster is created.

The number of compute nodes dispatched by the Slurm scheduler is specified in the run script using #SBATCH --nodes=XX and #SBATCH --ntasks-per-node=YY, where the maximum value of YY (tasks per node) is limited by how many CPUs are on the compute node.

As an example:

For c5n.18xlarge, there are 36 CPUs/node, so the maximum value of YY is 36, i.e. --ntasks-per-node=36.

If running a job with 180 processors, this requires --nodes=5, since 36 x 5 = 180.

The NPCOLxNPROW setting must also multiply to 180 (e.g. 18 x 10 or 10 x 18) to use all of the CPUs in the ParallelCluster.

For c5n.9xlarge, there are 18 CPUs/node, so the maximum value of YY is 18, i.e. --ntasks-per-node=18.

If running a job with 180 processors, this requires --nodes=10, since 18 x 10 = 180.
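Putting the two pieces together, a sketch of the relevant run script settings for the 180-processor c5n.18xlarge example above (directives only, not a complete run script):

```csh
#!/bin/csh -f
## 180 cpus = 5 nodes x 36 cpus/node on c5n.18xlarge
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=36
#SBATCH --exclusive
# NPCOL x NPROW must multiply to 180, e.g. 18 x 10
@ NPCOL = 18; @ NPROW = 10
setenv NPCOL_NPROW "$NPCOL $NPROW"
```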

Note

If you submit a Slurm job requesting more nodes than are available in the region, the squeue command will show the following message under NODELIST(REASON): (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partition). In the scaling tables below, this is indicated as “Unable to provision”.

See also

C5n Instance

Quoted from the above link:

“Each vCPU is a hardware hyperthread on the Intel Xeon Platinum 8000 series processor. You get full control over the C-states on the two largest sizes, allowing you to run a single core at up to 3.5 Ghz using Intel Turbo Boost Technology. The C5n instances also feature a higher amount of memory per core, putting them in the current “sweet spot” for HPC applications that work most efficiently when there’s at least 4 GiB of memory for each core. The instances also benefit from some internal improvements that boost memory access speed by up to 19% in comparison to the C5 and C5d instances. The C5n instances incorporate the fourth generation of our custom Nitro hardware, allowing the high-end instances to provide up to 100 Gbps of network throughput, along with a higher ceiling on packets per second. The Elastic Network Interface (ENI) on the C5n uses up to 32 queues (in comparison to 8 on the C5 and C5d), allowing the packet processing workload to be better distributed across all available vCPUs.”

Resources specified in the YAML file:

  • Ubuntu2004

  • Disable Simultaneous Multi-threading

  • Spot Pricing

  • Shared EBS filesystem to install software

  • 1.2 TiB Shared Lustre file system with imported S3 Bucket (1.2 TiB is the minimum storage capacity that you can specify for a Lustre file system) mounted as /fsx, or an EBS volume of 500 GB mounted as /shared/data

  • Slurm Placement Group enabled

  • Elastic Fabric Adapter Enabled on c5n.9xlarge and c5n.18xlarge

Note

Pricing information in the tables below is subject to change. The links from which this pricing data was collected are listed below.

See also

AWS c5n Pricing

See also

EC2 SPOT Pricing

Spot versus On-Demand Pricing#

Table 1. EC2 Instance On-Demand versus Spot Pricing (price is subject to change)

| Instance Name | vCPUs | RAM | EBS Bandwidth | Network Bandwidth | Linux On-Demand Price | Linux Spot Price |
|---|---|---|---|---|---|---|
| c4.large | 2 | 3.75 GiB | Moderate | 500 Mbps | $0.116/hour | $0.0312/hour |
| c4.8xlarge | 36 | 60 GiB | 10 Gbps | 4,000 Mbps | $1.856/hour | $0.5903/hour |
| c5n.large | 2 | 5.25 GiB | Up to 3.5 Gbps | Up to 25 Gbps | $0.108/hour | $0.0324/hour |
| c5n.xlarge | 4 | 10.5 GiB | Up to 3.5 Gbps | Up to 25 Gbps | $0.216/hour | $0.0648/hour |
| c5n.2xlarge | 8 | 21 GiB | Up to 3.5 Gbps | Up to 25 Gbps | $0.432/hour | $0.1740/hour |
| c5n.4xlarge | 16 | 42 GiB | 3.5 Gbps | Up to 25 Gbps | $0.864/hour | $0.2860/hour |
| c5n.9xlarge | 36 | 96 GiB | 7 Gbps | 50 Gbps | $1.944/hour | $0.5971/hour |
| c5n.18xlarge | 72 | 192 GiB | 14 Gbps | 100 Gbps | $3.888/hour | $1.1732/hour |
| c6gn.16xlarge | 64 | 128 GiB | | 100 Gbps | $2.7648/hour | $0.6385/hour |
| c6a.48xlarge | 192 | 384 GiB | 40 Gbps | 50 Gbps | $7.344/hour | $6.0793/hour |
| hpc6a.48xlarge | 96 | 384 GiB | | 100 Gbps | $2.88/hour | unavailable |
| hpc7g.16xlarge | 64 | 128 GiB | | | $1.6832/hour | unavailable |

*Hpc6a instances have simultaneous multi-threading disabled to optimize for HPC codes. This means that unlike other EC2 instances, Hpc6a vCPUs are physical cores, not threads.
*Hpc6a instances are available in US East (Ohio) and GovCloud (US-West).
*Hpc6a is available On-Demand only (no Spot pricing).

Using c5n.18xlarge as the compute node, it costs ($3.888/hr)/($1.1732/hr) = 3.31 times as much to run On-Demand versus Spot; the savings is about 70% for Spot versus On-Demand pricing.

Using c5n.9xlarge as the compute node, it costs ($1.944/hr)/($0.5971/hr) = 3.26 times as much to run On-Demand versus Spot; the savings is about 70% for Spot versus On-Demand pricing.

Using c6gn.16xlarge as the compute node, it costs ($2.7648/hr)/($0.6385/hr) = 4.3 times as much to run On-Demand versus Spot; the savings is about 77% for Spot versus On-Demand pricing for this instance type.
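The same ratio can be checked quickly on the command line (prices from Table 1):

```csh
# Fractional savings of Spot versus On-Demand for c5n.18xlarge
echo "scale=3; (1 - 1.1732/3.888) * 100" | bc -l
# prints ~69.9, i.e. roughly a 70% savings
```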

Note

Sometimes the nodes are not available at Spot pricing in the region you are using. If this is the case, the job will not start running in the queue; see AWS ParallelCluster Troubleshooting.

Benchmark Timings for CMAQv5.3.3 12US2 Benchmark#

Benchmarks were performed using c5n.18xlarge (36 cores per node), c5n.9xlarge (18 cores per node), c6a.48xlarge (96 cores per node), and hpc6a.48xlarge (96 cores per node) compute nodes.

Benchmark Timing for c5n.18xlarge#

Table 2. Timing Results for CMAQv5.3.3 2 Day CONUS2 Run on ParallelCluster with c5n.large head node and C5n.18xlarge Compute Nodes

Note: for the c5n.18xlarge runs, I/O was done using /fsx; the InputData column indicates whether the data was copied to /fsx or imported (lazy-loaded from the S3 bucket).

| CPUs | NodesxCPU | COLROW | Day1 Timing (sec) | Day2 Timing (sec) | TotalTime | CPU Hours/day | SBATCH exclusive | InputData | Disable Simultaneous Multithreading (yaml) | with -march=native | Equation using Spot Pricing | SpotCost | Equation using On Demand Pricing | OnDemandCost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36 | 1x36 | 6x6 | 6726.72 | 5821.47 | 12548.19 | 1.74 | yes | imported | true | yes | 1.1732/hr * 1 node * 3.486 hr = | 4.09 | 3.888/hr * 1 node * 3.496 hr = | 13.59 |
| 72 | 2x36 | 6x12 | 3562.50 | 3151.21 | 6713.71 | .93 | yes | imported | true | yes | 1.1732/hr * 2 nodes * 1.8649 hr = | 4.37 | 3.888/hr * 2 nodes * 1.8649 = | 14.5 |
| 72 | 2x36 | 8x9 | 3665.65 | 3159.12 | 6824.77 | .95 | yes | imported | true | yes | 1.1732/hr * 2 nodes * 1.896 hr = | 4.45 | 3.888/hr * 2 nodes * 1.896 = | 14.7 |
| 72 | 2x36 | 9x8 | 3562.61 | 2999.69 | 6562.30 | .91 | yes | imported | true | yes | 1.1732/hr * 2 nodes * 1.822 hr = | 4.28 | 3.888/hr * 2 nodes * 1.822 = | 14.16 |
| 108 | 3x36 | 6x18 | 2415.46 | 2135.26 | 4550.72 | .63 | yes | imported | true | yes | 1.1732/hr * 3 nodes * 1.26 hr = | 4.45 | 3.888/hr * 3 nodes * 1.26 = | 14.7 |
| 108 | 3x36 | 12x9 | 2758.01 | 2370.92 | 5128.93 | .71 | yes | imported | true | yes | 1.1732/hr * 3 nodes * 1.42 hr = | 5.01 | 3.888/hr * 3 nodes * 1.42 hr = | 16.6 |
| 108 | 3x36 | 9x12 | 2454.11 | 2142.11 | 4596.22 | .638 | yes | imported | true | yes | 1.1732/hr * 3 nodes * 1.276 = | 4.49 | 3.888/hr * 3 nodes * 1.276 hr = | 14.88 |
| 180 | 5x36 | 10x18 | 2481.55 | 2225.34 | 4706.89 | .65 | no | copied | false | yes | 1.1732/hr * 5 nodes * 1.307 hr = | 7.66 | 3.888/hr * 5 nodes * 1.307 hr = | 25.4 |
| 180 | 5x36 | 10x18 | 2378.73 | 2378.73 | 4588.92 | .637 | no | copied | true | yes | 1.1732/hr * 5 nodes * 1.2747 hr = | 7.477 | 3.888/hr * 5 nodes * 1.2747 hr = | 24.77 |
| 180 | 5x36 | 10x18 | 1585.67 | 1394.52 | 2980.19 | .41 | yes | imported | true | yes | 1.1732/hr * 5 nodes * 2980.9 / 3600 = | 4.85 | 3.888/hr * 5 nodes * .82 hr = | 16.05 |
| 256 | 8x32 | 16x16 | 1289.59 | 1164.53 | 2454.12 | .34 | no | copied | true | yes | 1.1732/hr * 8 nodes * 2454.12 / 3600 = | $6.398 | 3.888/hr * 8 nodes * .6817 hr = | 21.66 |
| 256 | 8x32 | 16x16 | 1305.99 | 1165.30 | 2471.29 | .34 | yes | copied | true | yes | 1.1732/hr * 8 nodes * 2471.29 / 3600 = | 6.44 | 3.888/hr * 8 nodes * .686 hr = | 21.11 |
| 256 | 8x32 | 16x16 | 1564.90 | 1381.80 | 2946.70 | .40 | yes | imported | true | yes | 1.1732/hr * 8 nodes * 2946.7 / 3600 = | 7.68 | 3.888/hr * 8 nodes * .818 hr = | 25.45 |
| 288 | 8x36 | 16x18 | 1873.00 | 1699.24 | 3572.2 | .49 | no | copied | false | yes | 1.1732/hr * 8 nodes * 3572.2/3600 = | 9.313 | 3.888/hr * 8 nodes * .992 hr = | 30.8 |
| 288 | 8x36 | 16x18 | 1472.69 | 1302.84 | 2775.53 | .385 | yes | imported | true | yes | 1.1732/hr * 8 nodes * .771 = | 7.24 | 3.888/hr * 8 nodes * .771 = | 23.98 |
| 288 | 8x36 | 16x18 | 1976.35 | 1871.61 | 3847.96 | .53 | no | copied | true | yes | 1.1732/hr * 8 nodes * 1.069 = | 10.0 | 3.888/hr * 8 nodes * 1.069 = | 33.24 |
| 288 | 8x36 | 16x18 | 1197.19 | 1090.45 | 2287.64 | .31 | yes | copied | true | yes (16x18 matched 16x16) | 1.1732/hr * 8 nodes * .635 = | 5.96 | 3.888/hr * 8 nodes * .635 = | 19.76 |
| 288 | 8x36 | 18x16 | 1206.01 | 1095.76 | 2301.77 | .32 | yes | imported | true | yes | 1.1732/hr * 8 nodes * 2301.77 = | 6.00 | 3.888/hr * 8 nodes * .639 = | 19.88 |
| 360 | 10x36 | 18x20 | Unable to provision | | | | | | | | | | | |

Benchmark Timing for c5n.9xlarge#

Table 3. Timing Results for CMAQv5.3.3 2 Day CONUS2 Run on ParallelCluster with c5n.large head node and C5n.9xlarge Compute Nodes

| CPUs | NodesxCPU | COLROW | Day1 Timing (sec) | Day2 Timing (sec) | TotalTime | CPU Hours/day | SBATCH exclusive | Disable Simultaneous Multithreading (yaml) | with -march=native | InputData | Equation using Spot Pricing | SpotCost | Equation using On Demand Pricing | OnDemandCost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 1x18 | 3x6 | 14341.77 | 12881.59 | 27223.36 | 3.78 | yes | true | no | /fsx | 0.5971/hr * 1 node * 7.56 hr = | 4.51 | 1.944/hr * 1 node * 7.56 hr = | 14.69 |
| 18 | 1x18 | 3x6 | 12955.32 | 11399.07 | 24354.39 | 3.38 | yes | true | no | /shared | 0.5971/hr * 1 node * 6.76 hr = | 4.03 | 1.944/hr * 1 node * 6.76 = | 13.15 |
| 18 | 1x18 | 6x3 | 13297.84 | 11491.99 | 24789.83 | 3.44 | yes | true | no | /shared | 0.5971/hr * 1 node * 6.89 hr = | 4.11 | 1.944/hr * 1 node * 6.89 = | 13.39 |
| 36 | 2x18 | 6x6 | 6473.95 | 5599.76 | 12073.71 | 1.67 | yes | true | no | /shared | 0.5971/hr * 2 node * 3.35 hr = | 4.0 | 1.944/hr * 2 node * 3.35 hr = | 13.02 |
| 54 | 3x18 | 6x9 | 4356.33 | 3790.13 | 8146.46 | 1.13 | yes | true | no | /shared | 0.5971/hr * 3 node * 2.26 hr = | 4.05 | 1.944/hr * 3 node * 2.26 hr = | 13.2 |
| 54 | 3x18 | 9x6 | 4500.29 | 3876.76 | 8377.05 | 1.16 | yes | true | no | /shared | 0.5971/hr * 3 node * 2.33 hr = | 4.17 | 1.944/hr * 3 node * 2.33 = | 13.58 |
| 72 | 4x18 | 8x9 | 3382.01 | 2936.66 | 6318.67 | .8775 | yes | true | no | /shared | 0.5971/hr * 4 node * 1.755 hr = | 4.19 | 1.944/hr * 4 node * 1.755 hr = | 13.2 |
| 90 | 5x18 | 9x10 | 2878.55 | 2483.56 | 5362.11 | .745 | yes | true | no | /shared | 0.5971/hr * 5 node * 1.49 hr = | 4.45 | 1.944/hr * 5 node * 1.49 hr = | 14.44 |
| 108 | 6x18 | 9x12 | 2463.41 | 2161.07 | 4624.48 | .642 | yes | true | no | /shared | 0.5971/hr * 6 node * 1.28 hr = | 4.6 | 1.944/hr * 6 node * 1.28 hr = | 14.9 |
| 108 | 6x18 | 9x12 | 2713.95 | 2338.09 | 5052.04 | .702 | yes | true | no | /fsx linked | 0.5971/hr * 6 node * 1.40 hr = | 5.03 | 1.944/hr * 6 node * 1.40 hr = | |
| 108 | 6x18 | 9x12 | 2421.19 | 2144.16 | 4565.35 | .634 | yes | true | no | /fsx copied | 0.5971/hr * 6 node * 1.27 = | 4.54 | 1.944/hr * 6 node * 1.27 hr = | |
| 126 | 7x18 | 9x14 | 2144.86 | 1897.85 | 4042.71 | .56 | yes | true | no | /shared | 0.5971/hr * 7 node * 1.12 hr = | 4.69 | 1.944/hr * 7 node * 1.12 hr = | 15.24 |
| 144 | 8x18 | 12x12 | unable to provision | | | | | | | | | | | |
| 162 | 9x18 | 9x18 | unable to provision | | | | | | | | | | | |
| 180 | 10x18 | 10x18 | unable to provision | | | | | | | | | | | |

Benchmark Timing for hpc6a.48xlarge#

Table 4. Timing Results for CMAQv5.3.3 2 Day CONUS 2 Run on Parallel Cluster with c6a.xlarge head node and hpc6a.48xlarge Compute Nodes

| CPUs | NodesxCPU | COLROW | Day1 Timing (sec) | Day2 Timing (sec) | TotalTime | CPU Hours/day | SBATCH exclusive | Disable Simultaneous Multithreading (yaml) | with -march=native | With Pinning | InputData | Equation using Spot Pricing | SpotCost | Equation using On Demand Pricing | OnDemandCost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 96 | 1x96 | 12x8 | 2815.56 | 2368.43 | 5183.99 | .71 | yes | N/A | no | no | /fsx linked ? | ?/hr * 1 node * 1.44 = | ? | 2.88/hr * 1 node * 1.44 = | 4.147 |
| 96 | 1x96 | 12x8 | 2715.78 | 2318.15 | 5033.93 | .699 | yes | N/A | no | yes | /fsx linked ? | ?/hr * 1 node * 1.39 = | ? | 2.88/hr * 1 node * 1.39 = | 4.03 |
| 192 | 2x96 | 16x12 | 1586.15 | 1448.35 | 3034.50 | .421 | yes | N/A | no | no | /fsx linked? | ?/hr * 1 node * .842 = | ? | 2.88/hr * 2 node * .842 = | 4.84 |
| 192 | 2x96 | 16x12 | 1576.05 | 1447.76 | 3023.81 | .419 | yes | N/A | no | yes | /fsx linked? | ?/hr * 1 node * .839 = | ? | 2.88/hr * 2 node * .839 = | 4.83 |
| 288 | 3x96 | 16x18 | 1282.31 | 1189.40 | 2471.71 | .343 | yes | N/A | no | no | /fsx linked? | ?/hr * 1 node * .842 = | ? | 2.88/hr * 3 node * .686 = | 5.93 |
| 288 | 3x96 | 16x18 | 1377.44 | 1223.15 | 2600.59 | .361 | yes | N/A | no | yes | /fsx linked? | ?/hr * 1 node * .842 = | ? | 2.88/hr * 3 node * .722 = | 6.24 |
| 384 | 4x96 | 24x16 | 1211.88 | 1097.68 | 2309.56 | .321 | yes | N/A | no | no | /fsx linked? | ?/hr * 1 node * .642 = | ? | 2.88/hr * 4 node * .642 = | 7.39 |
| 384 | 4x96 | 24x16 | 1246.72 | 1095.40 | 2342.12 | .325 | yes | N/A | no | yes | /fsx linked? | ?/hr * 1 node * .650 = | ? | 2.88/hr * 4 node * .650 = | 7.49 |
| 480 | 5x96 | 24x20 | 1120.61 | 1010.33 | 2130.94 | .296 | yes | N/A | no | no | /fsx linked? | ?/hr * 1 node * .592 = | ? | 2.88/hr * 5 node * .592 = | 8.52 |
| 480 | 5x96 | 24x20 | 1114.46 | 1017.47 | 2131.93 | .296 | yes | N/A | no | yes | /fsx linked? | ?/hr * 1 node * .592 = | ? | 2.88/hr * 5 node * .592 = | 8.52 |
| 576 | 6x96 | 24x24 | 1041.13 | 952.11 | 1993.24 | .277 | yes | N/A | no | yes | /fsx linked? | ?/hr * 1 node * .553 = | ? | 2.88/hr * 6 node * .553 = | 9.57 |
| 576 | 6x96 | 24x24 | 1066.59 | 955.88 | 2022.47 | .281 | yes | N/A | no | yes | /fsx linked? | ?/hr * 1 node * .561 = | ? | 2.88/hr * 6 node * .561 = | 9.71 |

Benchmark Timing for c6a.48xlarge#

Table 5. Timing Results for CMAQv5.3.3 2 Day CONUS 2 Run on Parallel Cluster with c6a.xlarge head node and c6a.48xlarge Compute Nodes

| CPUs | NodesxCPU | COLROW | Day1 Timing (sec) | Day2 Timing (sec) | TotalTime | CPU Hours/day | SBATCH exclusive | Disable Simultaneous Multithreading (yaml) | with -march=native | With Pinning | InputData | Equation using Spot Pricing | SpotCost | Equation using On Demand Pricing | OnDemandCost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 96 | 1x96 | 12x8 | 2996.56 | 2556.50 | 5553.06 | .771 | yes | N/A | no | no | /fsx linked ? | ?/hr * 1 node * 1.54 = | ? | 7.344/hr * 1 node * 1.54 = | 11.33 |
| 96 | 1x96 | 12x8 | 2786.72 | 2374.83 | 5161.55 | .716 | yes | N/A | no | yes | /fsx linked ? | ?/hr * 1 node * 1.43 = | ? | 7.344/hr * 2 node * 1.43 = | 21.0 |
| 192 | 2x96 | 16x12 | 1643.19 | 1491.94 | 3135.13 | .435 | yes | N/A | no | yes | /fsx linked ? | ?/hr * 1 node * .87 = | ? | 7.344/hr * 2 node * .87 = | 12.8 |
| 192 | 3x64 | 16x12 | 1793.09 | 1586.95 | 3380.04 | .469 | yes | N/A | no | yes | /fsx linked ? | ?/hr * 1 node * .94 = | ? | 7.344/hr * 3 node * .94 = | 20.68 |
| 288 | 3x96 | 16x18 | 1287.99 | 1177.42 | 2465.41 | .342 | yes | N/A | no | yes | /fsx linked ? | ?/hr * 1 node * .684 = | ? | 7.344/hr * 3 node * .684 = | 15.09 |
| 288 | 3x96 | 16x18 | 1266.97 | 1201.90 | 2468.87 | .342 | yes | N/A | no | yes | /fsx linked ? | ?/hr * 1 node * .684 = | ? | 7.344/hr * 3 node * .684 = | 15.09 |

Benchmark Scaling Plots for CMAQv5.3.3 12US2 Benchmark#

Benchmark Scaling Plot for c5n.18xlarge#

Figure 1. Scaling per Node on C5n.18xlarge Compute Nodes (36 cpu/node)

Scaling per Node for C5n.18xlarge Compute Nodes (36 cpu/node)

Note: several timings were obtained using 8 nodes. The 288 cpu timings fully utilized the 36-cpu nodes (8 x 36 = 288 cpus), using two different NPCOLxNPROW options, 16x18 and 18x16. The 256 cpu timings used an NPCOLxNPROW configuration of 16x16; this configuration does not fully utilize all of the cpus per node, so the efficiency per node is lower and the cost is higher. It is best to select NPCOLxNPROW settings that fully utilize all of the CPUs requested in the SBATCH commands.

#SBATCH --nodes=8
#SBATCH --ntasks-per-node=36

Figure 2. Scaling per CPU on c5n.18xlarge compute node

Scaling per CPU for C5n.18xlarge Compute Nodes (36 cpu/node)

Note: poor performance was obtained for the runs using 180 processors when the SBATCH --exclusive option was not used. After this finding, the CMAQ run scripts were modified to always use this option. The benchmark runs that were done on c5n.9xlarge used the SBATCH --exclusive option.

Investigation of why there is a difference between the total run times for the benchmark when NPCOLxNPROW used 12x9 as compared to 9x12 and 6x18.#

A comparison of the log files (sdiff run_cctmv5.3.3_Bench_2016_12US2.108.12x9pe.2day.pcluster.log run_cctmv5.3.3_Bench_2016_12US2.108.9x12pe.2day.pcluster.log) revealed that the CPU speed for the ParallelCluster run of the 12x9 benchmark case was slower than the CPU speed used for the 9x12 benchmark case. See the following section for details: Comparison of log files for 12x9 versus 9x12 Benchmark runs

The scaling efficiency using 5 nodes of 36 cpus/node = 180 cpus was 84%.

The scaling efficiency dropped to 68% when using 8 nodes of 36 cpus/node = 288 cpus.

Figure 3. Scaling per Node on C5n.9xlarge Compute Nodes (18 cpu/node)

Scaling per Node for C5n.9xlarge Compute Nodes (18 cpu/node)

Scaling is very good for the c5n.9xlarge compute nodes up to 7 nodes, the largest number of nodes that could be provisioned at the time this benchmark was performed.

Figure 4. Scaling per CPU on C5n.9xlarge Compute Node (18 cpu/node)

Scaling per CPU for C5n.9xlarge Compute Nodes (18 cpu/node)

Scaling is also good when compared to the number of cpus used. Note that all benchmark runs performed using the c5n.9xlarge compute nodes fully utilized the number of cpus available on a node.

The scaling efficiency using 7 nodes of 18 cpus/node = 126 cpus was 86%.

Benchmark Scaling Plot for c5n.18xlarge and c5n.9xlarge#

Figure 5 shows the scaling per-node, as the configurations that were run were multiples of the number of cpus per node. CMAQ was not run on a single cpu, as this would have been costly and inefficient.

Figure 5. Scaling on C5n.9xlarge (18 cpu/node) and C5n.18xlarge Compute Nodes (36 cpu/node)

Scaling Plot for C5n.9xlarge (18 cpu/node) and C5n.18xlarge Compute Nodes (36 cpu/node)

Total Time and Cost versus CPU Plot for c5n.18xlarge#

Figure 6 shows the timings for many of the configuration options listed in the table above for the c5n.18xlarge cluster. Running with no hyperthreading, using SBATCH --exclusive, and with the placement group enabled resulted in the fastest timings.

Additional benchmark runs may be needed to determine the impact on performance when linking the input data using the lustre file system or copying the data to lustre and/or using the /shared ebs volume for I/O.

Figure 6. Plot of Total Time and On Demand Cost versus CPUs for c5n.18xlarge

Plot of Total Time and On Demand Cost versus CPUs for c5n18xlarge

Total Time and Cost versus CPU Plot for c5n.9xlarge#

Figure 7 shows how the total run time and On Demand Cost varies as additional CPUs are used. Note that the run script and yaml settings used for the c5n.9xlarge used settings that were optimized for running CMAQ on the cluster.

Figure 7. Plot of Total Time and On Demand Cost versus CPUs for c5n.9xlarge

Plot of Total Time and On Demand Cost versus CPUs for c5n9xlarge

Total Time and Cost versus CPU Plot for both c5n.18xlarge and c5n.9xlarge#

Figure 8. Plot of Total Time and On Demand Cost versus CPUs for both c5n.18xlarge and c5n.9xlarge

Plot of Total Time and On Demand Cost versus CPUs for c5n18xlarge and c5n9xlarge

Total Time and Cost versus CPU Plot for hpc6a.48xlarge#

Figure 9 shows how the total run time and On Demand Cost varies as additional CPUs are used. Note that the run script and yaml settings used for the hpc6a.48xlarge used settings that were optimized for running CMAQ on the cluster.

Figure 9. Plot of Total Time and On Demand Cost versus CPUs for hpc6a.48xlarge

Plot of Total Time and On Demand Cost versus CPUs for hpc6a.48xlarge

Cost Information#

Cost information is available within the AWS Web Console for your account as you use resources, and there are also ways to forecast your costs using the pricing information available from AWS.

Cost Explorer#

Example screenshots of the AWS Cost Explorer graphs were obtained after running several of the CMAQ benchmarks, varying the number of nodes, the number of cpus, and NPCOL/NPROW. These costs are from a two-day session of running CMAQ on the ParallelCluster and should only be used to understand the relative cost of the EC2 instances (head node and compute nodes) compared to the storage and network costs.

In Figure 10, the Cost Explorer display shows the cost of the different EC2 instance types; note that c5n.18xlarge is the highest cost, as these are used as the compute nodes.

Figure 10. Cost by Instance Type - AWS Console

AWS Cost Management Console - Cost by Instance Type

In Figure 11, the Cost Explorer displays a graph of the cost categorized by usage type: Spot or On-Demand instances, NatGateway, or timed storage. Note that spot-c5n.18xlarge is the highest-cost resource, but other resources such as storage on the EBS volume and the network (NatGateway or SubnetIDs) also incur costs.

Figure 11. Cost by Usage Type - AWS Console

AWS Cost Management Console - Cost by Usage Type

In Figure 12, the Cost Explorer display shows the cost by service, including EC2 instances, S3 buckets, and FSx Lustre file systems.

Figure 12. Cost by Service Type - AWS Console

AWS Cost Management Console - Cost by Service Type

Compute Node Cost Estimate#

The head node (c5n.large) compute cost covers the entire time that the ParallelCluster is running (creation to deletion): 6 hours * $0.0324/hr = $0.1944 using Spot pricing, or 6 hours * $0.108/hr = $0.648 using On-Demand pricing.

Using 288 cpus on the ParallelCluster, it would take ~4.83 days to run a full year using 8 c5n.18xlarge (36 cpu/node) compute nodes.

Using 192 cpus on the ParallelCluster, it would take ~6.4 days to run a full year using 2 hpc6a.48xlarge (96 cpu/node) compute nodes.

Using 126 cpus on the ParallelCluster, it would take ~8.52 days to run a full year using 7 c5n.9xlarge (18 cpu/node) compute nodes.

Table 8. Extrapolated Cost of compute nodes used for CMAQv5.3.3 Annual Simulation based on 2 day CONUS benchmark

| Benchmark Case | Compute Node | Number of PES | Number of Nodes | Pricing | Cost per node | Time to completion (hour) | Equation to Extrapolate Cost for Annual Simulation | Annual Cost | Days to Complete Annual Simulation |
|---|---|---|---|---|---|---|---|---|---|
| 2 day 12US2 | c5n.18xlarge | 108 | 3 | SPOT | 1.1732/hour | 4550.72/3600 = 1.264 | 1.264/2 * 365 = 231 hours/node * 3 nodes = 692 hr * $1.1732/hr = | $811.9 | 9.61 |
| 2 day 12US2 | c5n.18xlarge | 108 | 3 | ONDEMAND | 3.888/hour | 4550.72/3600 = 1.264 | 1.264/2 * 365 = 231 hours/node * 3 nodes = 692 hr * $3.888/hr = | $2690.4 | 9.61 |
| 2 day 12US2 | c5n.18xlarge | 180 | 5 | SPOT | 1.1732/hour | 2980.19/3600 = .8278 | .8278/2 * 365 = 151 hours/node * 5 nodes = 755 hr * $1.1732/hr = | $886 | 6.29 |
| 2 day 12US2 | c5n.18xlarge | 180 | 5 | ONDEMAND | 3.888/hour | 2980.19/3600 = .8278 | .8278/2 * 365 = 151 hours/node * 5 nodes = 755 hr * $3.888/hr = | $2935.44 | 6.29 |
| 2 day 12US2 | c5n.9xlarge | 126 | 7 | SPOT | .5971/hour | 4042.71/3600 = 1.12 | 1.12/2 * 365 = 204.94 hours/node * 7 nodes = 1434.6 hr * $.5971/hr = | $856 | 8.52 |
| 2 day 12US2 | c5n.9xlarge | 126 | 7 | ONDEMAND | 1.944/hour | 4042.71/3600 = 1.12 | 1.12/2 * 365 = 204.94 hours/node * 7 nodes = 1434.6 hr * $1.944/hr = | $2788.8 | 8.52 |
| 2 day 12US2 | hpc6a.48xlarge | 96 | 1 | ONDEMAND | $2.88/hour | 5033.93/3600 = 1.40 | 1.40/2 * 365 = 255 hours/node * 1 node = 255 hr * $2.88/hr = | $734 | 10.6 |
| 2 day 12US2 | hpc6a.48xlarge | 192 | 2 | ONDEMAND | $2.88/hour | 3023.81/3600 = .839 | .839/2 * 365 = 153.29 hours/node * 2 nodes = 306 hr * $2.88/hr = | $883 | 6.4 |
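The per-row arithmetic in Table 8 follows the same pattern; a sketch using the values from the first row:

```csh
#!/bin/csh -f
# Extrapolate annual Spot compute cost from the 2-day, 108-cpu c5n.18xlarge benchmark
set total_sec = 4550.72   # 2-day total wall time (seconds)
set nodes     = 3
set price     = 1.1732    # Spot $/hr for c5n.18xlarge
echo "($total_sec / 3600) / 2 * 365 * $nodes * $price" | bc -l
# prints ~811.96, matching the ~$811.9 annual cost in the table
```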

Note

These cost estimates depend on the availability of the requested number of nodes for the instance type. If fewer nodes are available, it will take longer to complete the annual run, but the costs should still be accurate, as the CONUS 12US2 domain benchmark scales well up to this number of nodes. For example, the cost of running an annual simulation on 3 c5n.18xlarge nodes using On-Demand pricing is $2690.4, while on 5 c5n.18xlarge nodes it is $2935.44; if only 3 nodes are available, you pay less but wait longer for the run to complete (9.61 days using 3 nodes versus 6.29 days using 5 nodes).

Storage Cost Estimate#

Table 9. Lustre SSD File System Pricing for us-east-1 region

| Storage Type | Storage options | Pricing with data compression enabled* | Pricing (monthly) |
|---|---|---|---|
| Persistent | 125 MB/s/TB | $0.073 | $0.145/month |
| Persistent | 250 MB/s/TB | $0.105 | $0.210/month |
| Persistent | 500 MB/s/TB | $0.170 | $0.340/month |
| Persistent | 1,000 MB/s/TB | $0.300 | $0.600/month |
| Scratch | 200 MB/s/TiB | $0.070 | $0.140/month |

Note: there is a difference between the storage sizing units (TB versus TiB) quoted by AWS.

See also

TB vs TiB

Quote from the above website: “One tebibyte is equal to 2^40 or 1,099,511,627,776 bytes. One terabyte is equal to 10^12 or 1,000,000,000,000 bytes. A tebibyte equals nearly 1.1 TB. That’s about a 10% difference between the size of a tebibyte and a terabyte, which is significant when talking about storage capacity.”

The Lustre Scratch SSD 200 MB/s/TiB tier is the storage pricing that we have configured in the YAML for the CMAQ ParallelCluster.

Cost example: 0.14 USD per GB-month / 730 hours in a month = 0.00019178 USD per GB-hour

Note: 1.2 TiB is the minimum storage capacity that you can specify for the Lustre file system.

1,200 GiB x 0.00019178 USD per hour x 24 hours x 5 days = 27.6 USD

Question: is 1.2 TiB enough for the output of a yearly CMAQ run?

For the output data, assuming a 2 day CONUS run with all 35 layers and all 244 variables in the CONC output:

cd /fsx/data/output/output_CCTM_v532_gcc_2016_CONUS_16x8pe_full
du -sh

Size of the output directory when CMAQ is run to output all 35 layers and all 244 variables in the CONC file (includes all other output files):

173G .

So we need 86.5 GB per day

Storage requirement for an annual simulation, assuming you keep all of the data on the Lustre file system:

 86.5 GB * 365 days = 31,572.5 GB  = 31.5 TB

Annual simulation local storage cost estimate#

Assuming it takes 5 days to complete the annual simulation, and after the annual simulation is completed, the data is moved to archive storage.

 31,572.5 GB x 0.00019178 USD per hour x 24 hours x 5 days = $726.5 USD

To reduce storage requirements, after the CMAQ run for each month is completed and the post-processing scripts have been run, the CMAQ output data for that month is moved from the Lustre file system to archive storage. The monthly storage requirement to keep one month of data on the Lustre file system is approximately 86.5 GB x 30 days = 2,595 GB, or 2.6 TB.

  2,595 GB x 0.00019178 USD per hour x 24 hours x 5 days = $60 USD
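The storage arithmetic above can be reproduced the same way:

```csh
# Lustre scratch storage held for 5 days at 0.00019178 USD per GB-hour
echo "31572.5 * 0.00019178 * 24 * 5" | bc -l     # ~726.5 USD (full annual output)
echo "2595 * 0.00019178 * 24 * 5" | bc -l        # ~60 USD (one month of output)
```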

Estimate for S3 Bucket cost for storing an annual simulation

S3 Standard - General purpose storage pricing:

| Tier | Storage Pricing |
|---|---|
| First 50 TB / Month | $0.023 per GB |
| Next 450 TB / Month | $0.022 per GB |
| Over 500 TB / Month | $0.021 per GB |

Archive Storage cost estimate for annual simulation - assuming you want to save it for 1 year#

31.5 TB * 1024 GB/TB * .023 per GB * 12 months = $8,903

S3 Glacier Flexible Retrieval (formerly S3 Glacier) - long-term archives with retrieval options from 1 minute to 12 hours:

| Tier | Storage Pricing |
|---|---|
| All Storage / Month | $0.0036 per GB |

S3 Glacier Flexible Retrieval costs 6.4 times less than S3 Standard:

31.5 TB * 1024 GB/TB * $.0036 per GB * 12 months = $1393.0 USD

A lower cost option is S3 Glacier Deep Archive (for data accessed once or twice a year, restored within 12 hours):

31.5 TB * 1024 GB/TB * $.00099 per GB * 12 months = $383 USD

Side by Side Comparison of the information in the log files for 12x9 pe run compared to 9x12 pe run.#

cd /shared/pcluster-cmaq/c5n.18xlarge_scripts_logs

sdiff run_cctmv5.3.3_Bench_2016_12US2.108.12x9pe.2day.pcluster.log run_cctmv5.3.3_Bench_2016_12US2.108.9x12pe.2day.pcluster.log | more

Output:

Start Model Run At  Fri Feb 25 20:48:42 UTC 2022	      |	Start Model Run At  Thu Feb 24 01:04:42 UTC 2022
information about processor including whether using hyperthre	information about processor including whether using hyperthre
Architecture:                    x86_64				Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit			CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian			Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits vi	Address sizes:                   46 bits physical, 48 bits vi
CPU(s):                          36				CPU(s):                          36
On-line CPU(s) list:             0-35				On-line CPU(s) list:             0-35
Thread(s) per core:              1				Thread(s) per core:              1
Core(s) per socket:              18				Core(s) per socket:              18
Socket(s):                       2				Socket(s):                       2
NUMA node(s):                    2				NUMA node(s):                    2
Vendor ID:                       GenuineIntel			Vendor ID:                       GenuineIntel
CPU family:                      6				CPU family:                      6
Model:                           85				Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 81	Model name:                      Intel(R) Xeon(R) Platinum 81
Stepping:                        4				Stepping:                        4
CPU MHz:                         2887.020		      |	CPU MHz:                         2999.996
BogoMIPS:                        5999.98		      |	BogoMIPS:                        5999.99
Hypervisor vendor:               KVM				Hypervisor vendor:               KVM
Virtualization type:             full				Virtualization type:             full
L1d cache:                       1.1 MiB			L1d cache:                       1.1 MiB
L1i cache:                       1.1 MiB			L1i cache:                       1.1 MiB
L2 cache:                        36 MiB				L2 cache:                        36 MiB
L3 cache:                        49.5 MiB			L3 cache:                        49.5 MiB
NUMA node0 CPU(s):               0-17				NUMA node0 CPU(s):               0-17
NUMA node1 CPU(s):               18-35				NUMA node1 CPU(s):               18-35

     ===========================================		     ===========================================
     |>---   ENVIRONMENT VARIABLE REPORT   ---<|		     |>---   ENVIRONMENT VARIABLE REPORT   ---<|
     ===========================================		     ===========================================

     |> Grid and High-Level Model Parameters:			     |> Grid and High-Level Model Parameters:
     +=========================================			     +=========================================
      --Env Variable-- | --Value--				      --Env Variable-- | --Value--
      -------------------------------------------------------	      -------------------------------------------------------
             BLD  |             (default)			             BLD  |             (default)
          OUTDIR  |  /fsx/data/output/output_CCTM_v533_gcc_20 |	          OUTDIR  |  /fsx/data/output/output_CCTM_v533_gcc_20
       NEW_START  |          T					       NEW_START  |          T
  ISAM_NEW_START  |  Y (default)				  ISAM_NEW_START  |  Y (default)
       GRID_NAME  |  12US2					       GRID_NAME  |  12US2
       CTM_TSTEP  |       10000					       CTM_TSTEP  |       10000
      CTM_RUNLEN  |      240000					      CTM_RUNLEN  |      240000
    CTM_PROGNAME  |  DRIVER (default)				    CTM_PROGNAME  |  DRIVER (default)
      CTM_STDATE  |     2015356					      CTM_STDATE  |     2015356
      CTM_STTIME  |           0					      CTM_STTIME  |           0
     NPCOL_NPROW  |  12 9				      |	     NPCOL_NPROW  |  9 12
     CTM_MAXSYNC  |         300					     CTM_MAXSYNC  |         300



==================================				==================================
  ***** CMAQ TIMING REPORT *****				  ***** CMAQ TIMING REPORT *****
==================================				==================================
Start Day: 2015-12-22						Start Day: 2015-12-22
End Day:   2015-12-23						End Day:   2015-12-23
Number of Simulation Days: 2					Number of Simulation Days: 2
Domain Name:               12US2				Domain Name:               12US2
Number of Grid Cells:      3409560  (ROW x COL x LAY)		Number of Grid Cells:      3409560  (ROW x COL x LAY)
Number of Layers:          35					Number of Layers:          35
Number of Processes:       108					Number of Processes:       108
   All times are in seconds.					   All times are in seconds.

Num  Day        Wall Time					Num  Day        Wall Time
01   2015-12-22   2758.01				      |	01   2015-12-22   2454.11
02   2015-12-23   2370.92				      |	02   2015-12-23   2142.11
     Total Time = 5128.93				      |	     Total Time = 4596.22
      Avg. Time = 2564.46				      |	      Avg. Time = 2298.11


Developer Guide to install and run CMAQv5.3.3 on a Single VM or ParallelCluster#

CMAQv5.3.3 on Single Virtual Machine Advanced (optional)#

Run CMAQv5.3.3 on a single Virtual Machine (VM) using c6a.xlarge (4 CPUs) and Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-1031-aws x86_64), then upgrade to c6a.48xlarge.

Install Software and run CMAQv5.3.3 on c6a.2xlarge for the 2016_12SE1 Benchmark#

Instructions are provided to build and install CMAQ on a c6a.2xlarge compute node created from an Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-1031-aws x86_64) image that contains modules for git, openmpi, and gcc. The compute node does not have a SLURM scheduler on it, so jobs are run interactively from the command line.

Instructions to install data and CMAQ libraries and model are provided along with sample run scripts to run CMAQ on 4 processors on a single c6a.2xlarge instance.

This will provide users with experience using the AWS Console to create a Virtual Machine, select an Operating System, select the size of the VM as c6a.2xlarge, use an SSH private key to log in, and install and run CMAQ.

Using this method, the user needs to be careful to start and stop the Virtual Machine, and to have it running only while doing the initial installation and while running CMAQ. The c6a.2xlarge instance will incur charges as long as it is on, even if a job isn’t running on it.

This is different from the ParallelCluster, where the compute nodes are shut down, and not incurring costs, whenever CMAQ is not running in the queue.

Build CMAQv5.3.3 on c6a.2xlarge EC2 instance#
Create a c6a.xlarge Virtual Machine#
  1. Login to AWS Console

  2. Select Get Started with EC2

  3. Select Launch Instance

  4. Application and OS (Operating System) Images: Select Ubuntu 22.04 LTS (HVM), SSD Volume Type. (The version of the OS determines what packages are available from apt-get, which in turn determines the version of the software obtained, e.g. cdo version > 2.0 for Ubuntu 22.04 LTS, or cdo version < 2.0 for Ubuntu 18.04.)

  5. Instance Type: Select c6a.2xlarge ($0.xxx/hr)

  6. Key pair - SSH public key, select existing key or create a new one.

  7. Network settings - select default settings

  8. Configure storage - select 100 GiB gp3 Root volume

  9. Select Launch instance

AWS EC2 Console

Login to the Virtual Machine#

Change the permissions on the public key using command

chmod 400  [your-key-name].pem

Login to the Virtual Machine using ssh to the IP address using the public key.

ssh -Y -i ./xxxxxxx_key.pem ubuntu@xx.xx.xx.xx

Make the /shared directory#

sudo mkdir /shared

Change the group and ownership of the shared directory#
sudo chown ubuntu /shared
sudo chgrp ubuntu /shared

Change directories and verify that you see the /shared directory with Size of 100 GB

cd /shared

df -h

Output

df -h
Filesystem       Size  Used Avail Use% Mounted on
/dev/root         97G  1.6G   96G   2% /
tmpfs             16G     0   16G   0% /dev/shm
tmpfs            6.2G  876K  6.2G   1% /run
tmpfs            5.0M     0  5.0M   0% /run/lock
/dev/nvme0n1p15  105M  6.1M   99M   6% /boot/efi
tmpfs            3.1G  4.0K  3.1G   1% /run/user/1000

Create subdirectories on /shared#

Create a /shared/build and /shared/data directory

cd /shared
mkdir build
mkdir data
Check operating system version#
lsb_release -a

output

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.2 LTS
Release:	22.04
Codename:	jammy

Install Environment Modules#
sudo apt-get upgrade
sudo apt-get install environment-modules
Logout and then log back in to activate modules command#
Verify module command works#
 module list

Output:

No Modulefiles Currently Loaded.

module avail

Output:

--------------------------------------------------------------------------------------- /usr/share/modules/modulefiles ---------------------------------------------------------------------------------------
dot  module-git  module-info  modules  null  use.own  
Set up build environment#

Load the git module

module load module-git

If you do not see git available as a module, you may need to install it as follows:

sudo apt-get install git

Install Compilers and OpenMPI#
sudo apt-get update
sudo apt-get install gcc-9
sudo apt-get  install gfortran-9
sudo apt-get install openmpi-bin openmpi-common libopenmpi-dev libgtk2.0-dev
sudo apt-get install tcsh
Change shell to use tcsh#
sudo usermod -s /usr/bin/tcsh ubuntu
Logout and log back in, then check the shell#
echo $SHELL

output

/usr/bin/tcsh
Check available versions of compiler#
dpkg --list | grep compiler
Choose gcc-9 and gfortran-9 as default compilers#
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 9
sudo update-alternatives --install /usr/bin/gfortran gfortran /usr/bin/gfortran-9 9
Check version of gcc#
gcc --version

output

gcc --version
gcc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
Check version of gfortran#
gfortran --version

Output

GNU Fortran (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
Check version of OpenMPI#
mpirun --version

output

mpirun (Open MPI) 4.1.2
Install Parallel Cluster CMAQ Repo#

cd /shared

git clone -b main https://github.com/CMASCenter/pcluster-cmaq.git

Install and build netcdf C, netcdf Fortran, I/O API, and CMAQ#

cd /shared/pcluster-cmaq/install

Install netcdf-C and netcdf-Fortran#

./gcc_netcdf_singlevm.csh |& tee ./gcc_netcdf_singlevm.log

If successful, you will see the following output, that at the bottom shows what versions of the netCDF library were installed.


+-------------------------------------------------------------+
| Congratulations! You have successfully installed the netCDF |
| Fortran libraries.                                          |
|                                                             |
| You can use script "nf-config" to find out the relevant     |
| compiler options to build your application. Enter           |
|                                                             |
|     nf-config --help                                        |
|                                                             |
| for additional information.                                 |
|                                                             |
| CAUTION:                                                    |
|                                                             |
| If you have not already run "make check", then we strongly  |
| recommend you do so. It does not take very long.            |
|                                                             |
| Before using netCDF to store important data, test your      |
| build with "make check".                                    |
|                                                             |
| NetCDF is tested nightly on many platforms at Unidata       |
| but your platform is probably different in some ways.       |
|                                                             |
| If any tests fail, please see the netCDF web site:          |
| https://www.unidata.ucar.edu/software/netcdf/                |
|                                                             |
| NetCDF is developed and maintained at the Unidata Program   |
| Center. Unidata provides a broad array of data and software |
| tools for use in geoscience education and research.         |
| https://www.unidata.ucar.edu                                 |
+-------------------------------------------------------------+

make[3]: Leaving directory '/shared/build/netcdf-fortran-4.5.4'
make[2]: Leaving directory '/shared/build/netcdf-fortran-4.5.4'
make[1]: Leaving directory '/shared/build/netcdf-fortran-4.5.4'
netCDF 4.8.1
netCDF-Fortran 4.5.3

Install I/O API

./gcc_ioapi_singlevm.csh |& tee ./gcc_ioapi_singlevm.log

Find what operating system is on the system:

cat /etc/os-release

Output

PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
Copy a file to set paths#

cd /shared/pcluster-cmaq/install

cp dot.cshrc.singlevm ~/.cshrc

Log out and log back in to activate the updated shell, or use csh#
Create Custom Environment Module for Libraries#

There are two steps required to create your own custom module:

  1. write a module file

  2. add a line to your ~/.cshrc to update the MODULEPATH

Create a new custom module that will be loaded with:

module load ioapi-3.2/gcc-9.5-netcdf

Step 1: Create the module file for ioapi-3.2.

First, create a path to store the module file. The path must contain /Modules/modulefiles/ and should have the general form <prefix>/Modules/modulefiles/<module_name>/<version>, where <version> is typically numerical and is the actual module file.

mkdir -p /shared/build/Modules/modulefiles/ioapi-3.2

Next, create the module file and save it in the directory above.

cd /shared/build/Modules/modulefiles/ioapi-3.2
vim gcc-9.5-netcdf

Contents of gcc-9.5-netcdf:

#%Module
  
proc ModulesHelp { } {
   puts stderr "This module adds ioapi-3.2/gcc-9.5 to your path"
}

module-whatis "This module adds ioapi-3.2/gcc-9.5 to your path\n"

set basedir "/shared/build/ioapi-3.2/"
prepend-path PATH "${basedir}/Linux2_x86_64gfort"
prepend-path LD_LIBRARY_PATH "${basedir}/ioapi/fixed_src"

The example module file above updates two environment variables, prepending the I/O API directories to PATH and LD_LIBRARY_PATH.

Step 2. Create the module file for netcdf-4.8.1

mkdir -p /shared/build/Modules/modulefiles/netcdf-4.8.1

Next, create the module file and save it in the directory above.

cd /shared/build/Modules/modulefiles/netcdf-4.8.1
vim gcc-9.5

Contents of gcc-9.5

#%Module
  
proc ModulesHelp { } {
   puts stderr "This module adds netcdf-4.8.1/gcc-9.5 to your path"
}

module-whatis "This module adds netcdf-4.8.1/gcc-9.5 to your path\n"

set basedir "/shared/build/netcdf"
prepend-path PATH "${basedir}/bin"
prepend-path LD_LIBRARY_PATH "${basedir}/lib"
module load mpi/openmpi-4.1.2

Step 3. Create the module file for mpi

mkdir -p /shared/build/Modules/modulefiles/mpi

Next, create the module file and save it in the directory above.

cd /shared/build/Modules/modulefiles/mpi
vim openmpi-4.1.2

Contents of openmpi-4.1.2

#%Module
  
proc ModulesHelp { } {
   puts stderr "This module adds mpi/openmpi-4.1.2 to your path"
}

module-whatis "This module adds mpi/openmpi-4.1.2 to your path\n"

set basedir "/usr/lib/x86_64-linux-gnu/openmpi/"
prepend-path PATH "/usr/bin/"
prepend-path LD_LIBRARY_PATH "${basedir}/lib"

Step 4: Add the module path to MODULEPATH.

Now that the module file has been created, add the following line to your ~/.cshrc file so that it can be found:

module use --append /shared/build/Modules/modulefiles

Step 5: View the modules available after creation of the new module

The module avail command shows the paths to the module files on a given cluster.

module avail

Output

ioapi-3.2/gcc-9.5-netcdf  mpi/openmpi-4.1.2  netcdf-4.8.1/gcc-9.5 

Step 6: Load the new modules

module load ioapi-3.2/gcc-9.5-netcdf netcdf-4.8.1/gcc-9.5 mpi/openmpi-4.1.2
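A quick way to confirm the new modules took effect (the expected paths follow from the module files created above):

```csh
which nf-config    # should resolve to /shared/build/netcdf/bin/nf-config
which mpirun       # should resolve to /usr/bin/mpirun
echo $LD_LIBRARY_PATH
```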
Find path for openmpi libraries#
ompi_info --path libdir

output

 Libdir: /usr/lib/x86_64-linux-gnu/openmpi/lib
Find path for include files for openmpi#
ompi_info --path incdir

output

Incdir: /usr/lib/x86_64-linux-gnu/openmpi/include
Edit the config_cmaq_singlevm.csh script to specify the paths for OpenMPI#

Note: search for "case gcc" so that you edit the section of the file that uses the gcc compiler.

       setenv MPI_INCL_DIR     /usr/lib/x86_64-linux-gnu/openmpi/include              #> MPI Include directory path
        setenv MPI_LIB_DIR     /usr/lib/x86_64-linux-gnu/openmpi/lib             #> MPI Lib directory path
Install Python#
sudo apt-get install python3 python3-pip

Check the versions:

python3 --version

Output:

Python 3.10.6

python3 -m pip --version

Output:

pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)

Install jupyter notebook.#
pip install jupyterlab
Install and Build CMAQ#
cd /shared/pcluster-cmaq/install
./gcc_cmaq533_singlevm.csh |& tee ./gcc_cmaq533_singlevm.log
Add a compile option to the Makefile to get beyond a type mismatch error (note: this is only needed if you are using the gcc-11 compiler)#

SKIP this step.

Add the following to the compile option: -fallow-argument-mismatch

cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts/BLD_CCTM_v54_gcc
vi Makefile.gcc

Output:

 FSTD = -fallow-argument-mismatch -O3 -funroll-loops -finit-character=32 -Wtabs -Wsurprising -ftree-vectorize  -ftree-loop-if-convert -finline-limit=512
Run make again#
make |& tee Make.log

Verify that the executable was successfully built.

ls  /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/BLD_CCTM_v533_gcc/*.exe

Output

/shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/BLD_CCTM_v533_gcc/CCTM_v533.exe
Check to see what scripts are available#

cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts

List the scripts available

ls -rlt *.csh*

Output

-rwxrwxr-x 1 ubuntu ubuntu 34318 Jul 19 17:47 run_cctm_Bench_2011_12SE1.csh
-rwxrwxr-x 1 ubuntu ubuntu 32649 Jul 19 17:47 bldit_cctm.csh
-rwxrwxr-x 1 ubuntu ubuntu 36130 Jul 19 17:47 run_cctm_2016_12US1.csh
-rwxrwxr-x 1 ubuntu ubuntu 36850 Jul 19 17:47 run_cctm_2015_HEMI.csh
-rwxrwxr-x 1 ubuntu ubuntu 34948 Jul 19 17:47 run_cctm_2014_12US1.csh
-rwxrwxr-x 1 ubuntu ubuntu 34262 Jul 19 17:47 run_cctm_2011_12US1.csh
-rwxrwxr-x 1 ubuntu ubuntu 35242 Jul 19 17:47 run_cctm_2010_4CALIF1.csh
-rwxrwxr-x 1 ubuntu ubuntu 49472 Jul 19 17:47 run_cctm_Bench_2016_12SE1.WRFCMAQ.csh
-rwxrwxr-x 1 ubuntu ubuntu 35799 Jul 19 18:43 run_cctm_Bench_2016_12SE1.csh

Download the Input data from the S3 Bucket#
Install aws command line#

see Install AWS CLI

cd /shared/build

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"

Install unzip and unzip file#

sudo apt install unzip

/usr/bin/unzip awscliv2.zip

sudo ./aws/install

output

You can now run: /usr/local/bin/aws --version

Note, you will need to add this path to your .cshrc

Edit .cshrc#

vi ~/.cshrc

add the following to the path /usr/local/bin

Output:

# start .cshrc

umask 002

if ( ! $?LD_LIBRARY_PATH ) then
    setenv LD_LIBRARY_PATH /shared/build/netcdf/lib
else
    setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/shared/build/netcdf/lib
endif

set path = ($path /shared/build/netcdf/bin /shared/build/ioapi-3.2/Linux2_x86_64gfort /opt/slurm/bin/ /usr/local/bin/ )

if ($?tcsh) then
   source /usr/share/modules/init/tcsh
else
   source /usr/share/modules/init/csh
endif
Install the input data using the s3 script#
Note: a fully scripted method to obtain the 12SE1 benchmark input data is still needed.
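One possible approach (a sketch only; the bucket name and prefix below are placeholders, not a confirmed location for the 12SE1 data) is an unsigned recursive copy with the AWS CLI:

# Copy the benchmark input data from a public S3 bucket (placeholder bucket and prefix)
mkdir -p /shared/data
aws s3 cp --no-sign-request --recursive s3://your-benchmark-bucket/2016_12SE1 /shared/data/2016_12SE1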

Note, this Virtual Machine does not have Slurm installed or configured.

Run CMAQ interactively using the following command:#
First check to see how many cpus you have available on the machine.#
lscpu

Output

CPU(s):                  4
  On-line CPU(s) list:   0-3

Verify that the run script is set to run on 4 cpus

   @ NPCOL  =  2; @ NPROW =  2
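To check this without opening an editor, a quick grep of the run script (a sketch, using the script run below) can be used; NPCOL x NPROW should equal the number of available cores:

grep -n 'NPCOL' /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctm_Bench_2016_12SE1.csh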

cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts

./run_cctm_Bench_2016_12SE1.csh |& tee ./run_cctm_Bench_2016_12SE1.log

When the run has completed, record the timing of the benchmark run.

tail -n 30  run_cctm_Bench_2016_12SE1.log

Output on 4 cores

==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2016-07-01
End Day:   2016-07-01
Number of Simulation Days: 1
Domain Name:               2016_12SE1
Number of Grid Cells:      280000  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       4
   All times are in seconds.

Num  Day        Wall Time
01   2016-07-01   2083.32
     Total Time = 2083.32
      Avg. Time = 2083.32

Install I/O API libraries that support HDF5#

This is required in order to:

  1. Run CMAQ using the compressed netCDF-4 input files provided on the S3 bucket or

  2. Convert the *.nc4 files to *.nc files (uncompressed classic netCDF-3 input files)

First build HDF5 libraries, then build netCDF-C, netCDF-Fortran

cd /shared/pcluster-cmaq
./gcc11_install_hdf5.csh

Upgrade to run CMAQ on larger EC2 Instance#

Save the AMI and create a new VM using a larger c6a.8xlarge (with 32 processors)#

Requires access to the AWS Web Interface. (I will look for instructions on how to do this from the AWS command line, but I don't currently have a method for this.)
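A possible command-line equivalent (an untested sketch; the instance ID, AMI ID, image name, and key name below are placeholders) would use the EC2 commands of the AWS CLI:

# Stop the running VM, snapshot it as an AMI, then launch a larger spot instance from that AMI
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 create-image --instance-id i-0123456789abcdef0 --name "cmaq533-singlevm"
aws ec2 describe-images --owners self --filters "Name=name,Values=cmaq533-singlevm"
aws ec2 run-instances --image-id ami-0123456789abcdef0 --instance-type c6a.8xlarge \
    --key-name your_key --instance-market-options MarketType=spot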

Use the AWS Console to Stop the Instance#

add screenshot

Use the AWS Console to Create a new AMI#

add screenshot

Check to see that the AMI has been created by examining the status. Wait for the status to change from Pending to Available.

Use the newly created AMI to launch a new Single VM using a larger EC2 instance.#

Launch a new instance using the AMI with the software loaded and request a spot instance for the c6a.8xlarge EC2 instance

Load the modules#
Test running the listos domain on 32 processors#

Output

     Processing Day/Time [YYYYDDD:HHMMSS]: 2017357:235600
       Which is Equivalent to (UTC): 23:56:00 Saturday,  Dec. 23, 2017
       Time-Step Length (HHMMSS): 000400
                 VDIFF completed...       3.6949 seconds
                COUPLE completed...       0.3336 seconds
                  HADV completed...       1.8413 seconds
                  ZADV completed...       0.5154 seconds
                 HDIFF completed...       0.4116 seconds
              DECOUPLE completed...       0.0696 seconds
                  PHOT completed...       0.7443 seconds
               CLDPROC completed...       2.4009 seconds
                  CHEM completed...       1.3362 seconds
                  AERO completed...       1.3210 seconds
            Master Time Step
            Processing completed...      12.6698 seconds

      =--> Data Output completed...       0.9872 seconds


     ==============================================
     |>---   PROGRAM COMPLETED SUCCESSFULLY   ---<|
     ==============================================
     Date and time 0:00:00   Dec. 24, 2017  (2017358:000000)

     The elapsed time for this simulation was    3389.0 seconds.

315644.552u 1481.008s 56:29.98 9354.7%	0+0k 33221248+26871200io 9891pf+0w

CMAQ Processing of Day 20171223 Finished at Wed Jun  7 02:25:47 UTC 2023

\\\\\=====\\\\\=====\\\\\=====\\\\\=====/////=====/////=====/////=====/////


==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-08-05
End Day:   2018-08-07
Number of Simulation Days: 3
Domain Name:               2018_12Listos
Number of Grid Cells:      21875  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       32
   All times are in seconds.

Num  Day        Wall Time
01   2018-08-05   80.6
02   2018-08-06   72.7
03   2018-08-07   76.3
     Total Time = 229.60
      Avg. Time = 76.53

Run CMAQv5.4 for the full 12US1 Domain on c6a.48xlarge#

Download the full 12US1 domain input data, which is netCDF-4 compressed, and convert it to uncompressed classic netCDF-3 format.

Note: I first tried running this domain on the c6a.8xlarge on 32 processors. The model failed with a signal 9, likely because there was not enough memory available to run the model.

I re-saved the AMI and launched a c6a.48xlarge with 192 vcpus, running as spot instance.

Spot Pricing cost for Linux in US East Region

c6a.48xlarge $6.4733 per Hour

Run utility to uncompress hdf5 *.nc4 files and save as classic *.nc files#
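One way to do the conversion (a sketch, assuming the netCDF utilities built with HDF5 support above are on your PATH; the file names are placeholders) is the nccopy utility with the classic output kind:

# Convert a compressed netCDF-4 file to an uncompressed classic netCDF-3 file
nccopy -k classic input_file.nc4 input_file.nc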

May need to look at disabling hyperthreading at runtime.

Disable Hyperthreading
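One way to check and disable hyperthreading at runtime on the VM (a sketch, assuming a Linux kernel that exposes the SMT control interface under /sys) is:

cat /sys/devices/system/cpu/smt/active        # 1 means hyperthreading is currently on
echo off | sudo tee /sys/devices/system/cpu/smt/control
lscpu | grep -E '^CPU\(s\)|Thread'            # confirm one thread per core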

Increased disk space on /shared to 500 GB#

Ran out of disk space when trying to run the full 12US1 domain, so it is necessary to increase the size. You can do this in the AWS Web Interface without stopping the instance.

Expanded the root volume to 500 GB and increased the throughput to 1000 MB/s in the AWS Web Interface, then extended the partition and resized the file system on the VM using these instructions.

Recognize Expanded Volume
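After growing the EBS volume in the console, the partition and file system on the VM also need to be extended. A minimal sketch (the device and partition names are assumptions; confirm them with lsblk, and note that growpart is provided by the cloud-guest-utils package on Ubuntu):

lsblk                              # identify the root device and partition
sudo growpart /dev/nvme0n1 1       # grow partition 1 to fill the larger volume
sudo resize2fs /dev/nvme0n1p1      # grow the ext4 file system
df -h                              # confirm the new size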

Rerunning the 12US1 case on an 8x12 processor decomposition, for a total of 96 processors.

It takes about 13 minutes of initial I/O prior to the model starting.

Successful run output:

==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2017-12-22
End Day:   2017-12-23
Number of Simulation Days: 2
Domain Name:               12US1
Number of Grid Cells:      4803435  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       96
   All times are in seconds.

Num  Day        Wall Time
01   2017-12-22   3395.1
02   2017-12-23   3389.0
     Total Time = 6784.10
      Avg. Time = 3392.05

Note, this run time is slower than a single node of the ParallelCluster using the HPC6a.48xlarge (total time = 5000 seconds). Note also that the 12US1 domain is larger than the 12US2 domain that was used for the HPC6a.48xlarge benchmarks. It would be good to do another benchmark for 12US1 using the HPC6a.48xlarge, a compute node that is configured for HPC by AWS. AWS turns off hyperthreading by default for the HPC6a.48xlarge, and there may be other optimizations for HPC applications (disk, networking, CPU).

CMAQv5.3.3 Advanced Tutorial (optional)#

  • Learn how to upgrade the ParallelCluster, by first creating a cluster that uses c5n.4xlarge as the compute nodes, and then upgrading the cluster to use c5n.18xlarge as the compute nodes.

  • Learn how to install CMAQ software and underlying libraries, copy input data, and run CMAQ.

Notice

Skip this tutorial if you successfully completed the Intermediate Tutorial and wish to proceed to the post-processing and QA instructions. You may, however, wish to build the underlying libraries and CMAQ code from scratch if you want to create a ParallelCluster using a different family of compute nodes, such as the c6gn.16xlarge compute nodes with the Arm-based AWS Graviton processor.

Advanced Tutorial (optional)

Use ParallelCluster without Software and Data pre-installed#

Step by step instructions to configuring and running a ParallelCluster for the CMAQ 12US2 benchmark with instructions to install the libraries and software.

Notice

Skip this tutorial if you successfully completed the Intermediate Tutorial, unless you need to build the CMAQ libraries and code to run on a different family of compute nodes, such as the c6gn.16xlarge compute nodes with the Arm-based AWS Graviton processor.

Create CMAQ Cluster using SPOT pricing#
Use an existing yaml file from the git repo to create a ParallelCluster#

cd /your/local/machine/install/path/

Use a configuration file from the github repo that was cloned to your local machine#

git clone -b main https://github.com/CMASCenter/pcluster-cmaq.git pcluster-cmaq

cd pcluster-cmaq

Edit the c5n-4xlarge.yaml#

vi c5n-4xlarge.yaml

Note

  1. the c5n-4xlarge.yaml is configured to use SPOT instance pricing for the compute nodes.

  2. the c5n-4xlarge.yaml is configured to use the c5n.4xlarge as the compute node, with up to 10 compute nodes, specified by MaxCount: 10.

  3. the c5n-4xlarge.yaml is configured to disable multithreading. This restricts computing to physical CPUs rather than allowing the use of all virtual CPUs (16 virtual cpus reduced to 8 cpus).

  4. given this yaml configuration, the maximum number of PEs that can be used to run CMAQ is 8 cpus x 10 nodes = 80, so the largest decomposition settings in the CMAQ run script are NPCOL = 8, NPROW = 10 or NPCOL = 10, NPROW = 8 (see the sketch after this list).
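For example, to use all 80 PEs allowed by this configuration, the Slurm and domain decomposition settings in the CMAQ run script might look like the following sketch (the job name and wall-clock limit are arbitrary choices):

#SBATCH --job-name=CMAQ
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive
#SBATCH --time=12:00:00

@ NPCOL = 8; @ NPROW = 10
@ NPROCS = $NPCOL * $NPROW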

Replace the key pair and subnet ID in the c5n-4xlarge.yaml file with the values created when you configured the demo cluster#
Region: us-east-1
Image:
  Os: ubuntu2004
HeadNode:
  InstanceType: c5n.large
  Networking:
    SubnetId: subnet-xx-xx-xx           << replace
  DisableSimultaneousMultithreading: true
  Ssh:
    KeyName: your_key                     << replace
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      CapacityType: SPOT
      Networking:
        SubnetIds:
          - subnet-xx-xx-xx          << replace
      ComputeResources:
        - Name: compute-resource-1
          InstanceType: c5n.4xlarge
          MinCount: 0
          MaxCount: 10
          DisableSimultaneousMultithreading: true
SharedStorage:
  - MountDir: /shared
    Name: ebs-shared
    StorageType: Ebs
  - MountDir: /fsx
    Name: name2
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
The Yaml file for the c5n-4xlarge contains the settings as shown in the following diagram.#

Figure 1. Diagram of the YAML file used to configure a ParallelCluster with a c5n.large head node and c5n.4xlarge compute nodes using SPOT pricing (c5n-4xlarge yaml configuration).

Create the c5n-4xlarge pcluster#

pcluster create-cluster --cluster-configuration c5n-4xlarge.yaml --cluster-name cmaq --region us-east-1

Check on status of cluster#

pcluster describe-cluster --region=us-east-1 --cluster-name cmaq

After 5-10 minutes, you should see the status "clusterStatus": "CREATE_COMPLETE".

Start the compute nodes#

pcluster update-compute-fleet --region us-east-1 --cluster-name cmaq --status START_REQUESTED

Login to cluster#

Note

Replace the your-key.pem with your Key Pair.

pcluster ssh -v -Y -i ~/your-key.pem --cluster-name cmaq

Show compute nodes#

scontrol show nodes

Output:

NodeName=queue1-dy-compute-resource-1-10 CoresPerSocket=1 
   CPUAlloc=0 CPUTot=8 CPULoad=N/A
   AvailableFeatures=dynamic,c5n.4xlarge,compute-resource-1
   ActiveFeatures=dynamic,c5n.4xlarge,compute-resource-1
   Gres=(null)
   NodeAddr=queue1-dy-compute-resource-1-10 NodeHostName=queue1-dy-compute-resource-1-10 
   RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=8 Boards=1
   State=IDLE+CLOUD+POWERED_DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=queue1 
   BootTime=None SlurmdStartTime=None
   LastBusyTime=Unknown
   CfgTRES=cpu=8,mem=1M,billing=8
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Update the compute nodes#

Before building the software, verify that you can update the compute nodes from the c5n.4xlarge to c5n.18xlarge

Updating the compute node from a c5n.4xlarge (max 8 cpus per compute node) to a c5n.18xlarge (max 36 cpus per compute node) would allow the benchmark case to be run on up to 360 cpus (36 cpu/node x 10 nodes).

Note

Provisioning 10 c5n.18xlarge in one region may be difficult. In practice, it is possible to obtain 8 c5n.18xlarge compute nodes, with 36 cpu/node x 8 nodes = 288 cpus.

Note

The c5n.18xlarge requires that the Elastic Fabric Adapter (Efa) is enabled in the yaml file. Exit the ParallelCluster and return to your local command line.

If you only modified the yaml file to update the compute node identity, without making additional updates to the network and other settings, then you would not achieve all of the benefits of using the c5n.18xlarge compute node in the ParallelCluster.

For this reason, a yaml file that contains these advanced options to support the c5n.18xlarge compute instance will be used to upgrade the ParallelCluster from c5n.4xlarge to c5n.18xlarge.

Exit the cluster#

exit

Stop the compute nodes#

pcluster update-compute-fleet --region us-east-1 --cluster-name cmaq --status STOP_REQUESTED

Verify that the compute nodes are stopped#

pcluster describe-cluster --region=us-east-1 --cluster-name cmaq

Keep rechecking until you see the status "computeFleetStatus": "STOPPED".

Examine the differences between YAML files#

The YAML file for the c5n.large head node and c5n.18xlarge compute nodes contains additional settings compared to the YAML file that used the c5n.4xlarge as the compute node.

Note

  1. the c5n-18xlarge.yaml is configured to use SPOT instance pricing for the compute nodes.

  2. the c5n-18xlarge.yaml is configured to use the c5n.18xlarge as the compute node, with up to 10 compute nodes, specified by MaxCount: 10.

  3. the c5n-18xlarge.yaml is configured to disable multithreading. This restricts computing to physical CPUs rather than allowing the use of all virtual CPUs (72 virtual cpus reduced to 36 cpus).

  4. the c5n-18xlarge.yaml is configured to enable a placement group, to allow low inter-node latency.

  5. the c5n-18xlarge.yaml is configured to enable the elastic fabric adapter.

Figure 2. Diagram of the YAML file used to configure a ParallelCluster with a c5n.large head node and c5n.18xlarge compute nodes (36 CPUs per node).

c5n-18xlarge yaml configuration

Note

Notice that the c5n-18xlarge yaml configuration file contains a setting for PlacementGroup.

PlacementGroup:
          Enabled: true

A placement group is used to get the lowest inter-node latency.

A placement group guarantees that your instances are on the same networking backbone.

Edit the YAML file for the c5n.18xlarge#

You will need to edit the c5n-18xlarge.yaml to specify your KeyName and SubnetId (use the values generated in your new-hello-world.yaml). This yaml file specifies ubuntu2004 as the OS, c5n.large for the head node, and c5n.18xlarge for the compute nodes; it configures both a /shared EBS volume (for the software install) and a /fsx Lustre file system (for input and output data), and it enables the elastic fabric adapter.

vi c5n-18xlarge.yaml

Output:

Region: us-east-1
Image:
  Os: ubuntu2004
HeadNode:
  InstanceType: c5n.large
  Networking:
    SubnetId: subnet-018cfea3edf3c4765      <<<  replace
  DisableSimultaneousMultithreading: true
  Ssh:
    KeyName: centos                         <<<  replace
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 5
  SlurmQueues:
    - Name: queue1
      CapacityType: SPOT
      Networking:
        SubnetIds:
          - subnet-018cfea3edf3c4765         <<<  replace
        PlacementGroup:
          Enabled: true
      ComputeResources:
        - Name: compute-resource-1
          InstanceType: c5n.18xlarge
          MinCount: 0
          MaxCount: 10
          DisableSimultaneousMultithreading: true
          Efa:                                     <<< Note new section that enables elastic fabric adapter
            Enabled: true
            GdrSupport: false
SharedStorage:
  - MountDir: /shared
    Name: ebs-shared
    StorageType: Ebs
  - MountDir: /fsx
    Name: name2
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200

Create the c5n.18xlarge cluster#

Use the pcluster command to update cluster to use c5n.18xlarge compute node

pcluster update-cluster --region us-east-1 --cluster-name cmaq --cluster-configuration c5n-18xlarge.yaml

Verify that the compute nodes have been updated#

pcluster describe-cluster --region=us-east-1 --cluster-name cmaq

Output:

{
  "creationTime": "2022-02-23T17:39:42.953Z",
  "headNode": {
    "launchTime": "2022-02-23T17:48:03.000Z",
    "instanceId": "xxx-xx-xx",
    "publicIpAddress": "xx-xx-xx",
    "instanceType": "c5n.large",
    "state": "running",
    "privateIpAddress": "xx-xx-xx"
  },
  "version": "3.1.1",
  "clusterConfiguration": {
  },
  "tags": [
    {
      "value": "3.1.1",
      "key": "parallelcluster:version"
    }
  ],
  "cloudFormationStackStatus": "UPDATE_IN_PROGRESS",
  "clusterName": "cmaq",
  "computeFleetStatus": "STOPPED",
  "cloudformationStackArn": 
  "lastUpdatedTime": "2022-02-23T17:56:31.114Z",
  "region": "us-east-1",
  "clusterStatus": "UPDATE_IN_PROGRESS"
Wait 5 to 10 minutes for the update to be completed#

Keep rechecking the status until the update is completed and the clusterStatus is UPDATE_COMPLETE.

pcluster describe-cluster --region=us-east-1 --cluster-name cmaq

Output:

{
  "creationTime": "2022-02-23T17:39:42.953Z",
  "headNode": {
    "launchTime": "2022-02-23T17:48:03.000Z",
    "instanceId": "xx-xx-xxx",
    "publicIpAddress": "xx-xx-xx",
    "instanceType": "c5n.large",
    "state": "running",
    "privateIpAddress": "xx-xxx-xx"
  },
  "version": "3.1.1",
  "clusterConfiguration": {
  },
  "tags": [
    {
      "value": "3.1.1",
      "key": "parallelcluster:version"
    }
  ],
  "cloudFormationStackStatus": "UPDATE_COMPLETE",
  "clusterName": "cmaq",
  "computeFleetStatus": "STOPPED",
  "cloudformationStackArn": 
  "lastUpdatedTime": "2022-02-23T17:56:31.114Z",
  "region": "us-east-1",
  "clusterStatus": "UPDATE_COMPLETE"
}

Wait until UPDATE_COMPLETE message is received, then proceed.

Re-start the compute nodes#

pcluster update-compute-fleet --region us-east-1 --cluster-name cmaq --status START_REQUESTED

Verify status of cluster#

pcluster describe-cluster --region=us-east-1 --cluster-name cmaq

Wait until you see

computeFleetStatus": "RUNNING",

Login to c5n.18xlarge cluster#

Note

Replace the your-key.pem with your Key Pair.

pcluster ssh -v -Y -i ~/your-key.pem --cluster-name cmaq

Check to make sure elastic network adapter (ENA) is enabled#

modinfo ena

lspci

Check what modules are available on the cluster#

module avail

Load the openmpi module#

module load openmpi/4.1.1

Load the Libfabric module#

module load libfabric-aws/1.13.0amzn1.0
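To confirm that the Elastic Fabric Adapter is actually usable by MPI jobs, one check (a sketch; it assumes the compute fleet is running and that the libfabric module puts the fi_info utility on your PATH) is to list the EFA provider from a compute node:

srun -N 1 -n 1 fi_info -p efa | head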

Verify the gcc compiler version is greater than 8.0#

gcc --version

output:

gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 Copyright (C) 2019 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Install Input Data on ParallelCluster#

Verify the AWS CLI is available to obtain data from the AWS S3 Bucket#

Check to see if the aws command line interface (CLI) is installed

which aws

If it is installed, skip to the next step.

If it is not available please follow these instructions to install it.

See also

https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

cd /shared

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"

unzip awscliv2.zip

sudo ./aws/install

Verify you can run the aws command#

aws --help

If not, you may need to log out and log back in.

Note

If you do not have credentials, skip this step; the data is on a public bucket, so you do not need credentials.

Set up your credentials for using s3 copy (you can skip this if you do not have credentials)

aws configure

Copy Input Data from S3 Bucket to lustre filesystem#

Verify that the /fsx directory exists; this is a lustre file system where the I/O is fastest

ls /fsx

If you are unable to use the lustre file system, the data can be installed on the /shared volume, if you have resized the volume to be large enough to store the input and output data.

Install the parallel cluster scripts using the commands:

cd /shared

git clone -b main https://github.com/CMASCenter/pcluster-cmaq.git pcluster-cmaq

Use the S3 script to copy the CONUS input data from the CMAS s3 bucket#

Data will be saved to the /fsx file system

/shared/pcluster-cmaq/s3_scripts/s3_copy_nosign_conus_cmas_opendata_to_fsx.csh

Check that the resulting directory structure matches the run script.

Note

The CONUS 12US2 input data requires 44 GB of disk space.
(If you use the yaml file to import the data to the lustre file system rather than copying the data, you save this space.)

cd /fsx/data/CMAQ_Modeling_Platform_2016/CONUS/12US2/

du -sh

output:

44G     .

The CMAQ ParallelCluster is configured to have 1.2 terabytes of space on the /fsx file system (the minimum size allowed for lustre), to allow output from multiple runs to be stored.

For ParallelCluster: Import the Input data from a public S3 Bucket#

A second method is available to import the data onto the lustre file system by specifying the S3 bucket location in the yaml file, rather than using the above aws s3 copy commands.

See also

Example available in c5n-18xlarge.ebs_shared.fsx_import.yaml

cd /shared/pcluster-cmaq/
vi c5n-18xlarge.ebs_shared.fsx_import.yaml   

Section of the YAML file that specifies the name of the S3 Bucket:

  - MountDir: /fsx
    Name: name2
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      ImportPath: s3://cmas-cmaq-conus2-benchmark/data/CMAQ_Modeling_Platform_2016/CONUS    <<<  specify name of S3 bucket

This requires that the specified S3 bucket is publicly available.

Install CMAQ software and libraries on ParallelCluster#

Login to updated cluster#

Note

Replace the your-key.pem with your Key Pair.

pcluster ssh -v -Y -i ~/your-key.pem --cluster-name cmaq

Change shell to use .tcsh#

Note

This command depends on what OS you have installed on the ParallelCluster

sudo usermod -s /bin/tcsh ubuntu

or

sudo usermod -s /bin/tcsh centos

Log out and log back in to have the tcsh shell be active

exit

pcluster ssh -v -Y -i ~/your-key.pem --cluster-name cmaq

Check to see the tcsh shell is default#

echo $SHELL

The following instructions assume that you will be installing the software to a /shared/build directory

mkdir /shared/build

Install the pcluster-cmaq git repo to the /shared directory

cd /shared

Use a configuration file from the github repo that was cloned to your local machine#

git clone -b main https://github.com/CMASCenter/pcluster-cmaq.git pcluster-cmaq

cd pcluster-cmaq

Check to make sure elastic network adapter (ENA) is enabled#

modinfo ena

lspci

Check what modules are available on the cluster#

module avail

Load the openmpi module#

module load openmpi/4.1.1

Load the Libfabric module#

module load libfabric-aws/1.13.2amzn1.0

Verify the gcc compiler version is greater than 8.0#

gcc --version

Output:

gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 Copyright (C) 2019 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Change directories to install and build the libraries and CMAQ#

cd /shared/pcluster-cmaq

Note: the software build process for CMAQ integration and continuous deployment needs improvement. Currently the Unidata/UCAR netcdf-c download page is broken, and the location where the source code can be obtained may need to be updated from their website to the netcdf git repository. For this reason, this tutorial provides a snapshot image that was compiled on a c5n.xlarge head node, and runs on the c5n.18xlarge compute node. A different snapshot image would need to be created to compile and run CMAQ on a c6gn.16xlarge Arm-based AWS Graviton2 processor.

An alternative is to keep a copy of the source code for netCDF-C and netCDF-Fortran and all of the other underlying code on an S3 bucket, and to use custom bootstrap actions to build the software as the ParallelCluster is provisioned.

The following link provides instructions on how to create a custom bootstrap action to pre-load software from an S3 bucket to the ParallelCluster at the time that the cluster is created.

Custom Bootstrap Actions
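A minimal sketch of such a post-install script is shown below; everything here is illustrative: the bucket name and tarball names are placeholders, and the script assumes the AWS CLI and build tools are already present on the node.

#!/bin/csh
# Bootstrap (OnNodeConfigured) sketch: stage library source tarballs from an S3
# bucket you control, so the build does not depend on upstream download pages.
set BUCKET = s3://your-software-bucket
mkdir -p /shared/build/src
cd /shared/build/src
aws s3 cp $BUCKET/netcdf-c.tar.gz .
aws s3 cp $BUCKET/netcdf-fortran.tar.gz .
aws s3 cp $BUCKET/ioapi-3.2.tar.gz .
# ...then untar and build each library, following the same steps as the
# gcc_netcdf_cluster.csh and gcc_ioapi_cluster.csh scripts in this repo.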

Build netcdf C and netcdf F libraries - these scripts work for the gcc 8+ compiler#

Note, if this script fails, it is typically because Unidata has released a new version of netCDF-C or netCDF-Fortran, so the old version is no longer available, or because the name or location of the download file has changed.

./gcc_netcdf_cluster.csh

A .cshrc script with LD_LIBRARY_PATH was copied to your home directory; enter the shell again and check the environment variables that were set using#

cat ~/.cshrc

If the .cshrc was not created use the following command to create it#

cp dot.cshrc.pcluster ~/.cshrc

Execute the shell to activate it#

csh

env

Verify that you see the following setting#

Output:

LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/shared/build/netcdf/lib:/shared/build/netcdf/lib
Build I/O API library#

./gcc_ioapi_cluster.csh

Build CMAQ#

./gcc_cmaq_pcluster.csh

Check to confirm that the cmaq executable has been built

ls /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/BLD_CCTM_v533_gcc/*.exe

Run CMAQ#

Verify that you have an updated set of run scripts from the pcluster-cmaq repo#

To ensure you have the correct directory specified

cd /shared/pcluster-cmaq/run_scripts/cmaq533/

ls -lrt run*pcluster*

Compare with

ls -lrt /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run*pcluster*

If they are not identical, then copy the set from the repo

cp /shared/pcluster-cmaq/run_scripts/cmaq533/run*pcluster* /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/

Verify that the input data is imported to /fsx from the S3 Bucket#

cd /fsx/12US2

You need to make this directory and then link it to the path created when the data is copied from the S3 Bucket.

This is to make the paths consistent between the two methods of obtaining the input data.

mkdir -p /fsx/data/CONUS
cd /fsx/data/CONUS
ln -s /fsx/12US2 .

Create the output directory#

mkdir -p /fsx/data/output

Run the CONUS Domain on 180 pes#

cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/

sbatch run_cctm_2016_12US2.180pe.5x36.pcluster.csh

Note, it will take about 3-5 minutes for the compute nodes to start up. This is reflected in the status (ST) of CF (configuring).

Check the status in the queue#

squeue -u ubuntu

Output:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2    queue1     CMAQ   ubuntu CF       3:00      5 queue1-dy-computeresource1-[1-5]

After 5 minutes the status will change once the compute nodes have been created and the job is running

squeue -u ubuntu

Output:


             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2   compute     CMAQ   ubuntu  R      16:50      5 compute-dy-c5n18xlarge-[1-5]

The 180 pe job should take 60 minutes to run (30 minutes per day)

check on the status of the cluster using CloudWatch#

(optional)

<a href="https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=cmaq-us-east-1">Cloudwatch Dashboard</a>
<a href="https://aws.amazon.com/blogs/compute/monitoring-dashboard-for-aws-parallelcluster/">Monitoring Dashboard for ParallelCluster</a>
check the timings while the job is still running using the following command#

grep 'Processing completed' CTM_LOG_001*

Output:

            Processing completed...    8.8 seconds
            Processing completed...    7.4 seconds
When the job has completed, use tail to view the timing from the log file.#

tail run_cctmv5.3.3_Bench_2016_12US2.10x18pe.2day.pcluster.log

Output:

==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2015-12-22
End Day:   2015-12-23
Number of Simulation Days: 2
Domain Name:               12US2
Number of Grid Cells:      3409560  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       180
   All times are in seconds.

Num  Day        Wall Time
01   2015-12-22   2481.55
02   2015-12-23   2225.34
     Total Time = 4706.89
      Avg. Time = 2353.44
Submit a request for a 288 pe job ( 8 x 36 pe) or 8 nodes instead of 5 nodes#

sbatch run_cctm_2016_12US2.288pe.8x36.pcluster.csh

Check on the status in the queue#

squeue -u ubuntu

Note, it takes about 5 minutes for the compute nodes to be initialized; once the job is running, the status (ST) will change from CF (configuring) to R (running).

Output:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 6    queue1     CMAQ   ubuntu  R      24:57      8 queue1-dy-computeresource1-[1-8]
Check the status of the run#

tail CTM_LOG_025.v533_gcc_2016_CONUS_16x18pe_20151222

Check whether the scheduler thinks there are cpus or vcpus#

sinfo -lN

Output:

Wed Jan 05 19:34:05 2022
NODELIST                       NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
queue1-dy-computeresource1-1       1   queue1*       mixed 72     72:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-2       1   queue1*       mixed 72     72:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-3       1   queue1*       mixed 72     72:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-4       1   queue1*       mixed 72     72:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-5       1   queue1*       mixed 72     72:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-6       1   queue1*       mixed 72     72:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-7       1   queue1*       mixed 72     72:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-8       1   queue1*       mixed 72     72:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-9       1   queue1*       idle~ 72     72:1:1      1        0      1 dynamic, Scheduler health che
queue1-dy-computeresource1-10      1   queue1*       idle~ 72     72:1:1      1        0      1 dynamic, Scheduler health che

Note: on a c5n.18xlarge, the number of virtual cpus is 72.

If the YAML contains the Compute Resources Setting of DisableSimultaneousMultithreading: false, then all 72 vcpus will be used

If DisableSimultaneousMultithreading: true, then the scheduler sees 36 physical cpus per node and no virtual cpus.

Edit the run script to use #SBATCH --exclusive#

#SBATCH --exclusive
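With multithreading disabled (36 physical cpus per c5n.18xlarge node), the Slurm header of the 288 pe run script would look roughly like the following sketch (the job name and wall-clock limit are arbitrary choices):

#SBATCH --job-name=CMAQ
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=36
#SBATCH --exclusive
#SBATCH --time=12:00:00

@ NPCOL = 16; @ NPROW = 18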

Edit the yaml file to use DisableSimultaneousMultithreading: true#
Confirm that there are only 36 cpus available to the slurm scheduler#

sinfo -lN

Output:

Wed Jan 05 20:54:01 2022
NODELIST                       NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
queue1-dy-computeresource1-1       1   queue1*       idle~ 36     36:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-2       1   queue1*       idle~ 36     36:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-3       1   queue1*       idle~ 36     36:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-4       1   queue1*       idle~ 36     36:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-5       1   queue1*       idle~ 36     36:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-6       1   queue1*       idle~ 36     36:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-7       1   queue1*       idle~ 36     36:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-8       1   queue1*       idle~ 36     36:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-9       1   queue1*       idle~ 36     36:1:1      1        0      1 dynamic, none
queue1-dy-computeresource1-10      1   queue1*       idle~ 36     36:1:1      1        0      1 dynamic, none
Re-run the CMAQ CONUS Case#

cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/

Submit a request for a 288 pe job ( 8 x 36 pe) or 8 nodes instead of 10 nodes with full output#

sbatch run_cctm_2016_12US2.288pe.full.pcluster.csh

squeue -u ubuntu

Output:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 7    queue1     CMAQ   ubuntu CF       3:06      8 queue1-dy-computeresource1-[1-8]

Note, it takes about 5 minutes for the compute nodes to be initialized; once the job is running, the status (ST) will change from CF (configuring) to R (running).

squeue -u ubuntu

Output:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 7    queue1     CMAQ   ubuntu  R      24:57      8 queue1-dy-computeresource1-[1-8]
Check the status of the run#

tail CTM_LOG_025.v533_gcc_2016_CONUS_16x18pe_full_20151222

Once you have submitted a few benchmark runs and they have completed successfully, proceed to the next chapter.

Post-process and qa#

Post-process CMAQ and Install R#

Post-processing the CMAQ run and installing R and packages: instructions to install R and the packages needed for QA of the differences in output between two CMAQ runs.

Scripts to run combine and post processing#

Build the POST processing routines#

Copy the bldit script from the repo, as it was corrected to use CMAQv533 rather than CMAQv532.

cd /shared/build/openmpi_gcc/CMAQ_v533/POST/combine/scripts
cp /shared/pcluster-cmaq/run_scripts/bldit_combine.csh .

Run the bldit script for combine.

./bldit_combine.csh gcc |& tee ./bldit_combine.gcc.log

Copy the bldit script from the repo, as it was corrected to use CMAQv533 rather than CMAQv532

cd /shared/build/openmpi_gcc/CMAQ_v533/POST/calc_tmetric/scripts
cp /shared/pcluster-cmaq/run_scripts/bldit_calc_tmetric.csh .

Run the bldit script for calc_tmetric

./bldit_calc_tmetric.csh gcc |& tee ./bldit_calc_tmetric.gcc.log

Copy the bldit script from the repo

cd /shared/build/openmpi_gcc/CMAQ_v533/POST/hr2day/scripts
cp /shared/pcluster-cmaq/run_scripts/bldit_hr2day.csh .

Run the bldit script

./bldit_hr2day.csh gcc |& tee ./bldit_hr2day.gcc.log

Copy the bldit script from the repo and run

cd /shared/build/openmpi_gcc/CMAQ_v533/POST/bldoverlay/scripts
cp /shared/pcluster-cmaq/run_scripts/bldit_bldoverlay.csh .
./bldit_bldoverlay.csh gcc |& tee ./bldit_bldoverlay.gcc.log
Scripts to post-process CMAQ output#

Instructions on how to Post-process CMAQ using the utilities under the POST directory

Note

The post-processing analysis is run on the head node.

Verify that the compute nodes are no longer running if you have completed all of the benchmark runs

squeue

You should see that no jobs are running.

Show compute nodes

scontrol show nodes

Edit, Build and Run the POST processing routines#
setenv DIR /shared/build/openmpi_gcc/CMAQ_v533/

cd $DIR/POST/combine/scripts
sed -i 's/v532/v533/g' bldit_combine.csh
./bldit_combine.csh gcc |& tee ./bldit_combine.gcc.log

cp run_combine.csh run_combine_conus.csh
sed -i 's/v532/v533/g' run_combine_conus.csh
sed -i 's/Bench_2016_12SE1/2016_CONUS_16x18pe/g' run_combine_conus.csh
sed -i 's/intel/gcc/g' run_combine_conus.csh
sed -i 's/2016-07-01/2015-12-22/g' run_combine_conus.csh
sed -i 's/2016-07-14/2015-12-23/g' run_combine_conus.csh
setenv CMAQ_DATA /fsx/data
./run_combine_conus.csh

cd $DIR/POST/calc_tmetric/scripts
sed -i 's/v532/v533/g' bldit_calc_tmetric.csh
./bldit_calc_tmetric.csh gcc |& tee ./bldit_calc_tmetric.gcc.log

cp run_calc_tmetric.csh run_calc_tmetric_conus.csh
sed -i 's/v532/v533/g' run_calc_tmetric_conus.csh
sed -i 's/Bench_2016_12SE1/2016_CONUS_16x18pe/g' run_calc_tmetric_conus.csh
sed -i 's/intel/gcc/g' run_calc_tmetric_conus.csh
sed -i 's/201607/201512/g' run_calc_tmetric_conus.csh
setenv CMAQ_DATA /fsx/data
./run_calc_tmetric_conus.csh

cd $DIR/POST/hr2day/scripts
sed -i 's/v532/v533/g' bldit_hr2day.csh
./bldit_hr2day.csh gcc |& tee ./bldit_hr2day.gcc.log

cp run_hr2day.csh run_hr2day_conus.csh
sed -i 's/v532/v533/g' run_hr2day_conus.csh
sed -i 's/Bench_2016_12SE1/2016_CONUS_16x18pe/g' run_hr2day_conus.csh
sed -i 's/intel/gcc/g' run_hr2day_conus.csh
sed -i 's/2016182/2015356/g' run_hr2day_conus.csh
sed -i 's/2016195/2015357/g' run_hr2day_conus.csh
setenv CMAQ_DATA /fsx/data
./run_hr2day_conus.csh

cd $DIR/POST/bldoverlay/scripts
sed -i 's/v532/v533/g' bldit_bldoverlay.csh

./bldit_bldoverlay.csh gcc |& tee ./bldit_bldoverlay.gcc.log

cp run_bldoverlay.csh run_bldoverlay_conus.csh
sed -i 's/v532/v533/g' run_bldoverlay_conus.csh
sed -i 's/Bench_2016_12SE1/2016_CONUS_16x18pe/g' run_bldoverlay_conus.csh
sed -i 's/intel/gcc/g' run_bldoverlay_conus.csh
sed -i 's/2016-07-01/2015-12-22/g' run_bldoverlay_conus.csh
sed -i 's/2016-07-02/2015-12-23/g' run_bldoverlay_conus.csh
setenv CMAQ_DATA /fsx/data
./run_bldoverlay_conus.csh

Install R, Rscripts and Packages#

First check to see if R is already installed.

R --version

If not, install R on Ubuntu 20.04 using the instructions available in the link below.

sudo apt install build-essential

See also

ubuntu install

Install geospatial dependencies

be sure to have an updated system

sudo apt-get update && sudo apt-get upgrade -y

install PROJ

sudo apt-get install libproj-dev proj-data proj-bin unzip -y

optionally, install (selected) datum grid files

sudo apt-get install proj-data

install GEOS

sudo apt-get install libgeos-dev -y

install GDAL

sudo apt-get install libgdal-dev python3-gdal gdal-bin -y

install PDAL (optional)

sudo apt-get install libpdal-dev pdal libpdal-plugin-python -y

recommended to give Python3 precedence over Python2 (which is end-of-life since 2019)

sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 1

Install further compilation dependencies (Ubuntu 20.04)

sudo apt-get install \
  build-essential \
  flex make bison gcc libgcc1 g++ ccache \
  python3 python3-dev \
  python3-opengl python3-wxgtk4.0 \
  python3-dateutil libgsl-dev python3-numpy \
  wx3.0-headers wx-common libwxgtk3.0-gtk3-dev \
  libwxbase3.0-dev   \
  libncurses5-dev \
  libbz2-dev \
  zlib1g-dev gettext \
  libtiff5-dev libpnglite-dev \
  libcairo2 libcairo2-dev \
  sqlite3 libsqlite3-dev \
  libpq-dev \
  libreadline6-dev libfreetype6-dev \
  libfftw3-3 libfftw3-dev \
  libboost-thread-dev libboost-program-options-dev  libpdal-dev\
  subversion libzstd-dev \
  checkinstall \
  libglu1-mesa-dev libxmu-dev \
  ghostscript wget -y

For NVIZ on Ubuntu 20.04:

sudo apt-get install \
  ffmpeg libavutil-dev ffmpeg2theora \
  libffmpegthumbnailer-dev \
  libavcodec-dev \
  libxmu-dev \
  libavformat-dev libswscale-dev

The ncdf4 package REQUIRES the netcdf library to be version 4 or above AND installed with HDF5 support (i.e., the netcdf library must be compiled with the --enable-netcdf-4 flag). Building netcdf with HDF5 support requires curl.

sudo apt-get install curl
sudo apt-get install libcurl4-openssl-dev

cd /shared/pcluster-cmaq

Install libraries with hdf5 support

Load modules

module load openmpi/4.1.1

module load libfabric-aws/1.13.2amzn1.0

./gcc_install_hdf5.pcluster.csh

Install ncdf4 package from source:

cd /shared/pcluster-cmaq/qa_scripts/R_packages

sudo R CMD INSTALL ncdf4_1.13.tar.gz --configure-args="--with-nc-config=/shared/build-hdf5/install/bin/nc-config"

Install packages used in the R scripts

sudo -i R
install.packages("rgdal")
install.packages("M3")
install.packages("fields")
install.packages("ggplot2")
install.packages("patchwork")

To view the plots, install imagemagick

sudo apt-get install imagemagick

Install X11

sudo apt install x11-apps

Enable X11 forwarding

sudo vi /etc/ssh/sshd_config

add line

X11Forwarding yes

Verify that it was added

sudo cat /etc/ssh/sshd_config | grep -i X11Forwarding

Restart ssh

sudo service ssh restart

Exit the cluster

exit

Re-login to the cluster

pcluster ssh -v -Y -i ~/your-key.pem --cluster-name cmaq

Test display

xclock

Note, it looks like the examples use the older config/CLI version 2 format; they would need to be converted to the yaml format to try this out.

The bug report says that you can use a custom post-installation script to re-enable X11 forwarding.

QA CMAQ#

Quality Assurance: Comparing the output of two CMAQ runs.

Quality Assurance#

Instructions on how to verify a successful CMAQ run on ParallelCluster.

Run m3diff to compare the output data for two runs that have different values for NPCOL#

cd /fsx/data/output
ls */*ACONC*
setenv AFILE output_CCTM_v533_gcc_2016_CONUS_10x18pe_full/CCTM_ACONC_v533_gcc_2016_CONUS_10x18pe_full_20151222.nc
setenv BFILE output_CCTM_v533_gcc_2016_CONUS_16x18pe_full/CCTM_ACONC_v533_gcc_2016_CONUS_16x18pe_full_20151222.nc
m3diff

hit return several times to accept the default options

grep A:B REPORT

Should see all zeros.
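If you want to script the comparison instead of pressing return at each prompt, one option (a sketch; it assumes m3diff will accept its interactive responses on standard input, as the I/O API tools generally do) is to feed it empty lines:

yes '' | m3diff        # an empty line accepts each default prompt
grep A:B REPORT        # all-zero rows indicate identical results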

Recompiled CMAQ using the -march=native compiler option for the gcc compiler, but still see differences in answers. The answers are the same (the differences are all zeros) if the domain decomposition uses the same NPCOL; here, NPCOL differs (10 vs 16).

This behavior is different from what was observed when removing the -march=native compiler option for gcc on the AMD CycleCloud HBv3 processor. On CycleCloud, if CMAQ is compiled with -march=native removed from the compiler options, then the answers match even when NPCOL differs.

NPCOL  =  10; @ NPROW = 18
NPCOL  =  16; @ NPROW = 18

grep A:B REPORT

output

 A:B  4.54485E-07@(316, 27, 1) -3.09199E-07@(318, 25, 1)  1.42188E-11  2.71295E-09
 A:B  4.73112E-07@(274,169, 1) -2.36556E-07@(200,113, 1)  3.53046E-11  3.63506E-09
 A:B  7.37607E-07@(226,151, 1) -2.98955E-07@(274,170, 1)  3.68974E-11  5.29013E-09
 A:B  3.15718E-07@(227,150, 1) -2.07219E-07@(273,170, 1)  2.52149E-11  3.60005E-09
 A:B  2.65893E-07@(299,154, 1) -2.90573E-07@(201,117, 1)  1.78237E-12  4.15726E-09
 A:B  3.11527E-07@(300,156, 1) -7.43195E-07@(202,118, 1) -9.04127E-12  6.38413E-09
 A:B  4.59142E-07@(306,160, 1) -7.46921E-07@(203,119, 1) -2.57731E-11  8.06486E-09
 A:B  5.25266E-07@(316,189, 1) -5.90459E-07@(291,151, 1) -2.67232E-11  9.36312E-09
 A:B  5.31785E-07@(294,156, 1) -6.33299E-07@(339,201, 1)  3.01644E-11  1.12862E-08
 A:B  1.01421E-06@(297,168, 1) -5.08502E-07@(317,190, 1)  9.97206E-11  1.35965E-08
 A:B  1.28523E-06@(297,168, 1) -2.96347E-06@(295,160, 1)  1.57728E-10  1.88143E-08
 A:B  1.69873E-06@(298,169, 1) -6.47269E-07@(343,205, 1)  1.99673E-10  1.96824E-08
 A:B  2.10665E-06@(298,170, 1) -8.53091E-07@(290,133, 1)  2.75009E-10  2.38824E-08
 A:B  2.77534E-06@(298,166, 1) -1.38395E-06@(339,201, 1)  4.32676E-10  3.19499E-08
 A:B  4.05498E-06@(298,166, 1) -2.29478E-06@(292,134, 1)  5.94668E-10  4.56470E-08
 A:B  1.64844E-06@(380,195, 1) -1.24970E-05@(312,119, 1)  2.99392E-10  6.27748E-08
 A:B  2.40747E-06@(350,207, 1) -2.38372E-06@(313,120, 1) -1.23841E-11  4.06153E-08
 A:B  2.54810E-06@(353,207, 1) -1.68476E-06@(258,179, 1)  4.69896E-10  4.00601E-08
 A:B  2.92342E-06@(259,180, 1) -1.84122E-06@(258,180, 1)  3.00556E-10  3.75263E-08
 A:B  4.37256E-06@(259,180, 1) -1.51433E-06@(258,180, 1)  3.44610E-10  4.03537E-08
 A:B  5.51227E-06@(313,160, 1) -1.60793E-06@(312,160, 1)  6.49188E-10  4.60905E-08
 A:B  5.58607E-06@(259,182, 1) -6.47921E-06@(278,186, 1)  3.40245E-11  4.89799E-08
 A:B  3.61912E-06@(259,183, 1) -4.28502E-06@(278,187, 1)  2.10923E-10  4.86613E-08
 A:B  2.02795E-06@(278,185, 1) -3.63495E-06@(278,187, 1)  5.26566E-10  5.32271E-08
 A:B  1.25729E-07@(225,183, 1) -8.38190E-08@(200,114, 1)  2.04043E-12  7.34096E-10
 A:B  9.66247E-08@(225,151, 1) -4.09782E-07@(225,182, 1) -6.33767E-12  1.73157E-09
 A:B  2.10712E-07@(225,151, 1) -2.71946E-07@(200,114, 1) -5.41618E-12  1.65727E-09
 A:B  5.45755E-07@(225,182, 1) -1.04494E-06@(200,115, 1) -1.47753E-11  4.57864E-09
 A:B  4.30271E-07@(200,114, 1) -7.39470E-07@(200,116, 1) -3.24581E-11  5.33182E-09
 A:B  7.71135E-07@(225,181, 1) -7.92556E-07@(201,117, 1) -2.74377E-11  6.31589E-09
 A:B  6.33299E-07@(225,182, 1) -6.53090E-07@(202,118, 1) -2.86715E-11  4.42746E-09
 A:B  6.25849E-07@(225,182, 1) -2.21189E-07@(225,184, 1) -5.32567E-12  2.66906E-09
 A:B  3.64147E-07@(306,158, 1) -3.12924E-07@(175,  2, 1)  3.15538E-12  2.74893E-09

Compare CMAQv533 run with -march=native compiler flag removed.

more REPORT.6x12pe_vs_9x12pe

     FILE A:  AFILE (output_CCTM_v533_gcc_2016_CONUS_6x12pe/CCTM_ACONC_v533_gcc_2016_CONUS_6x12pe_20151222.nc)

     FILE B:  BFILE (output_CCTM_v533_gcc_2016_CONUS_9x12pe/CCTM_ACONC_v533_gcc_2016_CONUS_9x12pe_20151222.nc)


     -----------------------------------------------------------

 Date and time  2015356:000000 (0:00:00   Dec. 22, 2015)
 A:AFILE/NO2  vs  B:BFILE/NO2  vs  (A - B)
      MAX        @(  C,  R, L)  Min        @(  C,  R, L)  Mean         Sigma
 A    5.19842E-02@(127, 62, 1)  1.56425E-05@(258,239, 1)  2.27752E-03  3.47514E-03
 B    5.19842E-02@(127, 62, 1)  1.56425E-05@(258,239, 1)  2.27752E-03  3.47514E-03
 A:B  2.27243E-07@(264,163, 1) -5.42961E-07@(264,165, 1)  9.77191E-12  2.54661E-09


 Date and time  2015356:010000 (1:00:00   Dec. 22, 2015)
 A:AFILE/NO2  vs  B:BFILE/NO2  vs  (A - B)
      MAX        @(  C,  R, L)  Min        @(  C,  R, L)  Mean         Sigma
 A    6.55882E-02@(128, 62, 1)  1.29276E-05@(260,245, 1)  2.56435E-03  4.35617E-03
 B    6.55882E-02@(128, 62, 1)  1.29276E-05@(260,245, 1)  2.56435E-03  4.35617E-03
 A:B  2.76603E-07@(197,102, 1) -2.45869E-07@(264,163, 1)  6.01613E-12  1.72038E-09


 Date and time  2015356:020000 (2:00:00   Dec. 22, 2015)
 A:AFILE/NO2  vs  B:BFILE/NO2  vs  (A - B)
      MAX        @(  C,  R, L)  Min        @(  C,  R, L)  Mean         Sigma
 A    6.86494E-02@(128, 62, 1)  1.03682E-05@(262,243, 1)  2.62483E-03  4.58060E-03
 B    6.86494E-02@(128, 62, 1)  1.03682E-05@(262,243, 1)  2.62483E-03  4.58060E-03
 A:B  3.27826E-07@(197,102, 1) -3.79980E-07@(264,157, 1)  7.99431E-12  2.56835E-09


 Date and time  2015356:030000 (3:00:00   Dec. 22, 2015)
 A:AFILE/NO2  vs  B:BFILE/NO2  vs  (A - B)
      MAX        @(  C,  R, L)  Min        @(  C,  R, L)  Mean         Sigma
 A    6.58664E-02@( 48, 83, 1)  8.24041E-06@(265,241, 1)  2.57739E-03  4.54646E-03
 B    6.58664E-02@( 48, 83, 1)  8.24041E-06@(265,241, 1)  2.57739E-03  4.54646E-03
 A:B  5.47618E-07@(264,156, 1) -3.96743E-07@(264,160, 1)  9.99427E-12  3.22602E-09

Reconfirmed that, with the -march=native flag removed, we still get matching answers if NPCOL is the same.

more REPORT_6x12pe_6x18pe

     FILE A:  AFILE (output_CCTM_v533_gcc_2016_CONUS_6x12pe/CCTM_ACONC_v533_gcc_2016_CONUS_6x12pe_20151222.nc)
     FILE B:  BFILE (output_CCTM_v533_gcc_2016_CONUS_6x18pe/CCTM_ACONC_v533_gcc_2016_CONUS_6x18pe_20151222.nc)
     -----------------------------------------------------------
 Date and time  2015356:000000 (0:00:00   Dec. 22, 2015)
 A:AFILE/NO2  vs  B:BFILE/NO2  vs  (A - B)
      MAX        @(  C,  R, L)  Min        @(  C,  R, L)  Mean         Sigma 
 A    5.19842E-02@(127, 62, 1)  1.56425E-05@(258,239, 1)  2.27752E-03  3.47514E-03
 B    5.19842E-02@(127, 62, 1)  1.56425E-05@(258,239, 1)  2.27752E-03  3.47514E-03
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00


 Date and time  2015356:010000 (1:00:00   Dec. 22, 2015)
 A:AFILE/NO2  vs  B:BFILE/NO2  vs  (A - B)
      MAX        @(  C,  R, L)  Min        @(  C,  R, L)  Mean         Sigma 
 A    6.55882E-02@(128, 62, 1)  1.29276E-05@(260,245, 1)  2.56435E-03  4.35617E-03
 B    6.55882E-02@(128, 62, 1)  1.29276E-05@(260,245, 1)  2.56435E-03  4.35617E-03
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00


 Date and time  2015356:020000 (2:00:00   Dec. 22, 2015)
 A:AFILE/NO2  vs  B:BFILE/NO2  vs  (A - B)
      MAX        @(  C,  R, L)  Min        @(  C,  R, L)  Mean         Sigma 
 A    6.86494E-02@(128, 62, 1)  1.03682E-05@(262,243, 1)  2.62483E-03  4.58060E-03
 B    6.86494E-02@(128, 62, 1)  1.03682E-05@(262,243, 1)  2.62483E-03  4.58060E-03
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00


 Date and time  2015356:030000 (3:00:00   Dec. 22, 2015)
 A:AFILE/NO2  vs  B:BFILE/NO2  vs  (A - B)
      MAX        @(  C,  R, L)  Min        @(  C,  R, L)  Mean         Sigma 
 A    6.58664E-02@( 48, 83, 1)  8.24041E-06@(265,241, 1)  2.57739E-03  4.54646E-03
 B    6.58664E-02@( 48, 83, 1)  8.24041E-06@(265,241, 1)  2.57739E-03  4.54646E-03
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00

Use m3diff to compare two runs that have the same NPCOL#

setenv AFILE /fsx/data/output/output_CCTM_v533_gcc_2016_CONUS_16x16pe/CCTM_ACONC_v533_gcc_2016_CONUS_16x16pe_20151222.nc
setenv BFILE /fsx/data/output/output_CCTM_v533_gcc_2016_CONUS_16x18pe/CCTM_ACONC_v533_gcc_2016_CONUS_16x18pe_20151222.nc
m3diff
grep A:B REPORT
NPCOL  =  16; @ NPROW = 16
NPCOL  =  16; @ NPROW = 18

NPCOL was the same for both runs

Resulted in zero differences in the output

 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00
 A:B  0.00000E+00@(  1,  0, 0)  0.00000E+00@(  1,  0, 0)  0.00000E+00  0.00000E+00

Run an R script to create the box plots and spatial plots comparing the output of two runs#

Examine the script to create the box plots and spatial plots and edit to use the output that you have generated in your runs.

First check what output is available on your ParallelCluster

If your I/O directory is /fsx

ls -rlt /fsx/data/output/*/*ACONC*

If your I/O directory is /shared/data

ls -lrt /shared/data/output/*/*ACONC*

Then edit the script to use the output filenames available.

vi compare_EQUATES_benchmark_output_CMAS_pcluster.r

#Directory, file name, and label for first model simulation (sim1)
sim1.label <- "CMAQ 16x16pe"
sim1.dir <- "/fsx/data/output/output_CCTM_v533_gcc_2016_CONUS_16x16pe/"
sim1.file <- paste0(sim1.dir,"CCTM_ACONC_v533_gcc_2016_CONUS_16x16pe_20151222.nc")

#Directory, file name, and label for second model simulation (sim2)
sim2.label <- "CMAQ 16x18pe"
sim2.dir <- "/fsx/data/output/output_CCTM_v533_gcc_2016_CONUS_16x18pe/"
sim2.file <- paste0(sim2.dir,"CCTM_ACONC_v533_gcc_2016_CONUS_16x18pe_20151222.nc")

Run the R script

cd /shared/pcluster-cmaq/qa_scripts
Rscript compare_EQUATES_benchmark_output_CMAS_pcluster.r

Note: your plots will be created based on the setting of the output directory in the script

An example set of plots is available, but these instructions can be modified to use the output generated by the script above.

To view the PDF plots use the command:

cd /shared/pcluster-cmaq/qa_scripts/qa_plots
gio open O3_MAPS_CMAQ*.pdf

To convert the PDF to a jpeg image use the script convert.csh.

cd /shared/pcluster-cmaq/qa_scripts/qa_plots

First examine what the convert.csh script is doing

more convert.csh

output:

#!/bin/csh
# Convert every PDF in the current directory to 600 dpi JPEG images (one per page)
foreach name (`ls *.pdf`)
  set name2=`basename $name .pdf`   # output file prefix (PDF name without the .pdf extension)
  echo $name
  echo $name2
  pdftoppm -jpeg -r 600 $name $name2
end

Run the convert script.

./convert.csh

When NPCOL is fixed, we are seeing no difference in the answers.

Example comparison using: 6x6 compared to 6x9

cd /shared/pcluster-cmaq/docs/qa_plots/box_plots/6x6_vs_6x9/

Use display to view the plots

display O3_BOXPLOT_CMAQv533-GCC-6x6pe_vs_CMAQv533-GCC-6x9pe.jpeg

They are also displayed in the following plots:

Box Plot for O3 when NPCOL is identical

O3_BOXPLOT_CMAQv533-GCC-6x6pe_vs_CMAQv533-GCC-6x9pe.jpeg

Box plot shows no difference between ACONC output for a CMAQv5.3.3 run using different PE configurations, as long as NPCOL is fixed. This is true for all species that were plotted (AOTHRJ, CO, NH3, NO2, O3, OH, SO2).

Example of plots created when NPCOL is different between simulation 1 and simulation 2.

Box plot shows a difference between ACONC output for a CMAQv5.3.3 run using different PE configurations when NPCOL is different.

ANO3J

ANO3J_BOXPLOT_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe.jpeg

AOTHRJ

AOTHRJ_BOXPLOT_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe.jpeg

CO

CO_BOXPLOT_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe.jpeg

NH3

NH3_BOXPLOT_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe.jpeg

NO2

NO2_BOXPLOT_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe.jpeg

O3

O3_BOXPLOT_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe.jpeg

OH

OH_BOXPLOT_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe.jpeg

SO2

SO2_BOXPLOT_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe.jpeg

Example of Spatial Plots for when NPCOL is different

Note, the differences are small, but they grow with time. There is one plot for each of the 24 hours. The plot that contains the most differences will be in the bottom right of the panel for each species. You will need to zoom in to see the differences, as most of the grid cells do not have any difference and are displayed as grey. For the NO2 plot, you can see the most differences over the state of Pennsylvania on 12/22/2015 at hour 23:00, with the magnitude of the maximum difference around +/- 4.E-6.

cd ../spatial_plots/12x9_vs_8x9
display ANO3J_MAPS_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe-1.jpg

ANO3J

ANO3J_MAPS_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe-1.jpg

AOTHRJ

AOTHRJ_MAPS_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe-1.jpg

CO

CO_MAPS_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe-1.jpg

NH3

NH3_MAPS_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe-1.jpg

NO2

NO2_MAPS_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe-1.jpg

O3

O3_MAPS_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe-1.jpg

OH

OH_MAPS_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe-1.jpg

SO2

SO2_MAPS_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe-1.jpg

Compare Timing of CMAQ Routines#

Compare the timing of CMAQ Routines for two different run configurations.

Parse timings from the log file#

Compare the timings for the CONUS ParallelCluster Runs#

Note

ParallelCluster Configurations can impact the model run times.

It is up to the user to decide which model run configurations are used to run CMAQ on the ParallelCluster. The following configuration choices may impact the run time of the model.

  • Using different PE configurations, using DisableSimultaneousMultithreading: true in yaml file, using 36 cpus - no virtual cpus

       NPCOL x NPROW  , CPUs  , SBATCH Command

    • [ ] 10x18 , 180 , #SBATCH --nodes=5, #SBATCH --ntasks-per-node=36

    • [ ] 16x16 , 256 , #SBATCH --nodes=8, #SBATCH --ntasks-per-node=32

    • [ ] 16x18 , 288 , #SBATCH --nodes=8, #SBATCH --ntasks-per-node=36

  • Using different compute nodes

    • [ ] c5n.18xlarge (72 virtual cpus, 36 cpus) - with Elastic Fabric Adapter

    • [ ] c5n.9xlarge (36 virtual cpus, 18 cpus) - with Elastic Fabric Adapter

    • [ ] c5n.4xlarge (16 virtual cpus, 4 cpus) - without Elastic Fabric Adapter

  • With and without the SBATCH --exclusive option

  • With and without Elastic Fabric and Elastic Network Adapter turned on

  • With and without network placement turned on

  • Using different local storage options and copying versus importing data to lustre

    • [ ] input data imported from S3 bucket to lustre

    • [ ] input data copied from S3 bucket to lustre

    • [ ] input data copied from S3 bucket to an EBS volume

  • Using different yaml settings for slurm

    • [ ] DisableSimultaneousMultithreading: true

    • [ ] DisableSimultaneousMultithreading: false
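
As a sketch of how an NPCOL x NPROW choice maps onto the Slurm settings listed above, the 10x18 (180 core) case on c5n.18xlarge nodes could be requested as shown below. The directive values come from the first list item, and the NPCOL_NPROW setting follows the convention used in the CMAQ run scripts; treat this as an illustrative excerpt rather than a complete run script.

#!/bin/csh -f
## Sketch only: Slurm header for the 10x18 PE configuration on c5n.18xlarge
## (36 physical cores per node when DisableSimultaneousMultithreading: true).
## NPCOL x NPROW must equal nodes x ntasks-per-node (10 x 18 = 5 x 36 = 180).
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=36
#SBATCH --exclusive

set NPCOL = 10
set NPROW = 18
setenv NPCOL_NPROW "$NPCOL $NPROW"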

Edit the R script#

First check to see what log files are available:

ls -lrt /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/*.log

Modify the name of the log file to match what is available on your system.

cd /shared/pcluster-cmaq/qa_scripts
vi parse_timing_pcluster.r

Edit the following section of the script to specify the log file names available on your ParallelCluster

sens.dir  <- '/shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/'
base.dir  <- '/shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/'
files     <- dir(sens.dir, pattern ='run_cctmv5.3.3_Bench_2016_12US2.108.12x9pe.2day.pcluster.log' )
b.files <- dir(base.dir,pattern='run_cctmv5.3.3_Bench_2016_12US2.108.6x18pe.2day.pcluster.log')
#Compilers <- c('intel','gcc','pgi')
Compilers <- c('gcc')
# name of the base case timing. I am using the current master branch from the CMAQ_Dev repository.
# The project directory name is used for the sensitivity case.
base.name <- '6x18pe'
sens.name <- '12x9pe'
Run the parse_timing_pcluster.r script to examine timings of each science process in CMAQ#

Rscript parse_timing_pcluster.r

Timing Plot Comparing GCC run on 16 x 8 pe versus 8 x 16 pe

gcc_16x8_vs_8x16

Timing Plot Comparing GCC run on 8 x 8 pe versus 8 x 16 pe

gcc_8x8_vs_8x16

Timing Plot Comparing GCC run on 9 x 8 pe versus 8 x 9 pe

gcc_9x8_vs_8x9

Copy Output to S3 Bucket#

Copy output from ParallelCluster to an S3 Bucket

Copy Output Data and Run script logs to S3 Bucket#

Note

You need permissions to copy to an S3 Bucket.

Be sure you enter your access credentials on the ParallelCluster by running:

aws configure
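
The command prompts for four values. A typical session looks like the following; the region and output format shown here are only examples, so use the values for your account.

aws configure
AWS Access Key ID [None]: <your access key ID>
AWS Secret Access Key [None]: <your secret access key>
Default region name [None]: us-east-1
Default output format [None]: json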

Currently, the bucket listed below has ACL turned off

See also

S3 disable ACL

See example of sharing bucket across accounts.

Copy scripts and logs to /fsx#

The CTM_LOG files don’t contain any information about the compute nodes that the jobs were run on. It is therefore important to keep a record of the NPCOL, NPROW settings and of the number of nodes and tasks specified in the run script (#SBATCH --nodes=16, #SBATCH --ntasks-per-node=8). It is also important to know which volume was used to read and write the input and output data, so it is recommended to save a copy of the standard out and error logs, and a copy of the run scripts, to the OUTPUT directory for each benchmark.

cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts
cp run*.log /fsx/data/output
cp run*.csh /fsx/data/output

Examine the output files#

Note

The following commands will vary depending on what APPL or domain decomposition was run

cd /fsx/data/output/output_CCTM_v533_gcc_2016_CONUS_16x18pe
ls -lht

output:

total 173G
drwxrwxr-x 2 ubuntu ubuntu 145K Jan  5 23:53 LOGS
-rw-rw-r-- 1 ubuntu ubuntu 3.2G Jan  5 23:53 CCTM_CGRID_v533_gcc_2016_CONUS_16x18pe_20151223.nc
-rw-rw-r-- 1 ubuntu ubuntu 2.2G Jan  5 23:52 CCTM_ACONC_v533_gcc_2016_CONUS_16x18pe_20151223.nc
-rw-rw-r-- 1 ubuntu ubuntu  78G Jan  5 23:52 CCTM_CONC_v533_gcc_2016_CONUS_16x18pe_20151223.nc
-rw-rw-r-- 1 ubuntu ubuntu 348M Jan  5 23:52 CCTM_APMDIAG_v533_gcc_2016_CONUS_16x18pe_20151223.nc
-rw-rw-r-- 1 ubuntu ubuntu 1.5G Jan  5 23:52 CCTM_WETDEP1_v533_gcc_2016_CONUS_16x18pe_20151223.nc
-rw-rw-r-- 1 ubuntu ubuntu 1.7G Jan  5 23:52 CCTM_DRYDEP_v533_gcc_2016_CONUS_16x18pe_20151223.nc
-rw-rw-r-- 1 ubuntu ubuntu 3.6K Jan  5 23:22 CCTM_v533_gcc_2016_CONUS_16x18pe_20151223.cfg
-rw-rw-r-- 1 ubuntu ubuntu 3.2G Jan  5 23:22 CCTM_CGRID_v533_gcc_2016_CONUS_16x18pe_20151222.nc
-rw-rw-r-- 1 ubuntu ubuntu 2.2G Jan  5 23:21 CCTM_ACONC_v533_gcc_2016_CONUS_16x18pe_20151222.nc
-rw-rw-r-- 1 ubuntu ubuntu  78G Jan  5 23:21 CCTM_CONC_v533_gcc_2016_CONUS_16x18pe_20151222.nc
-rw-rw-r-- 1 ubuntu ubuntu 348M Jan  5 23:21 CCTM_APMDIAG_v533_gcc_2016_CONUS_16x18pe_20151222.nc
-rw-rw-r-- 1 ubuntu ubuntu 1.5G Jan  5 23:21 CCTM_WETDEP1_v533_gcc_2016_CONUS_16x18pe_20151222.nc
-rw-rw-r-- 1 ubuntu ubuntu 1.7G Jan  5 23:21 CCTM_DRYDEP_v533_gcc_2016_CONUS_16x18pe_20151222.nc
-rw-rw-r-- 1 ubuntu ubuntu 3.6K Jan  5 22:49 CCTM_v533_gcc_2016_CONUS_16x18pe_20151222.cfg

Check disk space

 du -sh
173G    .

Copy the output to an S3 Bucket#

Examine the example script

cd /shared/pcluster-cmaq/s3_scripts
cat s3_upload.c5n.18xlarge.csh

output:

#!/bin/csh -f
# Script to upload output data to S3 bucket
# NOTE: a new bucket needs to be created to store each set of cluster runs

aws s3 mb s3://c5n-head-c5n.18xlarge-compute-conus-output
aws s3 cp --recursive /fsx/data/output/ s3://c5n-head-c5n.18xlarge-compute-conus-output
aws s3 cp --recursive /fsx/data/POST s3://c5n-head-c5n.18xlarge-compute-conus-output

If you do not have permissions to write to the s3 bucket, you may need to ask the administrator of your account to add S3 Bucket writing permissions.

Run the script to copy all of the CMAQ output and logs to the S3 bucket.

./s3_upload.c5n.18xlarge.csh
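
To verify that the upload completed, you can list the contents of the bucket. The sketch below reuses the bucket name from the example script; substitute the name of your own bucket.

aws s3 ls s3://c5n-head-c5n.18xlarge-compute-conus-output --recursive --human-readable --summarize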

Logout and Delete ParallelCluster#

Logout and delete the ParallelCluster when you are done to avoid incurring costs.

Logout of cluster when you are done#

To avoid incurring costs for the lustre file system and the head node, it is best to delete the cluster after you have copied the output data to the S3 Bucket.

If you are logged into the ParallelCluster, use the following command:

exit

Delete Cluster#

Run the following command on your local computer.

pcluster delete-cluster --region=us-east-1 --cluster-name cmaq

Verify that the cluster was deleted#

pcluster describe-cluster --region=us-east-1 --cluster-name cmaq

Output:

"lastUpdatedTime": "2022-02-25T20:17:19.263Z",
  "region": "us-east-1",
  "clusterStatus": "DELETE_IN_PROGRESS"

Re-run the describe-cluster command until you see the following output, which confirms that the cluster has been deleted:

pcluster describe-cluster --region=us-east-1 --cluster-name cmaq

Output:

pcluster describe-cluster --region=us-east-1 --cluster-name cmaq
{
  "message": "Cluster 'cmaq' does not exist or belongs to an incompatible ParallelCluster major version."
}

Additional Resources#

For a tutorial that explains cloud terminology as well as how to obtain single EC2 instances for running GEOS-Chem on a single node, please see the Beginner Tutorial provided by GEOS-Chem, as well as the resources in this chapter.

FAQ#

Q. Can you update a cluster with a Snapshot ID, i.e., update a cluster to use the /shared/build pre-installed software?

A. No. An existing cluster cannot be updated with a Snapshot ID; the solution is to delete the cluster and re-create it. See the error below:

pcluster update-cluster --region us-east-1 --cluster-name cmaq --cluster-configuration c5n-18xlarge.ebs_unencrypted.fsx_import.yaml

Output:


{
  "message": "Update failure",
  "updateValidationErrors": [
    {
      "parameter": "SharedStorage[ebs-shared].EbsSettings.SnapshotId",
      "requestedValue": "snap-065979e115804972e",
      "message": "Update actions are not currently supported for the 'SnapshotId' parameter. Remove the parameter 'SnapshotId'. If you need this change, please consider creating a new cluster instead of updating the existing one."
    }
  ],
  "changeSet": [
    {
      "parameter": "SharedStorage[ebs-shared].EbsSettings.SnapshotId",
      "requestedValue": "snap-065979e115804972e"
    }
  ]
}

Q. How do you figure out why a job isn’t successfully running in the slurm queue?

A. Check the logs described at the following link:

Pcluster Troubleshooting

vi /var/log/parallelcluster/slurm_resume.log

Output:

2022-03-23 21:04:23,600 - [slurm_plugin.instance_manager:_launch_ec2_instances] - ERROR - Failed RunInstances request: 0c6422af-c300-4fe6-b942-2b7923f7b362
2022-03-23 21:04:23,600 - [slurm_plugin.instance_manager:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x3) ['queue1-dy-compute-resource-1-4', 'queue1-dy-compute-resource-1-5', 'queue1-dy-compute-resource-1-6']: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 1): We currently do not have sufficient c5n.18xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get c5n.18xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.
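
Other head-node logs are often useful when diagnosing failed node launches. The paths below follow the AWS ParallelCluster troubleshooting guide; exact file names can vary by ParallelCluster version.

less /var/log/parallelcluster/clustermgtd
less /var/log/parallelcluster/slurm_resume.log
less /var/log/slurmctld.log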

Q. How do I determine what node(s) the job is running on?

A. echo $SLURM_JOB_NODELIST

Slurm Environment Variables
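
If the job has already started, the assigned nodes can also be checked from the head node with standard Slurm commands; a minimal sketch is shown below, where <jobid> is the ID reported by sbatch.

# List your jobs; the %R field shows the node list (or the pending reason)
squeue -u $USER -o "%.10i %.20j %.8T %.10M %R"
# Show full details for a single job, including its NodeList
scontrol show job <jobid>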

Q. I see other tutorials that use a configure file instead of a yaml file to create the cluster. Can I use this instead?

A. No, you must convert the text-based config file to a yaml file to use it with the ParallelCluster CLI v3.+ used in this tutorial. An example of this type of tutorial is Fire Dynamics Simulation CFD workflow using AWS ParallelCluster, Elastic Fabric Adapter, Amazon FSx for Lustre and NICE DCV (https://aws.amazon.com/blogs/compute/fire-dynamics-simulation-cfd-workflow-using-aws-parallelcluster-elastic-fabric-adapter-amazon-fsx-for-lustre-and-nice-dcv/). You can try the v2-to-v3 converter; see more: moving from v2 to v3

Q. If I find an issue, or need help with this CMAQ ParallelCluster Tutorial what do I do?

A. Please file an issue on GitHub.

Submit Github Issue for help with documentation

Please indicate the issue you are having, and include a link to the Read the Docs section that you are referring to. The tutorial documentation has an edit icon in the upper right corner of each page. You can click on that, and GitHub will fork the repo and allow you to edit the page. After you have made the edits, you can submit a pull request, and then include the link to the pull request in the GitHub issue.

Free Training#

Numerical Weather Prediction HPC Workshop

AWS Free Training Courses

Another workshop to learn the AWS CLI 3.0#

Workshop on learning AWS CLI 3.0

Youtube video#

Youtube video on AWS CLI 3.0

Intro to AWS for HPC People - HPC Tech Shorts#

Intro to AWS for HPC People - Tech Short Foundations Level 1

Benchmarking#

Benchmarks optimized for HPC high memory

AWS Graviton WRF Performance Comparison

Deep Dive hpc7g

AWS EPYC and WRF Performance Comparison

Help Resources for CMAQ#

  1. CMAS Center Forum

  2. EPA CMAQ Website

  3. UNC CMAS Center Website

Computing on the Cloud References#

WRF Cloud Computing Paper

AWS High Performance Computing (HPC) Lens for the AWS Well-Architected Framework#

AWS High Performance Computing (HPC) Lens for the AWS Well-Architected Framework

HPC on AWS - WRF (uses CfnCluster, an older version of ParallelCluster)#

HPC on AWS

WRF on Parallel Cluster#

A Scientist Guide to Cloud-HPC: Example with AWS ParallelCluster, Slurm, Spack, and WRF

Advancing Large Scale Weather and Climate Modeling Data in the Cloud#

AWS and Intel Research Webinar Series: Advancing the large scale weather and climate modeling data in the cloud

AWS Well-Architected Framework#

AWS Well-Architected Framework

Cost Comparison: on-premises and cloud#

WRF Performance on Google Cloud

Comparing on-premise and cloud costs for HPC (https://journal.fluidnumerics.com/comparing-on-premise-and-cloud-costs-for-high-performance-computing)

Resources from AWS for diagnosing issues with running the Parallel Cluster#

  1. Github for AWS Parallel Cluster

  2. User Guide

  3. Getting Started Guide

  4. Guide to obtaining AWS Key Pair

  5. Lustre FAQ

  6. Parallel Cluster FAQ (somewhat outdated)

  7. Tool to convert v2 config files to v3 yaml files for Parallel Cluster

  8. Instructions for creating a fault-tolerant ParallelCluster using the lustre filesystem

  9. AWS HPC discussion forum

Issues#

For AWS ParallelCluster you can create a GitHub issue for feedback or issues: Github Issues. There is also an active community-driven Q&A site that may be helpful: AWS re:Post, a community-driven Q&A site.

Tips for managing the ParallelCluster#

  1. The head node can be stopped from the AWS Console after stopping the compute nodes of the cluster, as long as it is restarted before issuing the command to restart the cluster.

  2. The pcluster slurm queue system will create and delete the compute nodes, so that helps reduce manual cleanup for the cluster.

  3. The compute nodes are terminated after they have been idle for a period of time. The yaml setting used for this is SlurmSettings: ScaledownIdletime: 5 (see the yaml excerpt after this list).

  4. The default idle time is 10 minutes, and it can be reduced by specifying a shorter idle time in the yaml file. It is important to verify that the compute nodes are deleted after a job is finished, to avoid incurring unexpected costs.

  5. Copy/back up the outputs and logs to an S3 bucket for follow-up analysis.

  6. After copying output and log files to the S3 bucket, the cluster can be deleted.

  7. Once the pcluster is deleted, all of the volumes, the head node, and the compute nodes will be terminated, and costs will only be incurred for the S3 bucket storage.
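
As a sketch, the ScaledownIdletime setting mentioned in tip 3 lives under Scheduling > SlurmSettings in the ParallelCluster v3 yaml configuration. Only the relevant excerpt is shown here, not a complete configuration file.

# Excerpt of a ParallelCluster v3 yaml configuration (sketch)
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 5   # minutes a compute node may sit idle before it is terminated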

Instructions on how to create Parallel Cluster Amazon Machine Image (AMI) from the command line#

Tutorial How-to Create AMI from Command Line

We also need to have additional protections in place if we make these AMIs public.

Building Shared AMIs

Securing Access to AMIs for AWS Marketplace

Building Pcluster from Existing AMI

ParallelCluster Update#

  1. Not all settings in the yaml file can be updated.

  2. It is important to know what the update policy is for each setting.

Example Update policy:

If this setting is changed, the update is not allowed. After changing this setting, the cluster can’t be updated. Either the change must be reverted or the cluster must be deleted (using pcluster delete-cluster), and then a new cluster created (using pcluster create-cluster) in the old cluster’s place.

see more information

ParallelCluster Update Policy

Use Elastic Fabric Adapter/Elastic Network Adapter for better performance#

“In order to make the most of the available network bandwidth, you need to be using the latest Elastic Network Adapter (ENA) drivers (available in the latest Amazon Linux, Red Hat 7.6, and Ubuntu AMIs, and in the upstream Linux kernel) and you need to make use of multiple traffic flows. Flows within a Placement Group can reach 10 Gbps; the rest can reach 5 Gbps. When using multiple flows on the high-end instances, you can transfer 100 Gbps between EC2 instances in the same region (within or across AZs), S3 buckets, and AWS services such as Amazon Relational Database Service (RDS), Amazon ElastiCache, and Amazon EMR.”

The above was quoted from the following link:

C5n Instances

Elastic Fabric Adapter for HPC systems

“EFA is currently available on c5n.18xlarge, c5n.9xlarge, c5n.metal, i3en.24xlarge, i3en.metal, inf1.24xlarge, m5dn.24xlarge, m5n.24xlarge, r5dn.24xlarge, r5n.24xlarge, p3dn.24xlarge, p4d, m6i.32xlarge, m6i.metal, c6i.32xlarge, c6i.metal, r6i.32xlarge, and r6i.metal instances.”

What are the differences between an EFA ENI and an ENA ENI?

“An ENA ENI provides traditional IP networking features necessary to support VPC networking. An EFA ENI provides all the functionality of an ENA ENI, plus hardware support for applications to communicate directly with the EFA ENI without involving the instance kernel (OS-bypass communication) using an extended programming interface. Due to the advanced capabilities of the EFA ENI, EFA ENIs can only be attached at launch or to stopped instances.”

Q: What are the pre-requisites to enabling EFA on an instance?

“EFA support can be enabled either at the launch of the instance or added to a stopped instance. EFA devices cannot be attached to a running instance.”

Elastic Fabric Adapter for Tightly Coupled Workloads

Quoted from the above link.

“An EFA can still handle IP traffic, but also supports an important access model commonly called OS bypass. This model allows the application (most commonly through some user-space middleware) access the network interface without having to get the operating system involved with each message. Doing so reduces overhead and allows the application to run more efficiently. Here’s what this looks like (source):”

“The MPI Implementation and libfabric layers of this cake play crucial roles:”

“MPI – Short for Message Passing Interface, MPI is a long-established communication protocol that is designed to support parallel programming. It provides functions that allow processes running on a tightly-coupled set of computers to communicate in a language-independent way.”

“libfabric – This library fits in between several different types of network fabric providers (including EFA) and higher-level libraries such as MPI. EFA supports the standard RDM (reliable datagram) and DGRM (unreliable datagram) endpoint types; to learn more, check out the libfabric Programmer’s Manual. EFA also supports a new protocol that we call Scalable Reliable Datagram; this protocol was designed to work within the AWS network and is implemented as part of our Nitro chip.”

“Working together, these two layers (and others that can be slotted in instead of MPI), allow you to bring your existing HPC code to AWS and run it with little or no change.”

“You can use EFA today on c5n.18xlarge and p3dn.24xlarge instances in all AWS regions where those instances are available. The instances can use EFA to communicate within a VPC subnet, and the security group must have ingress and egress rules that allow all traffic within the security group to flow. Each instance can have a single EFA, which can be attached when an instance is started or while it is stopped.”

“You will also need the following software components:”

“EFA Kernel Module – The EFA Driver is in the Amazon GitHub repo; read Getting Started with EFA to learn how to create an EFA-enabled AMI for Amazon Linux, Amazon Linux 2, and other popular Linux distributions.”

“Libfabric Network Stack – You will need to use an AWS-custom version for now; again, the Getting Started document contains installation information. We are working to get our changes into the next release (1.8) of libfabric.”

Note: the ParallelCluster deployment takes care of setting this up for you.
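
To confirm that EFA is actually available on a compute node, you can query the libfabric providers. The check below assumes the EFA software stack that ParallelCluster installs; it should list the efa provider when EFA is enabled.

# Run on a compute node that was launched with EFA enabled
fi_info -p efa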

VPC Management#

There is a limit on the number of VPCs that are allowed per account in each region; the default limit is 5.

What is the difference between a private and a public VPC? (What setting is used in the yaml file, and why is one preferred over the other?)

Note, there is a default VPC, that is used to create EC2 instances, that should not be deleted.

Q1. is there a separate default VPC for each region?

Q2. Each time you run a configure cluster command, does the ParallelCluster create a new VPC?

Q3. Why don’t the VPC and subnet IDs get deleted when the ParallelClusters are deleted?

Deleting VPCs#

If pcluster configure created a new VPC, you can delete that VPC by deleting the AWS CloudFormation stack it created. The name will start with “parallelclusternetworking-” and contain the creation time in a “YYYYMMDDHHMMSS” format. You can list the stacks using the list-stacks command. The following instructions are available here:

Instructions for Cleaning Up VPCs

$ aws --region us-east-2 cloudformation list-stacks \
   --stack-status-filter "CREATE_COMPLETE" \
   --query "StackSummaries[].StackName" | \
   grep -e "parallelclusternetworking-"
"parallelclusternetworking-pubpriv-20191029205804"

The stack can be deleted using the delete-stack command.

$ aws --region us-east-2 cloudformation delete-stack \
   --stack-name parallelclusternetworking-pubpriv-20191029205804


Pcluster Configure

Note: I can see why you wouldn’t want to delete the VPC, if you want to reuse the yaml file that contains the SubnetID that is tied to that VPC.

I was able to use the Amazon Website to find the SubnetID, and then identify the VPC that it is part of.
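
The same lookup can be done with the AWS CLI; the sketch below uses the subnet ID from the yaml file and reports the VPC it belongs to.

aws ec2 describe-subnets --subnet-ids subnet-018cfea3edf3c4765 \
    --query "Subnets[].VpcId" --output text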

I currently have the following VPCs

| Name | VPC ID | State | IPv4 CIDR | IPv6 CIDR (Network border group) | IPv6 pool | DHCP options set | Main route table | Main network ACL | Tenancy | Default VPC | Owner ID |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ParallelClusterVPC-20211210200003 | vpc-0445c3fa089b004d8 | Available | 10.0.0.0/16 | - | - | dopt-eaeaf888 | rtb-048c503f3e6b9acc3 | acl-0fecfa7ff42e04ead | Default | No | xxxx |
| ParallelClusterVPC-20211021183813 | vpc-00e3f4e34aaf80f06 | Available | 10.0.0.0/16 | - | - | dopt-eaeaf888 | rtb-0a5b7ac9873486bcb | acl-0852d06b1170db68c | Default | No | xxxx |
| - | vpc-3cfc5759 | Available | 172.31.0.0/16 | - | - | dopt-eaeaf888 | rtb-99cd64fc | acl-bb9b39de | Default | Yes | 440858712842 |
| ParallelClusterVPC-20210419174552 | vpc-0ab948b66554c71ea | Available | 10.0.0.0/16 | - | - | dopt-eaeaf888 | rtb-03fd47f05eac5379f | acl-079fe1be7ff972858 | Default | No | xxxx |
| ParallelClusterVPC-20211021174405 | vpc-0f34a572da1515e49 | Available | 10.0.0.0/16 | - | - | dopt-eaeaf888 | rtb-0b6310d9ea70a699e | acl-01fa1529b65545e91 | Default | No | xxxx |

This is the subnet id that I am currently using in the yaml files: subnet-018cfea3edf3c4765

I currently have 11 subnet IDs - how many are no longer being used?

| Field | Value |
|---|---|
| Name | parallelcluster:public-subnet |
| Subnet ID | subnet-018cfea3edf3c4765 |
| State | Available |
| VPC | vpc-0445c3fa089b004d8 (ParallelClusterVPC-20211210200003) |
| IPv4 CIDR | 10.0.0.0/20 |
| IPv6 CIDR | - |
| Available IPv4 addresses | 4091 |
| Availability Zone | us-east-1a |
| Availability Zone ID | use1-az6 |
| Network border group | us-east-1 |
| Route table | rtb-034bcab9e4b8c4023 (parallelcluster:route-table-public) |
| Network ACL | acl-0fecfa7ff42e04ead |
| Default subnet | No |
| Auto-assign public IPv4 address | Yes |
| Auto-assign customer-owned IPv4 address | No |
| Customer-owned IPv4 pool | - |
| Auto-assign IPv6 address | No |
| Owner ID | xx |

Using Cost Allocation Tags with ParallelCluster#

This blog post uses the v2 command line: Using Cost Allocation Tags

These instructions need to be updated for the ParallelCluster v3 CLI, which uses yaml configuration files.

Future Work#


AWS ParallelCluster

  • Create yaml and software install scripts for intel compiler

  • Benchmark 2 day case using intel compiler version of CMAQ and compare to GCC timings

  • Repeat Benchmark Runs using c6gn.16xlarge compute nodes (AWS Graviton) and compare to Azure CycleCloud HBv3 compute nodes.

  • Create script for installing all software and R packages as a custom bootstrap as the ParallelCluster is created.

  • Create method to automatically checkpoint and save a job prior to it being bumped from the schedule if running on spot instances.

  • Set up an additional slurm queue that uses a smaller compute node to do the post-processing and learn how to submit the post processing jobs to this queue, rather than running them on the head node.

  • Install software using SPACK

  • Install netCDF-4 compressed version of I/O API Library and set up environment module to compile and run CMAQ for 2018_12US1 data that is nc4 compressed

Documentation

  • Create instructions on how to create a ParallelCluster using encrypted ebs volume and snapshot.

Contribute to this Tutorial#

The community is encouraged to contribute to this documentation. It is open source, created by the CMAS Center, under contract to EPA, for the benefit of the CMAS Community.

Contribute to Pcluster-cmaq Documentation#

Please take note of any issues and submit them to Github Issue

Note

At the top of each page of the documentation, there is also a pencil icon that you can click. It will create a fork of the project on your GitHub account so that you can make edits and then submit a pull request.

Figure with Pencil: Edit this Page Icon

If you are able to create a pull request, please include the following in your issue:

  • pull request number

If you are not able to create a pull request, please include the following in your issue:

  • section number

  • description of the issue encountered

  • recommended fix, if available