CMAQv5.3.3 on AWS Tutorials (Single VM and ParallelCluster)#
Scripts and code to run CMAQ on a Single Virtual Machine or a ParallelCluster (multiple VMs).
To obtain this code use the following command.#
git clone -b CMAQv5.3.3 https://github.com/CMASCenter/pcluster-cmaq pcluster-cmaq-533
Warning
This documentation is under continuous development; the latest version is available here: CMAQ on AWS Tutorials Latest Version
Overview#
This document provides tutorials and information on how users can create High Performance Computing environments (a Single Virtual Machine (VM) or a ParallelCluster) on Amazon Web Services (AWS) using the AWS Command Line Interface. The tutorials are aimed at users with cloud computing experience who are already familiar with AWS. For those with no cloud computing experience we recommend reviewing the Additional Resources listed in chapter 16 of this document.
Format of this documentation#
This document provides several hands-on tutorials that are designed to be read in order.
The Introductory Tutorial will walk you through creating a demo ParallelCluster. You will learn how to set up your AWS Identity and Access Management Roles, configure and create a demo cluster, and exit and delete the cluster.
Single VM Tutorials#
The Single VM Intermediate Tutorial will show you how to create a single virtual machine using an AMI that has the software and data pre-loaded, and gives instructions for creating the virtual machine using ec2 instances that have different numbers of cores and are matched to the benchmark domain. The Single VM Advanced Tutorial will show you how to install the CMAQv5.3.3 software and libraries, and how to create custom environment modules.
Parallel Cluster Tutorials#
The CMAQv5.3.3 Parallel Cluster Intermediate chapter will show you how to run CMAQv5.3.3 using the 12US2 benchmark. The CMAQv5.3.3 Advanced Tutorial explains how to scale the ParallelCluster for larger compute jobs and install CMAQv5.3.3 and required libraries from scratch on the cloud. The Chapter “Benchmark on HPC6a-48xlarge with EBS and Lustre” uses CMAQv5.3.3 on advanced HPC6a compute nodes that are only available in the us-east-2 region.
The remaining sections provide instructions on post-processing CMAQ output, comparing output and runtimes from multiple simulations, and copying output from ParallelCluster to an AWS Simple Storage Service (S3) bucket.
Why might I need to use ParallelCluster?#
The AWS ParallelCluster may be configured to be the equivalent of a High Performance Computing (HPC) environment, including using job schedulers such as Slurm, running on multiple nodes using code compiled with Message Passing Interface (MPI), and reading and writing output to a high performance, low latency shared disk. The advantage of using the AWS ParallelCluster command line interface is that the compute nodes can be easily scaled up or down to match the compute requirements of a given simulation. In addition, the user can reduce costs by using Spot instances rather than On-Demand for the compute nodes. ParallelCluster also supports submitting multiple jobs to the job submission queue.
Our goal is to make this user guide to running CMAQ on a ParallelCluster as helpful and user-friendly as possible. Any feedback is both welcome and appreciated.
Additional information on AWS ParallelCluster:
AWS ParallelCluster documentation
AWS ParallelCluster training video
System Requirements#
Description of the head node and compute nodes used for the ParallelCluster
Configurations for running CMAQv5.3.3 on AWS ParallelCluster#
Recommend that users set up a spending alarm using AWS#
Configure the alarm to receive an email alert if you exceed $100 per month (or whatever monthly spending limit you need).
See also
See the AWS Tutorial on setting up an alarm for AWS Free Tier. AWS Free Tier Budgets
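A monthly cost budget with an email alert can also be created from the AWS CLI. The following is a minimal sketch; the account ID, budget name, threshold, and email address are placeholders that you must replace with your own values.
# Sketch: create a $100/month cost budget that emails an alert at 80% of the limit
aws budgets create-budget \
  --account-id 111122223333 \
  --budget '{"BudgetName":"cmaq-monthly-budget","BudgetLimit":{"Amount":"100","Unit":"USD"},"BudgetType":"COST","TimeUnit":"MONTHLY"}' \
  --notifications-with-subscribers '[{"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80,"ThresholdType":"PERCENTAGE"},"Subscribers":[{"SubscriptionType":"EMAIL","Address":"you@example.com"}]}]'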
Software Requirements for CMAQ on AWS ParallelCluster#
Tier 1: Native OS and associated system libraries, compilers
Operating System: Ubuntu2004
Tcsh shell
Git
Compilers (C, C++, and Fortran) - GNU compilers version ≥ 8.3
MPI (Message Passing Interface) - OpenMPI ≥ 4.0
Slurm Scheduler
Tier 2: additional libraries required for installing CMAQ
NetCDF (with C, C++, and Fortran support)
I/O API
R Software and packages
Tier 3: Software distributed through the CMAS Center
CMAQv533
CMAQv533 Post Processors
Tier 4: R packages and Scripts
R QA Scripts
Software on Local Computer
AWS ParallelCluster CLI v3.0 installed in a virtual environment
pcluster is the primary AWS ParallelCluster CLI command. You use pcluster to launch and manage HPC clusters in the AWS Cloud and to create and manage custom AMI images.
run-instances is another AWS CLI command, used to create a single virtual machine to run CMAQ as described in chapter 6.
Edit YAML configuration files using vi, nedit, or another editor (YAML does not accept tabs for indentation)
Git
Mac - XQuartz for X11 Display
Windows - MobaXterm - to connect to ParallelCluster IP address
AWS ParallelCluster CLI v3.0 Region Availability#
Note
The scripts in this tutorial use the us-east-1 region, but the scripts can be modified to use any of the supported regions listed in the url below. CLI v3 Supported Regions
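For example, you can check or change the default region used by the AWS CLI commands in these tutorials (us-east-2 is shown only as an illustration; the run-instances and pcluster commands also accept an explicit --region flag):
# show the region currently configured for the AWS CLI
aws configure get region
# change the default region
aws configure set region us-east-2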
CONUS 12US2 Domain Description#
GRIDDESC
'12US2'
'12CONUS' -2412000.0 -1620000.0 12000.0 12000.0 396 246 1
Single VM Configuration for CMAQv5.3.2_Benchmark_2Day_Input.tar.gz Benchmark#
c6a.2xlarge
ParallelCluster Configuration for 12US2 Benchmark Domain#
Note
It is recommended to use a head node that is in the same family as the compute nodes so that the compiler options and executable are optimized for that processor type.
Recommended configuration of the ParallelCluster HPC head node and compute nodes to run the CMAQ CONUS benchmark for two days:
Head node:
c5n.large
or
c6a.xlarge
(note that head node should match the processor family of the compute nodes)
Compute Node:
c5n.9xlarge (16 cpus/node with Multithreading disabled) with 96 GiB memory, 50 Gbps Network Bandwidth, 9,500 EBS Bandwidth (Mbps) and Elastic Fabric Adapter (EFA)
or
c5n.18xlarge (36 cpus/node with Multithreading disabled) with 192 GiB memory, 100 Gbps Network Bandwidth, 19,000 EBS Bandwidth (Mbps) and Elastic Fabric Adapter (EFA)
or
c6a.48xlarge (96 cpus/node with Multithreading disabled) with 384 GiB memory, 50 Gigabit Network Bandwidth, 40 EBS Bandwidth (Gbps), Elastic Fabric Adapter (EFA) and Nitro Hypervisor
or
hpc6a.48xlarge (96 cpus/node) only available in us-east-2 region with 384 GiB memory, using two 48-core 3rd generation AMD EPYC 7003 series processors built on a 7nm process node for increased efficiency with a total of 96 cores (4 GiB of memory per core), Elastic Fabric Adapter (EFA) and Nitro Hypervisor (lower cost than c6a.48xlarge)
Note
CMAQ is developed using OpenMPI and can take advantage of increasing the number of CPUs and memory. ParallelCluster provides a ready-made auto scaling solution.
Note
An additional best practice is to allow ParallelCluster to create a placement group. Network Performance Placement Groups
This is specified in the yaml file in the slurm queue’s network settings.
Networking:
  PlacementGroup:
    Enabled: true
Note
To provide the lowest latency and the highest packet-per-second network performance for your placement group, choose an instance type that supports enhanced networking. For more information, see Enhanced Networking. Enhanced Networking (ENA)
To measure the network performance, you can use iPerf to measure network bandwidth.
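A minimal sketch of such a test, assuming iperf3 is installed on two instances in the same placement group and that 10.0.0.10 is the private IP address of the instance acting as the server:
# install iperf3 on Ubuntu if it is not already available
sudo apt-get install -y iperf3
# on the server instance
iperf3 -s
# on the client instance: 8 parallel streams for 30 seconds
iperf3 -c 10.0.0.10 -P 8 -t 30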
Note
Elastic Fabric Adapter(EFA) “EFA provides lower and more consistent latency and higher throughput than the TCP transport traditionally used in cloud-based HPC systems. It enhances the performance of inter-instance communication that is critical for scaling HPC and machine learning applications. It is optimized to work on the existing AWS network infrastructure and it can scale depending on application requirements.” “An EFA is an Elastic Network Adapter (ENA) with added capabilities. It provides all of the functionality of an ENA, with an additional OS-bypass functionality. OS-bypass is an access model that allows HPC and machine learning applications to communicate directly with the network interface hardware to provide low-latency, reliable transport functionality.” Elastic Fabric Adapter(EFA)
Note
Nitro Hypervisor “AWS Nitro System is composed of three main components: Nitro cards, the Nitro security chip, and the Nitro hypervisor. Nitro cards provide controllers for the VPC data plane (network access), Amazon Elastic Block Store (Amazon EBS) access, instance storage (local NVMe), as well as overall coordination for the host. By offloading these capabilities to the Nitro cards, this removes the need to use host processor resources to implement these functions, as well as offering security benefits. “ Bare metal performance with the Nitro Hypervisor
Importing data from S3 Bucket to Lustre
Justification for importing data from an S3 bucket into the Lustre file system, rather than using an Elastic Block Store file system and copying the data from the S3 bucket, for the input and output data storage volume on the cluster:
Saves storage cost
Removes need to copy data from S3 bucket to Lustre file system. FSx for Lustre integrates natively with Amazon S3, making it easy for you to process HPC data sets stored in Amazon S3
Simplifies running HPC workloads on AWS
Amazon FSx for Lustre uses parallel data transfer techniques to transfer data to and from S3 at up to hundreds of GB/s.
Note
To find the default settings for Lustre see: Lustre Settings for ParallelCluster
Figure 1. AWS Recommended ParallelCluster Configuration (Number of compute nodes depends on setting for NPCOLxNPROW and #SBATCH --nodes=XX #SBATCH --ntasks-per-node=YY )
Create Single VM and run CMAQv5.3.3 (software pre-installed)#
Creating an EC2 instance either from the AWS Web Interface or Command Line is easy to do. In this tutorial we will give examples on how to create and run using ec2 instances that vary in size depending on the size of the CMAQ benchmarks.
Use AWS Management Console to Create Single VM and run CMAQv5.3.3 (software pre-installed)#
Creating an EC2 instance from the AWS Management Console is easy to do. In this tutorial we will give examples on how to create and run using ec2 instances that vary in size depending on the size of the CMAQ benchmarks.
Launch an EC2 Instance using the AWS Management Console SPOT Pricing
| Benchmark Name | Grid Domain | EC2 Instance | vCPU | Cores | Memory | Network Performance | Storage (EBS Only) | On Demand Hourly Cost ($) | Spot Hourly Cost ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2016_12SE1 | (100x80x35) | c6a.2xlarge | 8 | 4 | 16 GiB | Up to 12500 Megabit | gp3 | 0.306 | 0.2879 |
Data in table above is from the following: Sizing and Price Calculator from AWS
Run CMAQv5.3.3 on a single Virtual Machine (VM) using an AMI with software pre-loaded, running on a c6a.2xlarge instance with a gp3 filesystem.
Learn how to Use the AWS Management Console to launch EC2 instance using Public AMI#
Public AMI contains the software and data to run 2016_12SE1 using CMAQv5.3.3#
Software was pre-installed and saved to a public ami.
The input data was also transferred from the AWS Open Data Program and installed on the EBS volume.
This chapter describes the process used in the AWS Web interface to configure and create a c6a.2xlarge ec2 instance using a public AMI, with additional instructions for using ssh to log in and run CMAQ for the 2016_12SE1 domain.
Login to the AWS Console and select EC2#
Enter the AMI name ami-019eb54acc4924d3f in the Search box and press Enter.#
Click on the Community AMI tab and then click on the orange “Select” button
Note this AMI was built for the following architecture, and can be used by the c6a - hpc6a family of instances#
Canonical, Ubuntu, 22.04 LTS, amd64 jammy image build on 2023-05-16
Search for c6a.2xlarge Instance Type and select#
Select key pair name or create a new key pair#
Use the default Network Settings#
Configure Storage#
The AMI is preconfigured to use 500 GiB of gp3 as the root volume (Not encrypted)
Select the Pull-down options for Advanced details#
Scroll down until you see option to Specify CPU cores
Click the checkbox for “Specify CPU cores”
Then select 4 Cores, and 1 thread per core
Click on the link to the instance once it is successfully launched#
Wait until the Status check has been completed and the Instance State is running#
Click on the instance link and copy the Public IP address to your clipboard#
Use the ssh command to login to the c6a.2xlarge instance#
ssh -v -Y -i ~/downloads/your-pem.pem ubuntu@xx.xxx.xxx.xxx
Run CMAQv5.3.3 on c6a.2xlarge#
Login to the ec2 instance#
Note, the following command must be modified to specify your key, and ip address (obtained from the previous command):
ssh -v -Y -i ~/downloads/your-pem.pem ubuntu@ip.address
Login to the ec2 instance again, so that you have two windows logged into the machine.#
ssh -Y -i ~/downloads/your-pem.pem ubuntu@your-ip-address
Load the environment modules#
module avail
module load ioapi-3.2/gcc-11.3.0-netcdf mpi/openmpi-4.1.2 netcdf-4.8.1/gcc-11.3
Update the pcluster-cmaq repo using git#
cd /shared/pcluster-cmaq
git pull
Verify that the input data is available#
Input Data for the 2016_12SE1 Benchmark
ls -lrt /shared/data/CMAQv5.3.2_Benchmark_2Day_Input/2016_12SE1/*
Run CMAQv5.3.3 for 2016_12SE1 1 Day benchmark Case on 4 pe#
' '
'LamCon_40N_97W'
2 33.000 45.000 -97.000 -97.000 40.000
' '
'SE52BENCH'
'LamCon_40N_97W' 792000.000 -1080000.000 12000.000 12000.000 100 80 1
'2016_12SE1'
'LamCon_40N_97W' 792000.000 -1080000.000 12000.000 12000.000 100 80 1
Use command line to submit the job. This single virtual machine does not have a job scheduler such as slurm installed.#
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts
./run_cctm_Bench_2016_12SE1.csh |& tee run_cctm_Bench_2016_12SE1.log
Use HTOP to view performance.#
htop
output
If the ec2 instance was created without specifying 1 thread per core in the Advanced Settings, then it will have 8 vcpus.
Successful output using the gp3 volume with hyperthreading on (8vcpus)#
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2016-07-01
End Day: 2016-07-01
Number of Simulation Days: 1
Domain Name: 2016_12SE1
Number of Grid Cells: 280000 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 4
All times are in seconds.
Num Day Wall Time
01 2016-07-01 2083.32
Total Time = 2083.32
Avg. Time = 2083.32
Use lscpu to examine the CPU layout of the c6a.2xlarge ec2 instance; when created with hyperthreading on, it reports 8 vCPUs (4 cores per socket with 2 threads per core).#
lscpu
Output:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R13 Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 1
BogoMIPS: 5300.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 2 MiB (4 instances)
L3: 16 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
Save output data and run script logs#
Copy the log files and the output data to an s3 bucket.
cd /shared/pcluster-cmaq/s3_scripts
cat s3_upload_cmaqv533.c6a.2xlarge.csh
Output
#!/bin/csh -f
# Script to upload output data to S3 bucket
# need to set up your AWS credentials prior to running this script
# aws configure
# NOTE: need permission to create a bucket and write to an s3 bucket.
#
mkdir -p /shared/data/output_CCTM_v533_gcc_Bench_2016_12SE1/logs
mkdir -p /shared/data/output_CCTM_v533_gcc_Bench_2016_12SE1/scripts
cp /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/*.log /shared/data/output_CCTM_v533_gcc_Bench_2016_12SE1/logs/
cp /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctm_Bench_2016_12SE1.csh /shared/data/output_CCTM_v533_gcc_Bench_2016_12SE1/scripts/
setenv BUCKET c6a.2xlarge.cmaqv533
aws s3 mb s3://$BUCKET
aws s3 cp --recursive /shared/data/output_CCTM_v533_gcc_Bench_2016_12SE1 s3://$BUCKET
Set your aws credentials by running the command
aws configure
Edit the script to create a unique bucket name
Run the script
./s3_upload_cmaqv533.c6a.2xlarge.csh
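To confirm that the upload completed, you can list the bucket contents; this sketch assumes the bucket name set in the script above (c6a.2xlarge.cmaqv533), so substitute your own unique bucket name.
aws s3 ls s3://c6a.2xlarge.cmaqv533 --recursive --human-readable --summarize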
or
Save the full input data, run scripts, output data and logs to an AMI that is owned by your account.#
Go to the EC2 Dashboard#
Click on Instances Running#
Select the checkbox next to the c6a.2xlarge instance name
Fill out the name of the image#
Name the instance to help identify the ec2 instance type, CMAQ version installed, and perhaps the input/output data available
Click Save Image#
Wait until the image status is Available before terminating the ec2 instance
Stop Instance#
Go to the EC2 Dashboard#
Click on Instances Running#
Select the checkbox next to the c6a.2xlarge instance name
CMAQv5.3.3 on Single Virtual Machine Intermediate (software pre-installed)#
Creating an EC2 instance from the Command Line is easy to do. In this tutorial we will give examples on how to create and run using ec2 instances that vary in size depending on the size of the CMAQ benchmarks.
Using Amazon EC2 with the AWS CLI SPOT Pricing
| Benchmark Name | Grid Domain | EC2 Instance | vCPU | Cores | Memory | Network Performance | Storage | On Demand Hourly Cost ($) | Spot Hourly Cost ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2016_12SE1 | (100x80x35) | c6a.2xlarge | 8 | 4 | 16 GiB | Up to 12500 Megabit | EBS Only | 0.306 | 0.2879 |
Data in table above is from the following: Sizing and Price Calculator from AWS
Run CMAQv5.3.3 on a single Virtual Machine (VM) using an AMI with software pre-loaded, running on either a c6a.2xlarge, c6a.8xlarge, or c6a.48xlarge instance with a gp3 filesystem.
Learn how to Use AWS CLI to launch c6a.2xlarge EC2 instance using Public AMI#
Public AMI contains the software and data to run the 2016_12SE1 benchmark using CMAQv5.3.3#
Software was pre-installed and saved to a public ami.
The input data was also transferred from the AWS Open Data Program and installed on the EBS volume.
This chapter describes the process that was used to test and configure the c6a.2xlarge ec2 instance to run CMAQv5.3.3 for the 2016_12SE1 domain.
Todo: Need to create command line options to copy a public ami to a different region.
Verify that you can see the public AMI on the us-east-1 region.#
aws ec2 describe-images --region us-east-1 --image-id ami-065049c5c78e6c6a5
Output:
{
"Images": [
{
"Architecture": "x86_64",
"CreationDate": "2023-06-24T00:17:02.000Z",
"ImageId": "ami-065049c5c78e6c6a5",
"ImageLocation": "440858712842/cmaqv5.4_c6a.48xlarge.io2.iops.100000",
"ImageType": "machine",
"Public": true,
"OwnerId": "440858712842",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [
{
"DeviceName": "/dev/sda1",
"Ebs": {
"DeleteOnTermination": true,
"Iops": 100000,
"SnapshotId": "snap-08b8608dca836ef2e",
"VolumeSize": 500,
"VolumeType": "io2",
"Encrypted": false
}
},
{
"DeviceName": "/dev/sdb",
"VirtualName": "ephemeral0"
},
{
"DeviceName": "/dev/sdc",
"VirtualName": "ephemeral1"
}
],
"EnaSupport": true,
"Hypervisor": "xen",
"Name": "cmaqv5.4_c6a.48xlarge.io2.iops.100000",
"RootDeviceName": "/dev/sda1",
"RootDeviceType": "ebs",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm",
"DeprecationTime": "2025-06-24T00:17:02.000Z"
}
]
}
Use q to exit out of the command line
Note, the AMI uses the maximum value available on io2 for Iops of 100000.
AWS Resources for the aws cli method to launch ec2 instances.#
Tutorial Launch Spot Instances
(note that the tutorial discourages using run-instances to launch Spot Instances, but it does provide an example method)
Launching EC2 Spot Instances using Run Instances API
Additional resources for spot instance provisioning.
To launch a Spot Instance with RunInstances API you create the configuration file as described below:
cat <<EoF > ./runinstances-config.json
{
"DryRun": false,
"MaxCount": 1,
"MinCount": 1,
"InstanceType": "c6a.2xlarge",
"ImageId": "ami-065049c5c78e6c6a5",
"InstanceMarketOptions": {
"MarketType": "spot"
},
"TagSpecifications": [
{
"ResourceType": "instance",
"Tags": [
{
"Key": "Name",
"Value": "EC2SpotCMAQv54"
}
]
}
]
}
EoF
Use the publicly available AMI to launch an on-demand c6a.2xlarge ec2 instance using an io2 volume with 100000 IOPS and hyperthreading disabled#
Note, we will be using a json file that has been preconfigured to specify the ImageId
Obtain the code using git#
git clone -b main https://github.com/CMASCenter/pcluster-cmaq
cd pcluster-cmaq/json
Note, you will need to obtain a security group id from your IT administrator that allows ssh login access. If this is enabled by default, then you can remove the --security-group-ids launch-wizard-with-tcp-access
Example command: note launch-wizard-with-tcp-access needs to be replaced by your security group ID, and your-pem key needs to be replaced by the name of your-pem.pem key.
aws ec2 run-instances --debug --key-name your-pem --security-group-ids launch-wizard-with-tcp-access --dry-run --region us-east-1 --cli-input-json file://runinstances-config.json
Command that works for UNC’s security group and pem key:
aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --dry-run --ebs-optimized --cpu-options CoreCount=4,ThreadsPerCore=1 --cli-input-json file://runinstances-config.io2.c6a.2xlarge.json
Once you have verified that the command above works with the --dry-run option, rerun it without that option as follows.
aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --ebs-optimized --cpu-options CoreCount=4,ThreadsPerCore=1 --cli-input-json file://runinstances-config.io2.c6a.2xlarge.json
Example of security group inbound and outbound rules required to connect to EC2 instance via ssh.
Additional resources
CLI commands to create Security Group
Use the following command to obtain the public IP address of the machine.#
This command is commented out, as the instance hasn't been created yet; the instructions are kept for documentation purposes.
aws ec2 describe-instances --region=us-east-1 --filters "Name=image-id,Values=ami-065049c5c78e6c6a5" | grep PublicIpAddress
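An alternative form of the same query uses the CLI's --query option to print the instance ID, state, and public IP address in one table (same filter on the AMI ID):
aws ec2 describe-instances --region us-east-1 --filters "Name=image-id,Values=ami-065049c5c78e6c6a5" --query 'Reservations[].Instances[].[InstanceId,State.Name,PublicIpAddress]' --output table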
Login to the ec2 instance#
Note, the following command must be modified to specify your key, and ip address (obtained from the previous command):
ssh -v -Y -i ~/downloads/your-pem.pem ubuntu@ip.address
Login to the ec2 instance again, so that you have two windows logged into the machine.#
ssh -Y -i ~/downloads/your-pem.pem ubuntu@your-ip-address
Load the environment modules#
module avail
module load ioapi-3.2/gcc-11.3.0-netcdf mpi/openmpi-4.1.2 netcdf-4.8.1/gcc-11.3
Update the pcluster-cmaq repo using git#
cd /shared/pcluster-cmaq
git pull
Run CMAQv5.4 for 12US1 Listos Training 3 Day benchmark Case on 4 pe#
Input data is available for a subdomain of the 12km 12US1 case.
GRIDDESC
'2018_12Listos'
'LamCon_40N_97W' 1812000.000 240000.000 12000.000 12000.000 25 25 1
Use command line to submit the job. This single virtual machine does not have a job scheduler such as slurm installed.#
cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts
./run_cctm_2018_12US1_listos.csh |& tee ./run_cctm_2018_12US1_listos.c6a.2xlarge.log
Use HTOP to view performance.#
htop
output
Successful output#
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-08-05
End Day: 2018-08-07
Number of Simulation Days: 3
Domain Name: 2018_12Listos
Number of Grid Cells: 21875 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 4
All times are in seconds.
Num Day Wall Time
01 2018-08-05 166.7
02 2018-08-06 167.0
03 2018-08-07 171.3
Total Time = 505.00
Avg. Time = 168.33
Note, this took longer than the run done using c6a.48xlarge, where 32 cores were used. The c6a.2xlarge also has smaller cache sizes than the c6a.48xlarge, which you can see when you compare output of the lscpu command.
Change to the scripts directory#
cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts/
Use lscpu to confirm that there are 4 cores on the c6a.2xlarge ec2 instance that was created with hyperthreading turned off.#
lscpu
Output:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R13 Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 1
BogoMIPS: 5299.98
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdt
scp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x
2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_s
ingle ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clze
ro xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 2 MiB (4 instances)
L3: 16 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
Edit the 12NE3 Benchmark run script to use the gcc compiler and to output all species to the CONC output file.#
cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts/
vi run_cctm_Bench_2018_12NE3.c6a48xlarge.csh
change
setenv compiler intel
to
setenv compiler gcc
Comment out the CONC_SPCS setting that limits them to only 12 species
# setenv CONC_SPCS "O3 NO ANO3I ANO3J NO2 FORM ISOP NH3 ANH4I ANH4J ASO4I ASO4J"
Change the NPCOL, NPROW to run on 4 cores
@ NPCOL = 2; @ NPROW = 2
Run the 12NE3 Benchmark case#
./run_cctm_Bench_2018_12NE3.c6a.2xlarge.csh |& tee ./run_cctm_Bench_2018_12NE3.c6a.2xlarge.4pe.log
Use HTOP to view performance.#
htop
output
Note, this 12NE3 Domain uses more memory, and takes longer than the 12LISTOS-Training Domain. It also takes longer to run using 4 cores on c6a.2xlarge instance than on 32 cores on c6a.48xlarge instance.
Successful output for 12 species output in the 3-D CONC file took 56 minutes to run 1 day#
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-07-01
End Day: 2018-07-01
Number of Simulation Days: 1
Domain Name: 2018_12NE3
Number of Grid Cells: 367500 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 4
All times are in seconds.
Num Day Wall Time
01 2018-07-01 3410.99
Total Time = 3410.99
Avg. Time = 3410.99
Compared to the timing for running on 32 processors, which took 444.34 seconds, this run is a factor of 7.67 slower, which is close to perfect scaling given that the 32-processor run used 8x as many cores.
Find the InstanceID using the following command on your local machine.#
aws ec2 describe-instances --region=us-east-1 | grep InstanceId
Output
i-xxxx
Stop the instance#
aws ec2 stop-instances --region=us-east-1 --instance-ids i-xxxx
If the instance was launched as a one-time Spot Instance, the stop command returns the following error message.
aws ec2 stop-instances --region=us-east-1 --instance-ids i-041a702cc9f7f7b5d
An error occurred (UnsupportedOperation) when calling the StopInstances operation: You can’t stop the Spot Instance ‘i-041a702cc9f7f7b5d’ because it is associated with a one-time Spot Instance request. You can only stop Spot Instances associated with persistent Spot Instance requests.
Not sure how to make a persistent Spot Instance request; one possible approach is sketched below.
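One possible way to make the Spot request persistent, and therefore stoppable, is to add SpotOptions to the run-instances JSON configuration. This is a sketch that has not been tested in this tutorial, and the file name is only an example:
cat <<EoF > ./runinstances-config.persistent-spot.json
{
  "DryRun": false,
  "MaxCount": 1,
  "MinCount": 1,
  "InstanceType": "c6a.2xlarge",
  "ImageId": "ami-065049c5c78e6c6a5",
  "InstanceMarketOptions": {
    "MarketType": "spot",
    "SpotOptions": {
      "SpotInstanceType": "persistent",
      "InstanceInterruptionBehavior": "stop"
    }
  }
}
EoF
An instance launched with this configuration should then be stoppable with aws ec2 stop-instances and restartable with aws ec2 start-instances, instead of having to be terminated.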
Terminate Instance#
aws ec2 terminate-instances --region=us-east-1 --instance-ids i-xxxx
Verify that the instance is being shut down.#
aws ec2 describe-instances --region=us-east-1
Learn how to Use AWS CLI to launch c6a.8xlarge EC2 instance using Public AMI#
Public AMI contains the software and data to run 2016_12SE1 using CMAQv5.3.3#
Software was pre-installed and saved to a public ami.
The input data was also transferred from the AWS Open Data Program and installed on the EBS volume.
This chapter describes the process that was used to test and configure the c6a.8xlarge ec2 instance to run CMAQv5.3.3 for the 12SE1 domain.
Todo: Need to create command line options to copy a public ami to a different region.
Verify that you can see the public AMI on the us-east-1 region.#
aws ec2 describe-images --region us-east-1 --image-id ami-088f82f334dde0c9f
Output:
{
"Images": [
{
"Architecture": "x86_64",
"CreationDate": "2023-06-26T18:17:08.000Z",
"ImageId": "ami-088f82f334dde0c9f",
"ImageLocation": "440858712842/EC2CMAQv54io2_12LISTOS-training_12NE3_12US1",
"ImageType": "machine",
"Public": true,
"OwnerId": "440858712842",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [
{
"DeviceName": "/dev/sda1",
"Ebs": {
"DeleteOnTermination": true,
"Iops": 100000,
"SnapshotId": "snap-042b05034228ec830",
"VolumeSize": 500,
"VolumeType": "io2",
"Encrypted": false
}
},
{
"DeviceName": "/dev/sdb",
"VirtualName": "ephemeral0"
},
{
"DeviceName": "/dev/sdc",
"VirtualName": "ephemeral1"
}
],
"EnaSupport": true,
"Hypervisor": "xen",
"Name": "EC2CMAQv54io2_12LISTOS-training_12NE3_12US1",
"RootDeviceName": "/dev/sda1",
"RootDeviceType": "ebs",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm",
"DeprecationTime": "2025-06-26T18:17:08.000Z"
}
]
}
Use q to exit out of the command line
Note, the AMI uses the maximum value available on io2 for Iops of 100000.
AWS Resources for the aws cli method to launch ec2 instances.#
Tutorial Launch Spot Instances
(note that the tutorial discourages using run-instances to launch Spot Instances, but it does provide an example method)
Launching EC2 Spot Instances using Run Instances API
Additional resources for spot instance provisioning.
To launch a Spot Instance with RunInstances API you create the configuration file as described below:
cat <<EoF > ./runinstances-config.json
{
"DryRun": false,
"MaxCount": 1,
"MinCount": 1,
"InstanceType": "c6a.8xlarge",
"ImageId": "ami-088f82f334dde0c9f",
"InstanceMarketOptions": {
"MarketType": "spot"
},
"TagSpecifications": [
{
"ResourceType": "instance",
"Tags": [
{
"Key": "Name",
"Value": "EC2SpotCMAQv54"
}
]
}
]
}
EoF
Use the publicly available AMI to launch an on-demand c6a.8xlarge ec2 instance using an io2 volume with 100000 IOPS and hyperthreading disabled#
Note, we will be using a json file that has been preconfigured to specify the ImageId
Obtain the code using git#
git clone -b main https://github.com/CMASCenter/pcluster-cmaq
cd pcluster-cmaq/json
Note, you will need to obtain a security group id from your IT administrator that allows ssh login access. If this is enabled by default, then you can remove the --security-group-ids launch-wizard-with-tcp-access
Example command: note launch-wizard-with-tcp-access needs to be replaced by your security group ID, and your-pem key needs to be replaced by the name of your-pem.pem key.
aws ec2 run-instances --debug --key-name your-pem --security-group-ids launch-wizard-with-tcp-access --dry-run --region us-east-1 --cli-input-json file://runinstances-config.json
Command that works for UNC’s security group and pem key:
aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --dry-run --ebs-optimized --cpu-options CoreCount=16,ThreadsPerCore=1 --cli-input-json file://runinstances-config.io2.c6a.8xlarge.json
Once you have verified that the command above works with the --dry-run option, rerun it without that option as follows.
aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --ebs-optimized --cpu-options CoreCount=16,ThreadsPerCore=1 --cli-input-json file://runinstances-config.io2.c6a.8xlarge.json
Use q to quit to return to the command prompt.
Example of security group inbound and outbound rules required to connect to EC2 instance via ssh.
Additional resources
CLI commands to create Security Group
Use the following command to obtain the public IP address of the machine.#
This command is commented out, as the instance hasn't been created yet; the instructions are kept for documentation purposes.
aws ec2 describe-instances --region=us-east-1 --filters "Name=image-id,Values=ami-088f82f334dde0c9f" | grep PublicIpAddress
Login to the ec2 instance (may need to wait 5 minutes for the ec2 instance to initialize and be ready for login)#
Note, the following command must be modified to specify your key, and ip address (obtained from the previous command):
ssh -v -Y -i ~/downloads/your-pem.pem ubuntu@ip.address
Login to the ec2 instance again, so that you have two windows logged into the machine.#
ssh -Y -i ~/your-pem.pem ubuntu@your-ip-address
Load the environment modules#
module avail
module load ioapi-3.2/gcc-11.3.0-netcdf mpi/openmpi-4.1.2 netcdf-4.8.1/gcc-11.3
Update the pcluster-cmaq repo using git#
cd /shared/pcluster-cmaq
git pull
Run CMAQv5.3.3 for 2016_12SE1 1 Day benchmark Case#
GRIDDESC
' '
'LamCon_40N_97W'
2 33.000 45.000 -97.000 -97.000 40.000
' '
'SE53BENCH'
'LamCon_40N_97W' 792000.000 -1080000.000 12000.000 12000.000 100 80 1
'2016_12SE1'
'LamCon_40N_97W' 792000.000 -1080000.000 12000.000 12000.000 100 80 1
Edit the run script to run on 16 cores#
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/
cp run_cctm_Bench_2016_12SE1.csh run_cctm_Bench_2016_12SE1.16pe.csh
change NPCOLxNPROW to 4x4
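NPCOL x NPROW must equal the number of MPI processes used for the run (16 on this instance). A sketch of the edited line in the csh run script, assuming the stock script sets a 2x2 decomposition:
# 4 x 4 decomposition = 16 MPI processes
@ NPCOL = 4; @ NPROW = 4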
Use command line to submit the job. This single virtual machine does not have a job scheduler such as slurm installed.#
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/
./run_cctm_Bench_2016_12SE1.16pe.csh |& tee ./run_cctm_Bench_2016_12SE1.16pe.log
Use HTOP to view performance.#
htop
output
Successful output#
Find the InstanceID using the following command on your local machine.#
aws ec2 describe-instances --region=us-east-1 | grep InstanceId
Output
i-xxxx
Stop the instance#
aws ec2 stop-instances --region=us-east-1 --instance-ids i-xxxx
If the instance was launched as a one-time Spot Instance, the stop command returns the following error message.
aws ec2 stop-instances --region=us-east-1 --instance-ids i-041a702cc9f7f7b5d
An error occurred (UnsupportedOperation) when calling the StopInstances operation: You can’t stop the Spot Instance ‘i-041a702cc9f7f7b5d’ because it is associated with a one-time Spot Instance request. You can only stop Spot Instances associated with persistent Spot Instance requests.
Not sure how to make a persistent Spot Instance request.
Terminate Instance#
aws ec2 terminate-instances --region=us-east-1 --instance-ids i-xxxx
Verify that the instance is being shut down.#
aws ec2 describe-instances --region=us-east-1
Learn how to Use AWS CLI to launch c6a.48xlarge EC2 instance using Public AMI#
Public AMI contains the software and data to run 2016_12SE1 using CMAQv5.3.3#
Software was pre-installed and saved to a public ami.
The input data was also transferred from the AWS Open Data Program and installed on the EBS volume.
This chapter describes the process that was used to test and configure the c6a.48xlarge ec2 instance to run CMAQv5.3.3 for the 2016_12SE1 domain.
Todo: Need to create command line options to copy a public ami to a different region.
Verify that you can see the public AMI on the us-east-1 region.#
aws ec2 describe-images --region us-east-1 --image-id ami-051ba52c157e4070c
Output:
{
"Images": [
{
"Architecture": "x86_64",
"CreationDate": "2023-06-26T18:17:08.000Z",
"ImageId": "ami-088f82f334dde0c9f",
"ImageLocation": "440858712842/EC2CMAQv54io2_12LISTOS-training_12NE3_12US1",
"ImageType": "machine",
"Public": true,
"OwnerId": "440858712842",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [
{
"DeviceName": "/dev/sda1",
"Ebs": {
"DeleteOnTermination": true,
"Iops": 100000,
"SnapshotId": "snap-042b05034228ec830",
"VolumeSize": 500,
"VolumeType": "io2",
"Encrypted": false
}
},
{
"DeviceName": "/dev/sdb",
"VirtualName": "ephemeral0"
},
{
"DeviceName": "/dev/sdc",
"VirtualName": "ephemeral1"
}
],
"EnaSupport": true,
"Hypervisor": "xen",
"Name": "EC2CMAQv54io2_12LISTOS-training_12NE3_12US1",
"RootDeviceName": "/dev/sda1",
"RootDeviceType": "ebs",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm",
"DeprecationTime": "2025-06-26T18:17:08.000Z"
}
]
}
Use q to exit out of the command line
Note, the AMI uses the maximum value available on io2 for Iops of 100000.
AWS Resources for the aws cli method to launch ec2 instances.#
Tutorial Launch Spot Instances
(note that the tutorial discourages using run-instances to launch Spot Instances, but it does provide an example method)
Launching EC2 Spot Instances using Run Instances API
Additional resources for spot instance provisioning.
To launch a Spot Instance with RunInstances API you create the configuration file as described below:
cat <<EoF > ./runinstances-config.json
{
"DryRun": false,
"MaxCount": 1,
"MinCount": 1,
"InstanceType": "c6a.48xlarge",
"ImageId": "ami-088f82f334dde0c9f",
"InstanceMarketOptions": {
"MarketType": "spot"
},
"TagSpecifications": [
{
"ResourceType": "instance",
"Tags": [
{
"Key": "Name",
"Value": "EC2SpotCMAQv54"
}
]
}
]
}
EoF
Use the publicly available AMI to launch an on-demand c6a.48xlarge ec2 instance using a gp3 volume with 16000 IOPS and hyperthreading disabled#
Note, we will be using a json file that has been preconfigured to specify the ImageId
Obtain the code using git#
git clone -b main https://github.com/CMASCenter/pcluster-cmaq
cd pcluster-cmaq/json
Note, you will need to obtain a security group id from your IT administrator that allows ssh login access. If this is enabled by default, then you can remove the --security-group-ids launch-wizard-with-tcp-access
Example command: note launch-wizard-with-tcp-access needs to be replaced by your security group ID, and your-pem key needs to be replaced by the name of your-pem.pem key.
aws ec2 run-instances --debug --key-name your-pem --security-group-ids launch-wizard-with-tcp-access --dry-run --region us-east-1 --cli-input-json file://runinstances-config.json
Command that works for UNC’s security group and pem key:
aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --dry-run --ebs-optimized --cpu-options CoreCount=96,ThreadsPerCore=1 --cli-input-json file://runinstances-config.io2.json
Once you have verified that the command above works with the --dry-run option, rerun it without that option as follows.
aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --ebs-optimized --cpu-options CoreCount=96,ThreadsPerCore=1 --cli-input-json file://runinstances-config.io2.json
Example of security group inbound and outbound rules required to connect to EC2 instance via ssh.
Additional resources
CLI commands to create Security Group
Use the following command to obtain the public IP address of the machine.#
This command is commented out, as the instance hasn't been created yet; the instructions are kept for documentation purposes.
aws ec2 describe-instances --region=us-east-1 --filters "Name=image-id,Values=ami-088f82f334dde0c9f" | grep PublicIpAddress
Login to the ec2 instance#
Note, the following command must be modified to specify your key and IP address (obtained from the previous command). You will get a connection refused error if you try to log in before the ec2 instance is ready to run (initialization takes ~5 minutes).
ssh -v -Y -i ~/downloads/your-pem.pem ubuntu@ip.address
Login to the ec2 instance again, so that you have two windows logged into the machine.#
ssh -Y -i ~/downloads/your-pem.pem ubuntu@your-ip-address
Load the environment modules#
module avail
module load ioapi-3.2/gcc-11.3.0-netcdf mpi/openmpi-4.1.2 netcdf-4.8.1/gcc-11.3
Update the pcluster-cmaq repo using git#
cd /shared/pcluster-cmaq
git pull
Run CMAQv5.3.3 for 2016_12SE1 1 Day benchmark Case on 4 pe#
' '
'LamCon_40N_97W'
2 33.000 45.000 -97.000 -97.000 40.000
' '
'SE53BENCH'
'LamCon_40N_97W' 792000.000 -1080000.000 12000.000 12000.000 100 80 1
'2016_12SE1'
'LamCon_40N_97W' 792000.000 -1080000.000 12000.000 12000.000 100 80 1
Use command line to submit the job. This single virtual machine does not have a job scheduler such as slurm installed.#
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/
./run_cctm_Bench_2016_12SE1.csh |& tee run_cctm_Bench_2016_12SE1.log
Use HTOP to view performance.#
htop
output
Successful output#
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2016-07-01
End Day: 2016-07-01
Number of Simulation Days: 1
Domain Name: 2016_12SE1
Number of Grid Cells: 280000 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 4
All times are in seconds.
Num Day Wall Time
01 2016-07-01 2083.32
Total Time = 2083.32
Avg. Time = 2083.32
Use lscpu to confirm that there are 8 processors on the c6a.2xlarge ec2 instance that was created with hyperthreading turned on.#
lscpu
Output:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R13 Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 1
BogoMIPS: 5300.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 2 MiB (4 instances)
L3: 16 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
Run 12US2 benchmark again using gp3 volume#
Stop the instance#
aws ec2 stop-instances --region=us-east-1 --instance-ids i-xxxx
If the instance was launched as a one-time Spot Instance, the stop command returns the following error message.
aws ec2 stop-instances --region=us-east-1 --instance-ids i-041a702cc9f7f7b5d
An error occurred (UnsupportedOperation) when calling the StopInstances operation: You can’t stop the Spot Instance ‘i-041a702cc9f7f7b5d’ because it is associated with a one-time Spot Instance request. You can only stop Spot Instances associated with persistent Spot Instance requests.
Not sure how to make a persistent Spot Instance request.
Terminate Instance#
aws ec2 terminate-instances --region=us-east-1 --instance-ids i-xxxx
Verify that the instance is being shut down.#
aws ec2 describe-instances --region=us-east-1
Documentation of Troubleshooting effort for CMAQv5.4+ on 12US1#
Public AMI contains the software and data to run 12US1 using CMAQv5.4+#
Software was pre-installed and saved to a public ami.
The input data was also transferred from the AWS Open Data Program and installed on the EBS volume.
This chapter describes the process that was used to test and configure the c6a.48xlarge ec2 instance to run CMAQv5.4 for the 12US1 domain.
Todo: Need to create command line options to copy a public ami to a different region.
Verify that you can see the public AMI on the us-east-1 region.#
aws ec2 describe-images --region us-east-1 --image-id ami-0aaa0cfeb5ed5763c
Output:
{
"Images": [
{
"Architecture": "x86_64",
"CreationDate": "2023-06-07T02:52:26.000Z",
"ImageId": "ami-0aaa0cfeb5ed5763c",
"ImageLocation": "440858712842/cmaqv5.4_c6a.48xlarge",
"ImageType": "machine",
"Public": true,
"OwnerId": "440858712842",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [
{
"DeviceName": "/dev/sda1",
"Ebs": {
"DeleteOnTermination": true,
"Iops": 4000,
"SnapshotId": "snap-0c2f11a82e76aac9b",
"VolumeSize": 500,
"VolumeType": "gp3",
"Throughput": 1000,
"Encrypted": false
}
},
{
"DeviceName": "/dev/sdb",
"VirtualName": "ephemeral0"
},
{
"DeviceName": "/dev/sdc",
"VirtualName": "ephemeral1"
}
],
"EnaSupport": true,
"Hypervisor": "xen",
"Name": "cmaqv5.4_c6a.48xlarge",
"RootDeviceName": "/dev/sda1",
"RootDeviceType": "ebs",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm",
"DeprecationTime": "2025-06-07T02:52:26.000Z"
}
]
}
Note that the above AMI has a maximum throughput limit of 1000 MiB/s, but an IOPS limit of 4000, which caused the I/O issues documented below.
The solution is to update the volume to use the maximum gp3 IOPS value of 16000, and then save the EC2 instance as a new AMI that will have the highest IOPS and throughput for the gp3 VolumeType. The following is a screenshot of the option to do this within the AWS Web Interface. I will work on documenting a method to do this from the command line, but this will be saved for the advanced tutorial.
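A possible command line approach, not yet tested in this tutorial, uses aws ec2 modify-volume; the volume ID below is a placeholder for the root volume of the instance.
# raise the gp3 volume to its maximum provisioned performance
aws ec2 modify-volume --region us-east-1 --volume-id vol-xxxxxxxx --volume-type gp3 --iops 16000 --throughput 1000
# monitor the modification until the state is "completed"
aws ec2 describe-volumes-modifications --region us-east-1 --volume-ids vol-xxxxxxxx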
AWS Resources for the aws cli method to launch ec2 instances.#
Tutorial Launch Spot Instances
(note that the tutorial discourages using run-instances to launch Spot Instances, but it does provide an example method)
Launching EC2 Spot Instances using Run Instances API
Additional resources for spot instance provisioning.
To launch a Spot Instance with RunInstances API you create the configuration file as described below:
cat <<EoF > ./runinstances-config.json
{
"DryRun": false,
"MaxCount": 1,
"MinCount": 1,
"InstanceType": "c6a.48xlarge",
"ImageId": "ami-0aaa0cfeb5ed5763c",
"InstanceMarketOptions": {
"MarketType": "spot"
},
"TagSpecifications": [
{
"ResourceType": "instance",
"Tags": [
{
"Key": "Name",
"Value": "EC2SpotCMAQv54"
}
]
}
]
}
EoF
Use a publically available AMI to launch a c6a.48xlarge ec2 instance using a gp3 volume with 16000 IOPS#
Launch a new instance using the AMI with the software loaded and request a spot instance for the c6a.48xlarge EC2 instance
Note, we will be using a json file that has been preconfigured to specify the ImageId
cd /shared/pcluster-cmaq
Note, you will need to obtain a security group id from your IT administrator that allows ssh login access. If this is enabled by default, then you can remove the --security-group-ids launch-wizard-with-tcp-access
Example command: note launch-wizard-with-tcp-access needs to be replaced by your security group ID, and your-pem key needs to be replaced by the name of your-pem.pem key.
aws ec2 run-instances --debug --key-name your-pem --security-group-ids launch-wizard-with-tcp-access --dry-run --region us-east-1 --cli-input-json file://runinstances-config.json
Command that works for UNC’s security group and pem key:
aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --dry-run --ebs-optimized --cpu-options CoreCount=96,ThreadsPerCore=1 --cli-input-json file://runinstances-config.hyperthread-off.16000IOPS.json
Once you have verified that the command above works with the --dry-run option, rerun it without that option as follows.
aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --ebs-optimized --cpu-options CoreCount=96,ThreadsPerCore=1 --cli-input-json file://runinstances-config.hyperthread-off.16000IOPS.json
Example of security group inbound and outbound rules required to connect to EC2 instance via ssh.
(I am not sure if you can create a security group rule from the aws command line.)
Additional resources
CLI commands to create Security Group
Use the following command to obtain the public IP address of the machine.#
This command is commented out, as the instance hasn't been created yet; the instructions are kept for documentation purposes.
aws ec2 describe-instances --region=us-east-1 --filters "Name=image-id,Values=ami-0aaa0cfeb5ed5763c" | grep PublicIpAddress
Login to the ec2 instance#
Note, the following command must be modified to specify your key, and ip address (obtained from the previous command):
ssh -v -Y -i ~/downloads/your-pem.pem ubuntu@ip.address
Load the environment modules#
module avail
module load ioapi-3.2/gcc-11.3.0-netcdf mpi/openmpi-4.1.2 netcdf-4.8.1/gcc-11.3
Run CMAQv5.4 for the 12km Listos Training Case#
Input data is available for a subdomain of the 12km 12US1 case.
GRIDDESC
'2018_12Listos'
'LamCon_40N_97W' 1812000.000 240000.000 12000.000 12000.000 25 25 1
cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts
./run_cctm_2018_12US1_listos_32pe.csh |& tee ./run_cctm_2018_12US1_listos_32pe.log
Successful output:
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-08-05
End Day: 2018-08-07
Number of Simulation Days: 3
Domain Name: 2018_12Listos
Number of Grid Cells: 21875 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 32
All times are in seconds.
Num Day Wall Time
01 2018-08-05 69.9
02 2018-08-06 64.7
03 2018-08-07 66.5
Total Time = 201.10
Avg. Time = 67.03
Run CMAQv5.4 for the full 12US1 Domain on c6a.48xlarge with 192 vcpus#
GRIDDESC
' ' ! end coords. grids: name; xorig yorig xcell ycell ncols nrows nthik
'12US1'
'LAM_40N97W' -2556000. -1728000. 12000. 12000. 459 299 1
Input data for the 12US1 domain is available for a 2 day benchmark in both netCDF-4 compressed (.nc4) and classic netCDF-3 (.nc) formats. The 96 pe run on the c6a.48xlarge instance will take approximately 120 minutes for 1 day, or 240 minutes for the full 2 day benchmark.
Options that were used to disable multithreading:
--cpu-options (structure)
The CPU options for the instance. For more information, see Optimize CPU options in the Amazon EC2 User Guide .
CoreCount -> (integer)
The number of CPU cores for the instance.
ThreadsPerCore -> (integer)
The number of threads per CPU core. To disable multithreading for the instance, specify a value of 1 . Otherwise, specify the default value of 2 .
--cpu-options CoreCount=integer,ThreadsPerCore=integer,AmdSevSnp=string
JSON Syntax:
{
"CoreCount": integer,
"ThreadsPerCore": integer,
"AmdSevSnp": "enabled"|"disabled"
}
Use command line to submit the job. This single virtual machine does not have a job scheduler such as slurm installed.#
cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts
./run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.8x12.ncclassic.csh |& tee ./run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.8x12.ncclassic.2nd.log
Spot Pricing cost for Linux in US East Region
c6a.48xlarge $5.88 per Hour
Rerunning the 12US1 case on 8x12 processors, for a total of 96 processors.
It took about 39 minutes of initial I/O before the model started when using this gp3 AMI. Fahim was not able to reproduce this performance issue, and I am not sure how to diagnose it. When I upgraded the AMI to use an io2 disk, this poor I/O issue was resolved.
Once the model starts running (look for Processing completed … messages in the log file), use htop to view the CPU usage.#
Login to the virtual machine and then run the following command.
htop
Using Cloudwatch to see the CPU utilization.#
Note that we are using 96 pes of the 192 virtual cpus, so the maximum cpu utilization reported would be 50%.
Successful run output, but it is taking too long (twice as long as on the Parallel Cluster).
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2017-12-22
End Day: 2017-12-23
Number of Simulation Days: 2
Domain Name: 12US1
Number of Grid Cells: 4803435 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 96
All times are in seconds.
Num Day Wall Time
01 2017-12-22 6320.8
02 2017-12-23 5409.6
Total Time = 11730.40
Avg. Time = 5865.20
Perhaps the instance is being i/o throttled?
ebs-volume-io-queue-latency-issues
Trying this CloudWatch Report
This report is saying that the maximum throughput for this gp3 volume is 1,000 MiB/s, and the baseline throughput limit is 125 MiB/s. Need to run this same report for the io2 volume, and see what the values are.
Volume ID: vol-050662148aef41b8f
Instance ID: i-0c2615494c0a89ea9
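The same throughput numbers can also be pulled from the command line with CloudWatch. This is a sketch using the volume ID above; the time window is a placeholder and should be set to bracket the model run.
aws cloudwatch get-metric-statistics --region us-east-1 --namespace AWS/EBS --metric-name VolumeWriteBytes --dimensions Name=VolumeId,Value=vol-050662148aef41b8f --start-time 2023-06-07T00:00:00Z --end-time 2023-06-07T06:00:00Z --period 300 --statistics Sum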
You can use the AWS Web Interface to get an estimate of the savings of using a SPOT versus OnDEMAND Instance.
Save volume as a snapshot#
Saving the volume as a snapshot so that I have a copy of the log files showing the poor performance of the spot instance. After the snapshot is created, I will delete the instance. The snapshot name is c6a.48xlarge.cmaqv54.spot, snap-0cc3df82ba5bf5da8
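The snapshot can also be created from the CLI; this is a sketch using the volume ID noted above, and the tag value is just an example.
aws ec2 create-snapshot --region us-east-1 --volume-id vol-050662148aef41b8f --description "c6a.48xlarge.cmaqv54.spot log files" --tag-specifications 'ResourceType=snapshot,Tags=[{Key=Name,Value=c6a.48xlarge.cmaqv54.spot}]'
# check snapshot progress
aws ec2 describe-snapshots --region us-east-1 --owner-ids self --query 'Snapshots[].[SnapshotId,Progress,State]' --output table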
Clean up Virtual Machine#
Find the InstanceID using the following command on your local machine.#
## aws ec2 describe-instances --region=us-east-1 | grep InstanceId
Output
i-xxxx
Terminate the instance#
## aws ec2 terminate-instances --region=us-east-1 --instance-ids i-xxxx
Create c6a.48xlarge with hyperthreading disabled#
## aws ec2 run-instances --debug --key-name cmaqv5.4 --security-group-ids launch-wizard-179 --region us-east-1 --ebs-optimized --dry-run --cpu-options CoreCount=96,ThreadsPerCore=1 --cli-input-json file://runinstances-config.json
(note, take out the --dry-run option after you try and verify it works)
Obtain the public IP address for the virtual machine
## aws ec2 describe-instances --region=us-east-1 --filters "Name=image-id,Values=ami-0aaa0cfeb5ed5763c" | grep PublicIpAddress
Login to the machine
## ssh -v -Y -i ~/your-pem.pem ubuntu@your-ip-address
Retry the Listos run script.#
## cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts
## ./run_cctm_2018_12US1_listos_32pe.csh |& tee ./run_cctm_2018_12US1_listos_32pe.log
Use HTOP to view performance.#
htop
output
Successful output#
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-08-05
End Day: 2018-08-07
Number of Simulation Days: 3
Domain Name: 2018_12Listos
Number of Grid Cells: 21875 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 32
All times are in seconds.
Num Day Wall Time
01 2018-08-05 87.6
02 2018-08-06 77.9
03 2018-08-07 77.2
Total Time = 242.70
Avg. Time = 80.90
Retried the 12US1 benchmark case but the i/o was still too slow.
Used the AWS Web Interface to upgrade to an io1 system#
After upgrading to the io1 volume, the performance was much improved.
Now, we need to examine the cost, and whether it would cost less for an io2 volume.
Additional information about how to calculate storage pricing.
Good comparison of EBS vs EFS, and discussion of using Cloud Volumes ONTAP for data tiering between S3 Buckets and EBS volumes.
Comparison between EBS and EFS
The AWS CLI can also be used to modify the volume, as per these instructions.
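For example, a sketch of switching the volume shown above to io1 with the AWS CLI (the IOPS value is an arbitrary example; io1 requires --iops to be specified):
aws ec2 modify-volume --region us-east-1 --volume-id vol-050662148aef41b8f --volume-type io1 --iops 3000
aws ec2 describe-volumes-modifications --region us-east-1 --volume-ids vol-050662148aef41b8f
The describe-volumes-modifications command can be repeated until the modification state reaches optimizing or completed.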
Output
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2017-12-22
End Day: 2017-12-23
Number of Simulation Days: 2
Domain Name: 12US1
Number of Grid Cells: 4803435 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 96
All times are in seconds.
Num Day Wall Time
01 2017-12-22 3045.2
02 2017-12-23 3351.8
Total Time = 6397.00
Avg. Time = 3198.50
Saved the EC2 instance as an AMI and made that AMI public.
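A sketch of the equivalent AWS CLI steps (the instance ID and AMI name are placeholders):
aws ec2 create-image --region us-east-1 --instance-id i-xxxx --name "cmaqv5.4_c6a.48xlarge_io1"
aws ec2 modify-image-attribute --region us-east-1 --image-id ami-xxxx --launch-permission "Add=[{Group=all}]"
The modify-image-attribute command is what makes the AMI public.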
Use new ami instance with faster storage (io1) to create c6a.48xlarge ec2 instance#
Note: these commands should work, using the runinstances-config.json file that is in the /shared/pcluster-cmaq directory (it has already been edited to specify the AMI listed below).
Your key pair (your-key.pem) and the runinstances-config.json file should be copied to the same directory before using the AWS CLI instructions below.
New AMI instance name to use for CMAQv5.4 on c6a.48xlarge using 500 GB io1 Storage.
ami-031a6e4499abffdb6
Edit runinstances-config.json to use the new ami.
Add the following line:
"ImageId": "ami-031a6e4499abffdb6",
Create new instance#
Note, you will need to obtain a security group ID from your IT administrator that allows ssh login access. If this is enabled by default, then you can remove the --security-group-ids your-security-group-with-ssh-access-to-Instance option.
Note, you will need to create or have a keypair that will be used to login to the ec2 instance that you create.
Create c6a.48xlarge instance:
aws ec2 run-instances --debug --key-name your-pem --security-group-ids your-security-group-with-ssh-access-to-Instance --region us-east-1 --ebs-optimized --dry-run --cpu-options CoreCount=96,ThreadsPerCore=1 --cli-input-json file://runinstances-config.json
(Remove the --dry-run option after you see the following message:)
botocore.exceptions.ClientError: An error occurred (DryRunOperation) when calling the RunInstances operation: Request would have succeeded, but DryRun flag is set.
Re-try creating the c6a.48xlarge instance without the --dry-run option:
aws ec2 run-instances --debug --key-name your-pem --security-group-ids your-security-group-with-ssh-access-to-Instance --region us-east-1 --ebs-optimized --cpu-options CoreCount=96,ThreadsPerCore=1 --cli-input-json file://runinstances-config.json
Check that the ec2 instance is running using the following command.#
aws ec2 describe-instances --region=us-east-1
Use the following command to obtain the IP address#
aws ec2 describe-instances --region=us-east-1 | grep PublicIpAddress
Login#
ssh -v -Y -i ~/your-pem.pem ubuntu@your-publicIpAddress
Load environment modules#
module avail
module load ioapi-3.2/gcc-11.3.0-netcdf mpi/openmpi-4.1.2 netcdf-4.8.1/gcc-11.3
Change to the scripts directory#
cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts/
Use lscpu to confirm that there are 96 processors on the c6a.48xlarge ec2 instance that was created with hyperthreading turned off.#
lscpu
Output:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R13 Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 2
Stepping: 1
BogoMIPS: 5299.98
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxs
r_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq m
onitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_l
egacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 a
vx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat
npt nrip_save vaes vpclmulqdq rdpid
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 3 MiB (96 instances)
L1i: 3 MiB (96 instances)
L2: 48 MiB (96 instances)
L3: 384 MiB (12 instances)
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
Login to the ec2 instance again, so that you have two windows logged into the machine.#
ssh -Y -i ~/your-pem.pem ubuntu@your-ip-address
Run 12US1 Listos Training 3 Day benchmark Case on 32 pe (this will take less than 2 minutes)#
./run_cctm_2018_12US1_listos_32pe.csh |& tee ./run_cctm_2018_12US1_listos_32pe.2nd.log
Successful output#
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-08-05
End Day: 2018-08-07
Number of Simulation Days: 3
Domain Name: 2018_12Listos
Number of Grid Cells: 21875 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 32
All times are in seconds.
Num Day Wall Time
01 2018-08-05 35.7
02 2018-08-06 35.2
03 2018-08-07 36.1
Total Time = 107.00
Avg. Time = 35.66
Download input data for 12NE3 1 day Benchmark case#
Instructions to copy data from the s3 bucket to the ec2 instance and run this benchmark.
cd /shared/pcluster-cmaq/
Examine the command line options that are used to download the data. Note that we can use the --no-sign-request option, as the data is available from the CMAS Open Data Warehouse on AWS.
cat s3_copy_12NE3_Bench.csh
Output
#!/bin/csh -f
#Script to download enough data to run START_DATE 201522 and END_DATE 201523 for 12km Northeast Domain
#Requires installing aws command line interface
#https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html#cliv2-linux-install
#Total storage required is 56 G
setenv AWS_REGION "us-east-1"
aws s3 cp --no-sign-request --recursive s3://cmas-cmaq/CMAQv5.4_2018_12NE3_Benchmark_2Day_Input /shared/data/
Use the aws s3 copy command to copy data from the CMAS Data Warehouse Open Data S3 bucket.#
./s3_copy_12NE3_Bench.csh
Edit the 12NE3 Benchmark run script to use the gcc compiler and to output all species to the CONC output file.#
vi run_cctm_Bench_2018_12NE3.c6a48xlarge.csh
change
setenv compiler intel
to
setenv compiler gcc
Comment out the CONC_SPCS setting that limits them to only 12 species
# setenv CONC_SPCS "O3 NO ANO3I ANO3J NO2 FORM ISOP NH3 ANH4I ANH4J ASO4I ASO4J"
Run the 12NE3 Benchmark case#
./run_cctm_Bench_2018_12NE3.c6a48xlarge.csh |& tee ./run_cctm_Bench_2018_12NE3.c6a48xlarge.32pe.log
Successful output with 12 species written to the 3-D CONC file; the 1 day run took 7.4 minutes#
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-07-01
End Day: 2018-07-01
Number of Simulation Days: 1
Domain Name: 2018_12NE3
Number of Grid Cells: 367500 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 32
All times are in seconds.
Num Day Wall Time
01 2018-07-01 445.19
Total Time = 445.19
Avg. Time = 445.19
Successful output with all species (222 variables) written to the 3-D CONC file#
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-07-01
End Day: 2018-07-01
Number of Simulation Days: 1
Domain Name: 2018_12NE3
Number of Grid Cells: 367500 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 32
All times are in seconds.
Num Day Wall Time
01 2018-07-01 444.34
Total Time = 444.34
Avg. Time = 444.34
Todo: look into process pinning (will it make a difference on a single VM when fewer than 96 cores are used?).
Compare to timings available in Table 3-1 Example of job scenarios at EPA for a single day simulation.
Domain | Domain size | Species Tracked | Input files size | Output files size | Run time (# cores)
---|---|---|---|---|---
2018 North East US | 100 X 105 X 35 | 225 | 26GB | 2GB | 15 min/day (32)
Run 12US1 2 day benchmark case on 96 processors#
./run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.12x8.ncclassic.csh |& tee ./run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.12x8.ncclassic.log
Verify that it is using 99% of each of the 96 cores using htop#
htop
Successful run timing#
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2017-12-22
End Day: 2017-12-23
Number of Simulation Days: 2
Domain Name: 12US1
Number of Grid Cells: 4803435 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 96
All times are in seconds.
Num Day Wall Time
01 2017-12-22 3070.4
02 2017-12-23 3386.7
Total Time = 6457.10
Avg. Time = 3228.55
Compare the timing to the output available in the CMAQ User Guide: Running CMAQ.
Find the InstanceID using the following command on your local machine.#
aws ec2 describe-instances --region=us-east-1 | grep InstanceId
Output
i-xxxx
Stop the instance#
aws ec2 stop-instances --region=us-east-1 --instance-ids i-xxxx
The following error message is returned:
aws ec2 stop-instances --region=us-east-1 --instance-ids i-041a702cc9f7f7b5d
An error occurred (UnsupportedOperation) when calling the StopInstances operation: You can't stop the Spot Instance 'i-041a702cc9f7f7b5d' because it is associated with a one-time Spot Instance request. You can only stop Spot Instances associated with persistent Spot Instance requests.
Not sure how to make a persistent Spot Instance request.
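One way to make the request persistent (a sketch, not tested here; these options could also be added to runinstances-config.json as InstanceMarketOptions) is to pass --instance-market-options to run-instances. A persistent Spot Instance created this way can be stopped and restarted rather than terminated on interruption.
aws ec2 run-instances --region us-east-1 --cli-input-json file://runinstances-config.json --instance-market-options 'MarketType=spot,SpotOptions={SpotInstanceType=persistent,InstanceInterruptionBehavior=stop}'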
Terminate Instance#
aws ec2 terminate-instances --region=us-east-1 --instance-ids i-xxxx
Try creating the gp3 version of the AMI using the Nitro Hypervisor, and see if that improves the performance without the cost of the io1 volume.#
No - the Nitro hypervisor is already being used.
"Hypervisor": "xen" - according to the documentation, this value is also reported for instances running on the Nitro hypervisor.
Try creating the gp3 AMI from the web interface, and see whether you can reproduce the performance issues. If it performs well, then use the describe-instances command to see what is different between the AMI created from the web interface and the one created from the command line.
Create a Parallel Cluster and run CMAQv5.3.3#
Why might I need to use ParallelCluster?#
The AWS ParallelCluster may be configured to be the equivalent of a High Performance Computing (HPC) environment, including using job schedulers such as Slurm, running on multiple nodes using code compiled with Message Passing Interface (MPI), and reading and writing output to a high performance, low latency shared disk. The advantage of using the AWS ParallelCluster command line interface is that the compute nodes can be easily scaled up or down to match the compute requirements of a given simulation. In addition, the user can reduce costs by using Spot instances rather than On-Demand for the compute nodes. ParallelCluster also supports submitting multiple jobs to the job submission queue.
Our goal is make this user guide to running CMAQ on a ParallelCluster as helpful and user-friendly as possible. Any feedback is both welcome and appreciated.
Additional information on AWS ParallelCluster:
AWS ParallelCluster documentation
AWS ParallelCluster training video
Introductory Tutorial#
Create a Demo cluster to configure your aws credentials and set up your identity and access management roles.
Introductory Tutorial
Step by Step Instructions to Build a Demo ParallelCluster.#
The goal is for users to get started and make sure they can spin up a node, launch the pcluster and terminate it.
Establish Identity and Permissions#
AWS Identity and Access Management Roles#
Requires the user to have AWS Identity and Access Management roles in AWS ParallelCluster
AWS ParallelCluster uses multiple AWS services to deploy and operate a cluster. See the complete list in the AWS Services used in AWS ParallelCluster section. It appears you can create the demo cluster, and even the intermediate or advanced cluster, but you can’t submit a slurm job and have it provision compute nodes until you have the IAM Policies set for your account. This likely requires the system administrator who has permissions to access the AWS Web Interface with root access to add these policies and then to attach them to each user account.
Use the AWS Web Interface to add a policy called AWSEC2SpotServiceRolePolicy to the account prior to running a job that uses spot pricing on the ParallelCluster.
AWS ParallelCluster CLI 3.0#
Use the AWS ParallelCluster Command Line Interface (CLI) v3.0 to configure and launch a demo cluster.
Requires the user to have a key pair that was created for use with EC2 instances.
See also
Create a virtual environment on a Linux machine to install AWS ParallelCluster
python3 -m virtualenv ~/apc-ve
source ~/apc-ve/bin/activate
python --version
python3 -m pip install --upgrade aws-parallelcluster
pcluster version
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.38.0/install.sh | bash
chmod ug+x ~/.nvm/nvm.sh
source ~/.nvm/nvm.sh
nvm install node
node --version
Run pcluster version.
pcluster version
Output:
{
"version": "3.1.2"
}
Note
If you start a new terminal window, you need to re-activate the virtual environment using the following commands:
source ~/apc-ve/bin/activate
source ~/.nvm/nvm.sh
Verify that the ParallelCluster CLI is working using:
pcluster version
Configure AWS Command line credentials on your local machine#
aws configure
Configure a Demo Cluster#
To create a parallel cluster, a yaml file needs to be created that is unique to your account.#
An example of the yaml file contents is described in the following Diagram:
Figure 1. Diagram of YAML file used to configure a ParallelCluster with a t2.micro head node and t2.micro compute nodes
See also
pcluster configure --config new-hello-world.yaml
Input the following answers at each prompt:
Allowed values for AWS Region ID:
us-east-1
Allowed values for EC2 Key Pair Name:
choose your key pair
Allowed values for Scheduler:
slurm
Allowed values for Operating System:
ubuntu2004
Head node instance type:
t2.micro
Number of queues:
1
Name of queue 1:
queue1
Number of compute resources for queue1 [1]:
1
Compute instance type for compute resource 1 in queue1:
t2.micro
Maximum instance count [10]:
10
Automate VPC creation?:
y
Allowed values for Availability Zone:
1
Allowed values for Network Configuration:
2. Head node and compute fleet in the same public subnet
Beginning VPC creation. Please do not leave the terminal until the creation is finalized
Note
The choice of operating system (specified during the yaml creation, or in an existing yaml file) determines what modules and gcc compiler versions are available.
Centos7 has an older gcc version 4
Ubuntu2004 has gcc version 9+
Alinux or Amazon Linux/Red Hat Linux (haven’t tried)
cat new-hello-world.yaml
Region: us-east-1
Image:
Os: ubuntu2004
HeadNode:
InstanceType: t2.micro
Networking:
SubnetId: subnet-xx-xx-xx <<< unique to your account
Ssh:
KeyName: your-key <<< unique to your account
Scheduling:
Scheduler: slurm
SlurmQueues:
- Name: queue1
ComputeResources:
- Name: t2micro
InstanceType: t2.micro
MinCount: 0
MaxCount: 10
Networking:
SubnetIds:
- subnet-xx-xx-xx <<< unique to your account
Note
The above yaml file is the very simplest form available. If you upgrade the compute node to use a faster compute instance, then you will need to add additional configuration options (networking, Elastic Fabric Adapter) to the yaml file. These modifications will be highlighted in the yaml figures provided in the tutorial.
The key pair and SubnetId in the yaml file are unique to your account. To create the AWS Intermediate ParallelCluster, the key pair and subnet ID from the new-hello-world.yaml file that you created using your account will need to be transferred to the yaml files used in the next section of the tutorial; edit those yaml files to use the key pair and SubnetId that are valid for your AWS account (for example, as shown below).
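For example, to list the two values that need to be carried over:
grep -E 'KeyName|SubnetId' new-hello-world.yaml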
Create a Demo Cluster#
pcluster create-cluster --cluster-configuration new-hello-world.yaml --cluster-name hello-pcluster --region us-east-1
Check on the status of the cluster#
pcluster describe-cluster --region=us-east-1 --cluster-name hello-pcluster
pcluster list-clusters --region=us-east-1
pcluster describe-cluster --region=us-east-1 --cluster-name hello-pcluster
After 5-10 minutes, you see the following status: “clusterStatus”: “CREATE_COMPLETE”
While the cluster has been created, only the t2.micro head node is running. Before any jobs can be submitted to the slurm queue, the compute nodes need to be started.
Note
The compute nodes are not “provisioned” or “created” at this time (so they do not begin to incur costs). The compute nodes are only provisioned when a slurm job is scheduled. After a slurm job is completed, then the compute nodes will be terminated after 5 minutes of idletime.
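You can confirm the compute fleet status from your local machine with the ParallelCluster CLI, or from the head node with sinfo (a sketch; node names and counts will differ):
pcluster describe-compute-fleet --region us-east-1 --cluster-name hello-pcluster
sinfo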
Login and Examine Cluster#
SSH into the cluster#
Note
Replace your-key.pem with your key pair. You will need to change the permissions on your key pair so that it is readable only by the owner.
cd ~
chmod 400 your-key.pem
Example: pcluster ssh -v -Y -i ~/your-key.pem --cluster-name hello-pcluster
pcluster ssh -v -Y -i ~/[your-key-pair] --cluster-name hello-pcluster
The login prompt should look something like the following (this will depend on what OS was chosen in the yaml file).
[ip-xx-x-xx-xxx pcluster-cmaq]
module avail
gcc --version
Need a minimum of gcc 8+ for CMAQ
mpirun --version
Need a minimum openmpi version 4.0.1 for CMAQ
which sbatch
The t2.micro head node is too small for building or running CMAQ; it is used only to demonstrate creating a cluster.
Save the key pair and SubnetId from this new-hello-world.yaml to use in the yaml for the Intermediate Tutorial
exit
Delete the Demo Cluster#
pcluster delete-cluster --cluster-name hello-pcluster --region us-east-1
See also
pcluster --help
CMAQv5.3.3 Intermediate Tutorial#
Run CMAQ on a ParallelCluster using pre-loaded software and input data.
Intermediate Tutorial
Use ParallelCluster pre-installed with software and data.#
Step by step instructions for running the CMAQ 12US2 Benchmark for 2 days on a ParallelCluster.
Obtain YAML file pre-loaded with input data and software#
cd /your/local/machine/install/path/
git clone -b main https://github.com/CMASCenter/pcluster-cmaq.git pcluster-cmaq
cd pcluster-cmaq/yaml
Note
To find the default settings for Lustre see: Lustre Settings for ParallelCluster
Examine Diagram of the YAML file to build pre-installed software and input data.#
Includes Snapshot ID of volume pre-installed with CMAQ software stack and name of S3 Bucket to import data to the Lustre Filesystem
Figure 1. Diagram of YAML file used to configure a ParallelCluster with a c5n.large head node and c5n.18xlarge compute nodes with Software and Data Pre-installed (linked on lustre filesystem)
Edit Yaml file#
This yaml file specifies the /shared directory that contains CMAQv5.3.3 and its libraries, and the input data that will be imported from an S3 bucket to the /fsx lustre file system. Note, the following yaml file uses a c5n.9xlarge compute node and SPOT pricing.
Note
Edit the c5n-9xlarge.ebs_unencrypted_installed_public_ubuntu2004.fsx_import_opendata.yaml file to specify your subnet-id and your keypair prior to creating the cluster
vi c5n-9xlarge.ebs_unencrypted_installed_public_ubuntu2004.fsx_import_opendata.yaml
Output:
Region: us-east-1
Image:
Os: ubuntu2004
HeadNode:
InstanceType: c5n.large
Networking:
SubnetId: subnet-xx-xx-xx <<< replace subnetID
DisableSimultaneousMultithreading: true
Ssh:
KeyName: your-key <<< replace keyname
Scheduling:
Scheduler: slurm
SlurmSettings:
ScaledownIdletime: 5
SlurmQueues:
- Name: queue1
CapacityType: SPOT
Networking:
SubnetIds:
- subnet-xx-xx-xxx <<< replace subnetID
PlacementGroup:
Enabled: true
ComputeResources:
- Name: compute-resource-1
InstanceType: c5n.9xlarge
MinCount: 0
MaxCount: 10
DisableSimultaneousMultithreading: true
Efa:
Enabled: true
GdrSupport: false
SharedStorage:
- MountDir: /shared
Name: ebs-shared
StorageType: Ebs
EbsSettings:
SnapshotId: snap-017568d24a4cedc83
- MountDir: /fsx
Name: name2
StorageType: FsxLustre
FsxLustreSettings:
StorageCapacity: 1200
ImportPath: s3://cmas-cmaq-conus2-benchmark/data/CMAQ_Modeling_Platform_2016/CONUS
Create CMAQ ParallelCluster with software/data pre-installed#
pcluster create-cluster --cluster-configuration c5n-9xlarge.ebs_unencrypted_installed_public_ubuntu2004.fsx_import_opendata.yaml --cluster-name cmaq --region us-east-1
Output:
{
"cluster": {
"clusterName": "cmaq",
"cloudformationStackStatus": "CREATE_IN_PROGRESS",
"cloudformationStackArn": "arn:aws:cloudformation:us-east-1:440858712842:stack/cmaq/6cfb1a50-6e99-11ec-8af1-0ea2256597e5",
"region": "us-east-1",
"version": "3.0.2",
"clusterStatus": "CREATE_IN_PROGRESS"
}
}
Check status again
pcluster describe-cluster --region=us-east-1 --cluster-name cmaq
Output:
{
"creationTime": "2022-01-06T02:36:18.119Z",
"version": "3.0.2",
"clusterConfiguration": {
"url": "
},
"tags": [
{
"value": "3.0.2",
"key": "parallelcluster:version"
}
],
"cloudFormationStackStatus": "CREATE_IN_PROGRESS",
"clusterName": "cmaq",
"computeFleetStatus": "UNKNOWN",
"cloudformationStackArn":
"lastUpdatedTime": "2022-01-06T02:36:18.119Z",
"region": "us-east-1",
"clusterStatus": "CREATE_IN_PROGRESS"
}
After 5-10 minutes, check the status again and recheck until you see the following status: “clusterStatus”: “CREATE_COMPLETE”
Check status again
pcluster describe-cluster --region=us-east-1 --cluster-name cmaq
Output:
"cloudFormationStackStatus": "CREATE_COMPLETE",
"clusterName": "cmaq",
"computeFleetStatus": "RUNNING",
"cloudformationStackArn": "arn:aws:cloudformation:us-east-1:440858712842:stack/cmaq/3cd2ba10-c18f-11ec-9f57-0e9b4dd12971",
"lastUpdatedTime": "2022-04-21T16:22:28.879Z",
"region": "us-east-1",
"clusterStatus": "CREATE_COMPLETE"
Start the compute nodes, if the computeFleetStatus is not set to RUNNING
pcluster update-compute-fleet --region us-east-1 --cluster-name cmaq --status START_REQUESTED
Log into the new cluster#
Note
replace your-key.pem with your Key Name
pcluster ssh -v -Y -i ~/your-key.pem --cluster-name cmaq
Change shell to use tcsh#
sudo usermod -s /bin/tcsh ubuntu
Log out and then log back in to have the shell take effect.
Verify Software#
The software is pre-loaded on the /shared volume of the ParallelCluster. The software was previously loaded and saved to the snapshot.
ls /shared/build
Create a .cshrc file by copying it from the git repo that is on /shared/pcluster-cmaq
cp /shared/pcluster-cmaq/install/dot.cshrc.pcluster ~/.cshrc
Source shell
csh
Load the modules
module avail
Output:
------------------------------------------------------------ /usr/share/modules/modulefiles -------------------------------------------------------------
dot libfabric-aws/1.13.2amzn1.0 module-git module-info modules null openmpi/4.1.1 use.own
Load the modules openmpi and libfabric
module load openmpi/4.1.1
module load libfabric-aws/1.13.2amzn1.0
Verify Input Data#
The input data was imported from the S3 bucket to the lustre file system (/fsx).
cd /fsx/data/CMAQ_Modeling_Platform_2016/CONUS/12US2/
Notice that the data doesn't take up much space; it must be linked rather than copied.
du -h
Output:
27K ./land
33K ./MCIP
28K ./emissions/ptegu
55K ./emissions/ptagfire
27K ./emissions/ptnonipm
55K ./emissions/ptfire_othna
27K ./emissions/pt_oilgas
26K ./emissions/inln_point/stack_groups
51K ./emissions/inln_point
28K ./emissions/cmv_c1c2_12
28K ./emissions/cmv_c3_12
28K ./emissions/othpt
55K ./emissions/ptfire
407K ./emissions
27K ./icbc
518K .
Change the group and ownership permissions on the /fsx/data directory
sudo chown ubuntu /fsx/data
sudo chgrp ubuntu /fsx/data
Create the output directory
mkdir -p /fsx/data/output
Examine CMAQ Run Scripts#
The run scripts are available in two locations: one copy is in the CMAQ scripts directory, and another is in the pcluster-cmaq repo. Do a git pull to obtain the latest scripts in the pcluster-cmaq repo.
cd /shared/pcluster-cmaq
git pull
Verify that the run scripts are updated and pre-configured for the parallel cluster by comparing with what is available in the github repo
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts
Example:
diff /shared/pcluster-cmaq/run_scripts/cmaq533/c5n.9xlarge/run_cctm_2016_12US2.108pe.6x18.pcluster.csh .
If a run script is missing or outdated, copy the run scripts from the repo. Note, there are different run scripts depending on what compute node is used. This tutorial assumes c5n.9xlarge is the compute node.
cp /shared/pcluster-cmaq/run_scripts/cmaq533/c5n.9xlarge/run*pcluster.csh /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/
Note
The time that it takes the 2 day CONUS benchmark to run will vary based on the number of CPUs used, and the compute node that is being used. See Figure 3 Benchmark Scaling Plot for c5n.18xlarge and c5n.9xlarge in chapter 11 for reference.
Examine how the run script is configured
head -n 30 /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctm_2016_12US2.108pe.6x18.pcluster.csh
#!/bin/csh -f
## For c5n.9xlarge (36 vcpu - 18 cpu)
## works with cluster-ubuntu.yaml
## data on /fsx directory
#SBATCH --nodes=6
#SBATCH --ntasks-per-node=18
#SBATCH --exclusive
#SBATCH -J CMAQ
#SBATCH -o /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctmv5.3.3_Bench_2016_12US2.108.9x12pe.2day.sharedvol.log
#SBATCH -e /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctmv5.3.3_Bench_2016_12US2.108.9x12pe.2day.sharedvol.log
Note
In this run script, slurm or SBATCH requests 6 nodes, each node with 18 pes, or 6x18 = 108 pes
Verify that the NPCOL and NPROW settings in the script are configured to match what is being requested in the SBATCH commands that tell slurm how many compute nodes to provision. In this case, to run CMAQ on 108 CPUs (SBATCH --nodes=6 and --ntasks-per-node=18), use NPCOL=9 and NPROW=12.
grep NPCOL /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctm_2016_12US2.108pe.6x18.pcluster.csh
Output:
setenv NPCOL_NPROW "1 1"; set NPROCS = 1 # single processor setting
@ NPCOL = 9; @ NPROW = 12
@ NPROCS = $NPCOL * $NPROW
setenv NPCOL_NPROW "$NPCOL $NPROW";
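A quick sanity check (a sketch using csh arithmetic on the head node) is that NPCOL x NPROW equals nodes x ntasks-per-node; both echo commands below should print 108:
@ np = 9 * 12
echo $np
@ nt = 6 * 18
echo $nt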
Submit Job to Slurm Queue#
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/
sbatch run_cctm_2016_12US2.108pe.6x18.pcluster.csh
Check status of run#
squeue
Output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 queue1 CMAQ ubuntu PD 0:00 6 (BeginTime)
Successfully started run#
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 queue1 CMAQ ubuntu R 22:39 6 queue1-dy-compute-resource-1-[1-6]
Once the job is successfully running#
Check on the log file status
grep -i 'Processing completed.' CTM_LOG_001*_gcc_2016*
Output:
Processing completed... 6.5 seconds
Processing completed... 6.5 seconds
Processing completed... 6.5 seconds
Processing completed... 6.5 seconds
Processing completed... 6.4 seconds
Once the job has completed running the two day benchmark, check the log file for the timings.
tail -n 30 run_cctmv5.3.3_Bench_2016_12US2.108.9x12pe.2day.fsx_copied.log
Output:
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2015-12-22
End Day: 2015-12-23
Number of Simulation Days: 2
Domain Name: 12US2
Number of Grid Cells: 3409560 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 108
All times are in seconds.
Num Day Wall Time
01 2015-12-22 2421.19
02 2015-12-23 2144.16
Total Time = 4565.35
Avg. Time = 2282.67
Note
If you see the following message, you may want to submit a job that requires fewer PEs.
ip-10-0-5-165:/shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 queue1 CMAQ ubuntu PD 0:00 6 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
If you repeatedly see that the job is not successfully provisioned, cancel the job.#
To cancel the job use the following command
scancel 1
Try submitting a smaller job to the queue.#
sbatch run_cctm_2016_12US2.90pe.5x18.pcluster.csh
Check status of run#
squeue
Check to view any errors in the log on the parallel cluster#
vi /var/log/parallelcluster/slurm_resume.log
An error occurred (MaxSpotInstanceCountExceeded) when calling the RunInstances operation: Max spot instance count exceeded
Note
If you encounter this error, you will need to submit a request to increase this spot instance limit using the AWS Website.
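The current Spot quotas can also be listed and an increase requested from the command line (a sketch; substitute the quota code reported by the first command into the second, and note that the desired value of 256 is only an example):
aws service-quotas list-service-quotas --region us-east-1 --service-code ec2 --query "Quotas[?contains(QuotaName, 'Spot')].[QuotaName,QuotaCode,Value]" --output table
aws service-quotas request-service-quota-increase --region us-east-1 --service-code ec2 --quota-code L-xxxxxxxx --desired-value 256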
If the job will not run using SPOT pricing, then update the compute nodes to use ONDEMAND pricing#
To do this, exit the cluster, stop the compute nodes, then edit the yaml file to modify SPOT to ONDEMAND.
exit
On your local computer use the following command to stop the compute nodes
pcluster update-compute-fleet --region us-east-1 --cluster-name cmaq --status STOP_REQUESTED
Edit the yaml file to modify SPOT to ONDEMAND, then update the cluster using the following command:
pcluster update-cluster --region us-east-1 --cluster-name cmaq --cluster-configuration c5n-18xlarge.ebs_unencrypted_installed_public_ubuntu2004.fsx_import_opendata.yaml
Output:
{
"cluster": {
"clusterName": "cmaq",
"cloudformationStackStatus": "UPDATE_IN_PROGRESS",
"cloudformationStackArn": "xx-xxx-xx",
"region": "us-east-1",
"version": "3.1.1",
"clusterStatus": "UPDATE_IN_PROGRESS"
},
"changeSet": [
{
"parameter": "Scheduling.SlurmQueues[queue1].CapacityType",
"requestedValue": "ONDEMAND",
"currentValue": "SPOT" <<< Modify to use ONDEMAND
}
]
}
Check status of updated cluster
pcluster describe-cluster --region=us-east-1 --cluster-name cmaq
Output:
"clusterStatus": "UPDATE_IN_PROGRESS"
once you see
"clusterStatus": "UPDATE_COMPLETE"
Restart the compute nodes
pcluster update-compute-fleet --region us-east-1 --cluster-name cmaq --status START_REQUESTED
Verify that compute nodes have started
pcluster describe-cluster --region=us-east-1 --cluster-name cmaq
Output:
"computeFleetStatus": "RUNNING",
Re-login to the cluster
pcluster ssh -v -Y -i ~/your-key.pem --cluster-name cmaq
Submit a new job using the updated ondemand compute nodes#
sbatch run_cctm_2016_12US2.180pe.5x36.pcluster.csh
Note
If you still have difficulty running a job in the slurm queue, there may be other issues that need to be resolved.
Verify that your IAM Policy has been created for your account.
Someone with administrative permissions should enable the Spot Instances IAM policy: AWSEC2SpotServiceRolePolicy
An alternative way to enable this policy is to log in to the EC2 console and launch a Spot Instance; the service-linked role will be created automatically and can then be used by ParallelCluster.
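The Spot service-linked role can also be created directly with the AWS CLI (a sketch; this requires IAM permissions and only needs to be done once per account):
aws iam create-service-linked-role --aws-service-name spot.amazonaws.com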
Submit a 72 pe job 2 nodes x 36 cpus#
sbatch run_cctm_2016_12US2.72pe.2x36.pcluster.csh
grep -i 'Processing completed.' CTM_LOG_036.v533_gcc_2016_CONUS_6x12pe_20151223
Output:
Processing completed... 9.0 seconds
Processing completed... 12.0 seconds
Processing completed... 11.2 seconds
Processing completed... 9.0 seconds
Processing completed... 9.1 seconds
tail -n 20 run_cctmv5.3.3_Bench_2016_12US2.72.6x12pe.2day.pcluster.log
Output:
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2015-12-22
End Day: 2015-12-23
Number of Simulation Days: 2
Domain Name: 12US2
Number of Grid Cells: 3409560 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 72
All times are in seconds.
Num Day Wall Time
01 2015-12-22 3562.50
02 2015-12-23 3151.21
Total Time = 6713.71
Avg. Time = 3356.85
Submit a minimum of 2 benchmark runs#
Ideally, two CMAQ runs should be submitted to the slurm queue, using two different NPCOLxNPROW configurations, to create output needed for the QA and Post Processing Sections in Chapter 10.
CMAQv5.3.3 Parallel Cluster Benchmark on HPC6a-48xlarge with EBS and Lustre (optional)#
Run CMAQv5.3.3 on a ParallelCluster using pre-loaded software and input data on EBS and Lustre using HPC6a-48xlarge Parallel Cluster.
CMAQv5.3.3 CONUS 2 Benchmark Tutorial using 12US2 Domain
Use ParallelCluster pre-installed with CMAQv5.3.3 software and 12US2 Benchmark#
Step by step instructions for running the CMAQ 12US2 Benchmark for 2 days on a ParallelCluster.
Obtain YAML file pre-loaded with input data and software#
cd /your/local/machine/install/path/
git clone -b main https://github.com/CMASCenter/pcluster-cmaq.git pcluster-cmaq
cd pcluster-cmaq/yaml
Note
To find the default settings for Lustre see: Lustre Settings for ParallelCluster
Examine Diagram of the YAML file to build pre-installed software and input data.#
Includes Snapshot ID of volume pre-installed with CMAQ software stack and name of S3 Bucket to import data to the Lustre Filesystem
Figure 1. Diagram of YAML file used to configure a ParallelCluster with a c5n.large head node and c5n.18xlarge compute nodes with Software and Data Pre-installed (linked on lustre filesystem)
Edit Yaml file#
This yaml file specifies the /shared directory that contains CMAQv5.3.3 and its libraries, and the input data that will be imported from an S3 bucket to the /fsx lustre file system. Note, the following yaml file uses an hpc6a.48xlarge compute node and ONDEMAND pricing.
Note
Edit the hpc6a.48xlarge.ebs_unencrypted_installed_public_ubuntu2004.ebs_200.fsx_import_east-2b.yaml file to specify your subnet ID and your keypair prior to creating the cluster. To obtain the subnet ID, you will need to run pcluster configure.
pcluster configure -r us-east-2 --config hpc6a.48xlarge.ebs.fsx.us-east-2.yaml
Example of the answers that were used to create the yaml for this benchmark:
Allowed values for EC2 Key Pair Name:
1. xxx-xxx
2. xxx-xxx-xxx
EC2 Key Pair Name [xxx-xxx]: 1
Allowed values for Scheduler:
1. slurm
2. awsbatch
Scheduler [slurm]: 1
Allowed values for Operating System:
1. alinux2
2. centos7
3. ubuntu1804
4. ubuntu2004
Operating System [alinux2]: 4
Head node instance type [t2.micro]: c6a.xlarge
Number of queues [1]:
Name of queue 1 [queue1]:
Number of compute resources for queue1 [1]: 1
Compute instance type for compute resource 1 in queue1 [t2.micro]: hpc6a.48xlarge
The EC2 instance selected supports enhanced networking capabilities using Elastic Fabric Adapter (EFA). EFA enables you to run applications requiring high levels of inter-node communications at scale on AWS at no additional charge (https://docs.aws.amazon.com/parallelcluster/latest/ug/efa-v3.html).
Enable EFA on hpc6a.48xlarge (y/n) [y]: y
Maximum instance count [10]:
Enabling EFA requires compute instances to be placed within a Placement Group. Please specify an existing Placement Group name or leave it blank for ParallelCluster to create one.
Placement Group name []:
Automate VPC creation? (y/n) [n]: y
Allowed values for Availability Zone:
1. us-east-2b
Availability Zone [us-east-2b]:
Allowed values for Network Configuration:
1. Head node in a public subnet and compute fleet in a private subnet
2. Head node and compute fleet in the same public subnet
Network Configuration [Head node in a public subnet and compute fleet in a private subnet]: 2
Beginning VPC creation. Please do not leave the terminal until the creation is finalized
Creating CloudFormation stack...
Do not leave the terminal until the process has finished.
Status: parallelclusternetworking-pub-20230123170628 - CREATE_COMPLETE
The stack has been created.
Configuration file written to hpc6a.48xlarge.ebs.fsx.us-east-2.yaml
You can edit your configuration file or simply run 'pcluster create-cluster --cluster-configuration hpc6a.48xlarge.ebs.fsx.us-east-2.yaml --cluster-name cluster-name --region us-east-2' to create your cluster.
vi hpc6a.48xlarge.ebs_unencrypted_installed_public_ubuntu2004.ebs_200.fsx_import_east-2b.yaml
Output:
Region: us-east-2
Image:
Os: ubuntu2004
HeadNode:
InstanceType: c6a.xlarge
Networking:
SubnetId: subnet-xx-xx-xx <<< replace subnetID
DisableSimultaneousMultithreading: true
Ssh:
KeyName: your-key <<< replace keyname
LocalStorage:
RootVolume:
Encrypted: false
Scheduling:
Scheduler: slurm
SlurmSettings:
ScaledownIdletime: 5
SlurmQueues:
- Name: queue1
CapacityType: ONDEMAND
Networking:
SubnetIds:
- subnet-xx-xx-xxx <<< replace subnetID
PlacementGroup:
Enabled: true
ComputeResources:
- Name: compute-resource-1
InstanceType: hpc6a.48xlarge
MinCount: 0
MaxCount: 10
DisableSimultaneousMultithreading: true
Efa:
Enabled: true
GdrSupport: false
SharedStorage:
- MountDir: /shared
Name: ebs-shared
StorageType: Ebs
EbsSettings:
VolumeType: gp3
Size: 500
Encrypted: false
SnapshotId: snap-0f9592e0ea1749b5b
- MountDir: /fsx
Name: name2
StorageType: FsxLustre
FsxLustreSettings:
StorageCapacity: 1200
ImportPath: s3://cmas-cmaq-conus2-benchmark
Create CMAQ ParallelCluster with software/data pre-installed#
pcluster create-cluster --cluster-configuration hpc6a.48xlarge.ebs_unencrypted_installed_public_ubuntu2004.ebs_200.fsx_import_east-2b.yaml --cluster-name cmaq --region us-east-2
Output:
{
"cluster": {
"clusterName": "cmaq",
"cloudformationStackStatus": "CREATE_IN_PROGRESS",
"cloudformationStackArn": "arn:aws:cloudformation:us-east-2:440858712842:stack/cmaq/6cfb1a50-6e99-11ec-8af1-0ea2256597e5",
"region": "us-east-2",
"version": "3.0.2",
"clusterStatus": "CREATE_IN_PROGRESS"
}
}
Check status again
pcluster describe-cluster --region=us-east-2 --cluster-name cmaq
Output:
{
"creationTime": "2022-01-06T02:36:18.119Z",
"version": "3.0.2",
"clusterConfiguration": {
"url": "
},
"tags": [
{
"value": "3.0.2",
"key": "parallelcluster:version"
}
],
"cloudFormationStackStatus": "CREATE_IN_PROGRESS",
"clusterName": "cmaq",
"computeFleetStatus": "UNKNOWN",
"cloudformationStackArn":
"lastUpdatedTime": "2022-01-06T02:36:18.119Z",
"region": "us-east-2",
"clusterStatus": "CREATE_IN_PROGRESS"
}
Note, the snapshot image used is smaller than the EBS volume requested in the yaml file. Therefore you will get a warning from Parallel Cluster:
pcluster create-cluster --cluster-configuration hpc6a.48xlarge.ebs_unencrypted_installed_public_ubuntu2004.ebs_200.fsx_import_east-2b.yaml --cluster-name cmaq --region us-east-2
{
"cluster": {
"clusterName": "cmaq",
"cloudformationStackStatus": "CREATE_IN_PROGRESS",
"cloudformationStackArn": "arn:aws:cloudformation:us-east-2:440858712842:stack/cmaq/276abf10-94fc-11ed-885c-02032a236214",
"region": "us-east-2",
"version": "3.1.2",
"clusterStatus": "CREATE_IN_PROGRESS"
},
"validationMessages": [
{
"level": "WARNING",
"type": "EbsVolumeSizeSnapshotValidator",
"message": "The specified volume size is larger than snapshot size. In order to use the full capacity of the volume, you'll need to manually resize the partition according to this doc: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html"
}
]
}
After 5-10 minutes, check the status again and recheck until you see the following status: “clusterStatus”: “CREATE_COMPLETE”
Check status again
pcluster describe-cluster --region=us-east-2 --cluster-name cmaq
Output:
"cloudFormationStackStatus": "CREATE_COMPLETE",
"clusterName": "cmaq",
"computeFleetStatus": "RUNNING",
"cloudformationStackArn": "arn:aws:cloudformation:us-east-1:440858712842:stack/cmaq/3cd2ba10-c18f-11ec-9f57-0e9b4dd12971",
"lastUpdatedTime": "2022-04-21T16:22:28.879Z",
"region": "us-east-2",
"clusterStatus": "CREATE_COMPLETE"
Start the compute nodes, if the computeFleetStatus is not set to RUNNING
pcluster update-compute-fleet --region us-east-2 --cluster-name cmaq --status START_REQUESTED
Log into the new cluster#
Note
replace your-key.pem with your Key Name
pcluster ssh -v -Y -i ~/your-key.pem --region=us-east-2 --cluster-name cmaq
Resize the EBS Volume#
To resize the EBS volume, you will need to login to the cluster and then run the following command:
sudo resize2fs /dev/nvme1n1
output:
resize2fs 1.45.5 (07-Jan-2020)
Filesystem at /dev/nvme1n1 is mounted on /shared; on-line resizing required
old_desc_blocks = 5, new_desc_blocks = 63
The filesystem on /dev/nvme1n1 is now 131072000 (4k) blocks long.
Change shell to use tcsh#
sudo usermod -s /bin/tcsh ubuntu
Log out and then log back in to have the shell take effect.
Verify Software#
The software is pre-loaded on the /shared volume of the ParallelCluster. The software was previously loaded and saved to the snapshot.
ls /shared/build
Create a .cshrc file by copying it from the git repo that is on /shared/pcluster-cmaq
cp /shared/pcluster-cmaq/install/dot.cshrc.pcluster ~/.cshrc
Source shell
csh
Load the modules
module avail
Output:
------------------------------------------------------------ /usr/share/modules/modulefiles ------------------------------------------------------------
dot libfabric-aws/1.16.1amzn1.0 module-git module-info modules null openmpi/4.1.4 use.own
--------------------------------------------------------- /opt/intel/mpi/2021.6.0/modulefiles ----------------------------------------------------------
intelmpi
Load the modules openmpi and libfabric
module load openmpi/4.1.4
module load libfabric-aws/1.16.1amzn1.0
Verify Input Data#
The input data was imported from the S3 bucket to the lustre file system (/fsx).
cd /fsx/data/CMAQ_Modeling_Platform_2016/CONUS/12US2/
Notice that the data doesn't take up much space: only the objects are loaded, and the datasets will not be loaded to the /fsx volume until they are used, either by the run scripts or by using the touch command.
Note
More information about enhanced s3 integration for Lustre see: Enhanced S3 integration with lustre
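If you want to force the data to be loaded onto the Lustre file system ahead of the run, a sketch based on the FSx for Lustre preloading documentation is shown below; adjust the path as needed:
nohup find /fsx/data -type f -print0 | xargs -0 -n 1 sudo lfs hsm_restore &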
du -h
Output:
27K ./land
33K ./MCIP
28K ./emissions/ptegu
55K ./emissions/ptagfire
27K ./emissions/ptnonipm
55K ./emissions/ptfire_othna
27K ./emissions/pt_oilgas
26K ./emissions/inln_point/stack_groups
51K ./emissions/inln_point
28K ./emissions/cmv_c1c2_12
28K ./emissions/cmv_c3_12
28K ./emissions/othpt
55K ./emissions/ptfire
407K ./emissions
27K ./icbc
518K .
Change the group and ownership permissions on the /fsx/data directory
sudo chown ubuntu /fsx/data
sudo chgrp ubuntu /fsx/data
Create the output directory
mkdir -p /fsx/data/output
Examine CMAQ Run Scripts#
The run scripts are available in two locations: one copy is in the CMAQ scripts directory, and another is in the pcluster-cmaq repo. Do a git pull to obtain the latest scripts in the pcluster-cmaq repo.
cd /shared/pcluster-cmaq
git pull
Copy the run scripts from the repo. Note, there are different run scripts depending on what compute node is used. This tutorial assumes hpc6a.48xlarge is the compute node.
cp /shared/pcluster-cmaq/run_scripts/hpc6a_shared/*.pin.codemod.csh /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/
Note
The time that it takes the 2 day CONUS benchmark to run will vary based on the number of CPUs used, the compute node that is being used, and which disks are used for the I/O (EBS or Lustre). A benchmark scaling plot for hpc6a.48xlarge on /fsx and /shared is to be included here.
Examine how the run script is configured
head -n 30 /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctm_2016_12US2.576pe.6x96.24x24.pcluster.hpc6a.48xlarge.fsx.pin.codemod.csh
#!/bin/csh -f
## For hpc6a.48xlarge (96 cpu)
## works with cluster-ubuntu.yaml
## data on /fsx directory
#SBATCH --nodes=6
#SBATCH --ntasks-per-node=96
#SBATCH --exclusive
#SBATCH -J CMAQ
#SBATCH -o /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctmv5.3.3_Bench_2016_12US2.hpc6a.48xlarge.576.6x96.24x24pe.2day.pcluster.fsx.pin.codemod.log
#SBATCH -e /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctmv5.3.3_Bench_2016_12US2.hpc6a.48xlarge.576.6x96.24x24pe.2day.pcluster.fsx.pin.codemod.log
Note
In this run script, slurm or SBATCH requests 6 nodes, each node with 96 pes, or 6x96 = 576 pes
Verify that the NPCOL and NPROW settings in the script are configured to match what is being requested in the SBATCH commands that tell slurm how many compute nodes to provision. In this case, to run CMAQ on 576 CPUs (SBATCH --nodes=6 and --ntasks-per-node=96), use NPCOL=24 and NPROW=24.
grep NPCOL /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run_cctm_2016_12US2.576pe.6x96.24x24.pcluster.hpc6a.48xlarge.fsx.pin.codemod.csh
Output:
setenv NPCOL_NPROW "1 1"; set NPROCS = 1 # single processor setting
@ NPCOL = 24; @ NPROW = 24
@ NPROCS = $NPCOL * $NPROW
setenv NPCOL_NPROW "$NPCOL $NPROW";
To run on the EBS volume, a code modification is required.#
Note, we will use this modification when running on both Lustre and EBS.
Copy the BLD directory with a code modification to wr_conc.F and wr_aconc.F to your directory.
cp -rp /shared/pcluster-cmaq/run_scripts/BLD_CCTM_v533_gcc_codemod /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/
Build the code by running the makefile#
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/BLD_CCTM_v533_gcc_codemod
Check to see you have the modules loaded
module list
1) openmpi/4.1.1   2) libfabric-aws/1.13.2amzn1.0
Run the Make command
make
Verify that the executable has been created
ls -lrt CCTM_v533.exe
Submit Job to Slurm Queue to run CMAQ on Lustre#
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/
sbatch run_cctm_2016_12US2.576pe.6x96.24x24.pcluster.hpc6a.48xlarge.fsx.pin.codemod.csh
Check status of run#
squeue
Output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 queue1 CMAQ ubuntu PD 0:00 6 (BeginTime)
Successfully started run#
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 queue1 CMAQ ubuntu R 22:39 6 queue1-dy-compute-resource-1-[1-6]
Once the job is successfully running#
Check on the log file status
grep -i 'Processing completed.' CTM_LOG_001*_gcc_2016*
Output:
Processing completed... 6.5 seconds
Processing completed... 6.5 seconds
Processing completed... 6.5 seconds
Processing completed... 6.5 seconds
Processing completed... 6.4 seconds
Once the job has completed running the two day benchmark check the log file for the timings.
tail -n 5 run_cctmv5.3.3_Bench_2016_12US2.hpc6a.48xlarge.576.6x96.24x24pe.2day.pcluster.fsx.pin.codemod.2.log
Output:
Num Day Wall Time
01 2015-12-22 1028.33
02 2015-12-23 916.31
Total Time = 1944.64
Avg. Time = 972.32
Submit a run script to run on the EBS volume#
To run on the EBS volume, you need to copy the input data from the S3 bucket to the /shared volume. You don't want to copy directly from the /fsx volume, as this will copy more files than you need. The s3 copy script below copies only two days' worth of data from the S3 bucket; if you copy from the /fsx directory, you would be copying all of the files in the S3 bucket.
cd /shared/pcluster-cmaq/s3_scripts
./s3_copy_nosign_conus_cmas_opendata_to_shared.csh
Modify YAML and then Update Parallel Cluster.#
Note, not all settings in the yaml file can be updated; for some settings, such as using a different snapshot, you will need to terminate this cluster and create a new one.
If you want to edit the yaml file to update a setting such as the maximum number of compute nodes available, use the following command to stop the compute nodes
pcluster update-compute-fleet --region us-east-2 --cluster-name cmaq --status STOP_REQUESTED
Edit the yaml file to modify MaxCount under ComputeResources, then update the cluster using the following command:
pcluster update-cluster --region us-east-2 --cluster-name cmaq --cluster-configuration hpc6a.48xlarge.ebs_unencrypted_installed_public_ubuntu2004.ebs_200.fsx_import_east-2b.yaml
Output:
{
"cluster": {
"clusterName": "cmaq",
"cloudformationStackStatus": "UPDATE_IN_PROGRESS",
"cloudformationStackArn": "xx-xxx-xx",
"region": "us-east-2",
"version": "3.1.1",
"clusterStatus": "UPDATE_IN_PROGRESS"
},
"changeSet": [
{
"parameter": "Scheduling.SlurmQueues[queue1].ComputeResources[compute-resource-1].MaxCount",
"requestedValue": 15,
"currentValue": 10
}
]
}
Check status of updated cluster
pcluster describe-cluster --region=us-east-2 --cluster-name cmaq
Output:
"clusterStatus": "UPDATE_IN_PROGRESS"
once you see
"clusterStatus": "UPDATE_COMPLETE"
"clusterName": "cmaq",
"computeFleetStatus": "STOPPED",
"cloudformationStackArn": "arn:aws:cloudformation:us-east-2:440858712842:stack/cmaq2/d68e5180-9698-11ed-b06c-06cfae76125a",
"lastUpdatedTime": "2023-01-23T14:39:44.670Z",
"region": "us-east-2",
"clusterStatus": "UPDATE_COMPLETE"
}
Restart the compute nodes
pcluster update-compute-fleet --region us-east-2 --cluster-name cmaq --status START_REQUESTED
Verify that compute nodes have started
pcluster describe-cluster --region=us-east-2 --cluster-name cmaq
Output:
"computeFleetStatus": "RUNNING",
Re-login to the cluster
pcluster ssh -v -Y -i ~/your-key.pem --region=us-east-2 --cluster-name cmaq
Submit a new job using the updated compute nodes#
sbatch run_cctm_2016_12US2.576pe.6x96.24x24.pcluster.hpc6a.48xlarge.fsx.pin.codemod.csh
Note
If you still have difficulty running a job in the slurm queue, there may be other issues that need to be resolved.
Submit a minimum of 2 benchmark runs#
Ideally, two CMAQ runs should be submitted to the slurm queue, using two different NPCOLxNPROW configurations, to create output needed for the QA and Post Processing Sections in Chapter 6.
Upgrade the pcluster version to try the Persistent 2 Lustre Filesystem#
/Users/lizadams/apc-ve/bin/python3 -m pip install --upgrade pip
python3 -m pip install --upgrade "aws-parallelcluster"
Create a new configuration file
pcluster configure -r us-east-2 --config hpc6a.48xlarge.ebs.fsx.us-east-2.yaml
Getting a CREATE_FAILED error message
Query the stack formation log messages#
pcluster get-cluster-stack-events --cluster-name cmaq2 --region us-east-2 --query 'events[?resourceStatus==`CREATE_FAILED`]'
Output
"eventId": "FSX39ea84acf1fef629-CREATE_FAILED-2023-01-23T17:14:19.869Z",
"physicalResourceId": "",
"resourceStatus": "CREATE_FAILED",
"resourceStatusReason": "Linking a Persistent 2 file system to an S3 bucket using the LustreConfiguraton is not supported. Create a file system and then create a data repository association to link S3 buckets to the file system. For more details, visit https://docs.aws.amazon.com/fsx/latest/LustreGuide/create-dra-linked-data-repo.html (Service: AmazonFSx; Status Code: 400; Error Code: BadRequest; Request ID: dd4df24a-0eed-4e94-8205-a9d5a9605aae; Proxy: null)",
"resourceProperties": "{\"FileSystemTypeVersion\":\"2.12\",\"StorageCapacity\":\"1200\",\"FileSystemType\":\"LUSTRE\",\"LustreConfiguration\":{\"ImportPath\":\"s3://cmas-cmaq-conus2-benchmark\",\"DeploymentType\":\"PERSISTENT_2\",\"PerUnitStorageThroughput\":\"1000\"},\"SecurityGroupIds\":[\"sg-00ab9ad20ea71b395\"],\"SubnetIds\":[\"subnet-02800a67052ad340a\"],\"Tags\":[{\"Value\":\"name2\",\"Key\":\"Name\"}]}",
"stackId": "arn:aws:cloudformation:us-east-2:440858712842:stack/cmaq2/561cc920-9b41-11ed-a8d2-0a9db28fc6a2",
"stackName": "cmaq2",
"logicalResourceId": "FSX39ea84acf1fef629",
"resourceType": "AWS::FSx::FileSystem",
"timestamp": "2023-01-23T17:14:19.869Z"
It is not clear what the best way is to set the VPC and security groups: do you match the ParallelCluster settings, or, since the ParallelCluster failed to build with the Persistent 2 Lustre settings, do you create a new VPC and modify the yaml so that the ParallelCluster uses the VPC settings established when you create the Lustre filesystem?
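Based on the error message above, the data repository association would be created separately after the Persistent 2 file system exists. A sketch of the call is shown below; the file system ID and file system path are placeholders:
aws fsx create-data-repository-association --region us-east-2 --file-system-id fs-xxxxxxxxxxxxxxxxx --file-system-path /conus --data-repository-path s3://cmas-cmaq-conus2-benchmark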
Performance and Cost Optimization#
Timing information and scaling plots to assist users in optimizing the performance of their parallel cluster.
Performance Optimization
Right-sizing Compute Nodes for the ParallelCluster Configuration#
Selection of the compute nodes depends on the domain size and resolution for the CMAQ case, and on your model run time requirements. Larger hardware and memory configurations may also be required for instrumented versions of CMAQ, including CMAQ-ISAM and CMAQ-DDM3D. The ParallelCluster allows you to run the compute nodes only as long as the job requires, and you can also update the compute nodes as needed for your domain.
An explanation of why a scaling analysis is required for Multinode or Parallel MPI Codes#
Quote from the following link.
“IMPORTANT: The optimal value of –nodes and –ntasks for a parallel code must be determined empirically by conducting a scaling analysis. As these quantities increase, the parallel efficiency tends to decrease. The parallel efficiency is the serial execution time divided by the product of the parallel execution time and the number of tasks. If multiple nodes are used then in most cases one should try to use all of the CPU-cores on each node.”
Note
For the scaling analysis that was performed with CMAQ, the parallel efficiency was determined as the runtime for the smallest number of CPUs divided by the product of the parallel execution time and the factor by which the number of CPUs was increased. If the smallest NPCOLxNPROW configuration used 18 CPUs, the run time for that case was used as the baseline, and the parallel efficiency for the case using 36 CPUs would be: parallel efficiency = runtime_18cpu / (runtime_36cpu * 2) * 100.
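As a worked example using the total wall times from Table 2 below (12548.19 seconds on 36 CPUs and 6713.71 seconds on 72 CPUs, i.e. twice as many CPUs), the parallel efficiency is 12548.19 / (6713.71 * 2) * 100, or roughly 93%. A one-line check of the arithmetic:
awk 'BEGIN { printf "%.1f%%\n", 12548.19 / (6713.71 * 2) * 100 }'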
Slurm Compute Node Provisioning#
AWS ParallelCluster relies on SLURM to make the job allocation and scaling decisions. The jobs are launched, terminated, and resources maintained according to the Slurm instructions in the CMAQ run script. The YAML file for Parallel Cluster is used to set the identity of the head node and the compute node, and the maximum number of compute nodes that can be submitted to the queue. The head node can’t be updated after a cluster is created. The compute nodes, and the maximum number of compute nodes can be updated after a cluster is created.
The number of compute nodes dispatched by the slurm scheduler is specified in the run script using #SBATCH --nodes=XX and #SBATCH --ntasks-per-node=YY, where the maximum value of tasks per node (YY) is limited by how many CPUs are on the compute node.
As an example:
For c5n.18xlarge, there are 36 CPUs/node, so the maximum value of YY is 36, or --ntasks-per-node=36.
If running a job with 180 processors, this would require --nodes=XX or XX to be set to 5 compute nodes, as 36x5=180.
The setting for NPCOLxNPROW must also multiply to 180, i.e., 18 x 10 or 10 x 18, to use all of the CPUs in the parallel cluster.
For c5n.9xlarge, there are 18 CPUs/node, so the maximum value of YY is 18, or --ntasks-per-node=18.
If running a job with 180 processors, this would require --nodes=XX or XX to be set to 10 compute nodes, as 18x10=180 (see the sketch below).
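A sketch of the matching slurm header and domain decomposition for the c5n.9xlarge example above (180 PEs on 10 nodes; the 18 x 10 split of NPCOL x NPROW is one valid choice):
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=18
@ NPCOL = 18; @ NPROW = 10
setenv NPCOL_NPROW "$NPCOL $NPROW"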
Note
If you submit a slurm job requesting more nodes than are available in the region, then you will get the following message under NODELIST(REASON) when you use the squeue command: (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions). In the scaling tables below, this is indicated as "Unable to provision".
See also
Quoted from the above link:
“Each vCPU is a hardware hyperthread on the Intel Xeon Platinum 8000 series processor. You get full control over the C-states on the two largest sizes, allowing you to run a single core at up to 3.5 Ghz using Intel Turbo Boost Technology. The C5n instances also feature a higher amount of memory per core, putting them in the current “sweet spot” for HPC applications that work most efficiently when there’s at least 4 GiB of memory for each core. The instances also benefit from some internal improvements that boost memory access speed by up to 19% in comparison to the C5 and C5d instances. The C5n instances incorporate the fourth generation of our custom Nitro hardware, allowing the high-end instances to provide up to 100 Gbps of network throughput, along with a higher ceiling on packets per second. The Elastic Network Interface (ENI) on the C5n uses up to 32 queues (in comparison to 8 on the C5 and C5d), allowing the packet processing workload to be better distributed across all available vCPUs.”
Resources specified in the YAML file:
Ubuntu2004
Disable Simultaneous Multi-threading
Spot Pricing
Shared EBS filesystem to install software
1.2 TiB shared Lustre file system with imported S3 Bucket (1.2 TiB is the minimum storage capacity that you can specify for the Lustre file system) mounted as /fsx, or a 500 GB EBS volume mounted as /shared/data
Slurm Placement Group enabled
Elastic Fabric Adapter Enabled on c5n.9xlarge and c5n.18xlarge
See also
Note
Pricing information in the tables below are subject to change. The links from which this pricing data was collected are listed below.
See also
See also
See also
Spot versus On-Demand Pricing#
Table 1. EC2 Instance On-Demand versus Spot Pricing (price is subject to change)
Instance Name | vCPUs | RAM | EBS Bandwidth | Network Bandwidth | Linux On-Demand Price | Linux Spot Price
---|---|---|---|---|---|---
c4.large | 2 | 3.75 GiB | Moderate | 500 Mbps | $0.116/hour | $0.0312/hour
c4.8xlarge | 36 | 60 GiB | 10 Gbps | 4,000 Mbps | $1.856/hour | $0.5903/hour
c5n.large | 2 | 5.25 GiB | Up to 3.5 Gbps | Up to 25 Gbps | $0.108/hour | $0.0324/hour
c5n.xlarge | 4 | 10.5 GiB | Up to 3.5 Gbps | Up to 25 Gbps | $0.216/hour | $0.0648/hour
c5n.2xlarge | 8 | 21 GiB | Up to 3.5 Gbps | Up to 25 Gbps | $0.432/hour | $0.1740/hour
c5n.4xlarge | 16 | 42 GiB | 3.5 Gbps | Up to 25 Gbps | $0.864/hour | $0.2860/hour
c5n.9xlarge | 36 | 96 GiB | 7 Gbps | 50 Gbps | $1.944/hour | $0.5971/hour
c5n.18xlarge | 72 | 192 GiB | 14 Gbps | 100 Gbps | $3.888/hour | $1.1732/hour
c6gn.16xlarge | 64 | 128 GiB | | 100 Gbps | $2.7648/hour | $0.6385/hour
c6a.48xlarge | 192 | 384 GiB | 40 Gbps | 50 Gbps | $7.344/hour | $6.0793/hour
hpc6a.48xlarge | 96 | 384 GiB | | 100 Gbps | $2.88/hour | unavailable
hpc7g.16xlarge | 64 | 128 GiB | | | $1.6832/hour | unavailable
*Hpc6a instances have simultaneous multi-threading disabled to optimize for HPC codes. This means that, unlike other EC2 instances, Hpc6a vCPUs are physical cores, not threads.
*Hpc6a instances are available in US East (Ohio) and GovCloud (US-West).
*Hpc6a is available On-Demand only (no Spot pricing).
Using c5n.18xlarge as the compute node, it costs ($3.888/hr)/($1.1732/hr) = 3.31 times as much to run On-Demand versus Spot; the savings is roughly 70% for Spot versus On-Demand pricing.
Using c5n.9xlarge as the compute node, it costs ($1.944/hr)/($0.5971/hr) = 3.26 times as much to run On-Demand versus Spot; the savings is roughly 70% for Spot versus On-Demand pricing.
Using c6gn.16xlarge as the compute node, it costs ($2.7648/hr)/($0.6385/hr) = 4.33 times as much to run On-Demand versus Spot; the savings is roughly 77% for Spot versus On-Demand pricing for this instance type.
Note
Sometimes the nodes are not available at Spot pricing in the region you are using. If this is the case, the job will not start running in the queue; see the AWS ParallelCluster Troubleshooting documentation.
Benchmark Timings for CMAQv5.3.3 12US2 Benchmark#
Benchmarks were performed using c5n.18xlarge (36 cores per node), c5n.9xlarge (18 cores per node), c6a.48xlarge (96 cores per node), and hpc6a.48xlarge (96 cores per node) compute nodes.
Benchmark Timing for c5n.18xlarge#
Table 2. Timing Results for CMAQv5.3.3 2 Day CONUS2 Run on ParallelCluster with c5n.large head node and C5n.18xlarge Compute Nodes
Note: for the c5n.18xlarge runs, I/O was done using /fsx; the InputData column indicates whether the input data was copied to /fsx or imported to /fsx from the S3 bucket.
| CPUs | NodesxCPU | COLROW | Day1 Timing (sec) | Day2 Timing (sec) | TotalTime | CPU Hours/day | SBATCHexclusive | InputData | Disable Simultaneous Multithreading (yaml) | with -march=native | Equation using Spot Pricing | SpotCost | Equation using On Demand Pricing | OnDemandCost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36 | 1x36 | 6x6 | 6726.72 | 5821.47 | 12548.19 | 1.74 | yes | imported | true | yes | 1.1732/hr * 1 node * 3.486 hr = | 4.09 | 3.888/hr * 1 node * 3.496 hr = | 13.59 |
| 72 | 2x36 | 6x12 | 3562.50 | 3151.21 | 6713.71 | .93 | yes | imported | true | yes | 1.1732/hr * 2 nodes * 1.8649 hr = | 4.37 | 3.888/hr * 2 nodes * 1.8649 = | 14.5 |
| 72 | 2x36 | 8x9 | 3665.65 | 3159.12 | 6824.77 | .95 | yes | imported | true | yes | 1.1732/hr * 2 nodes * 1.896 hr = | 4.45 | 3.888/hr * 2 nodes * 1.896 = | 14.7 |
| 72 | 2x36 | 9x8 | 3562.61 | 2999.69 | 6562.30 | .91 | yes | imported | true | yes | 1.1732/hr * 2 nodes * 1.822 hr = | 4.28 | 3.888/hr * 2 nodes * 1.822 = | 14.16 |
| 108 | 3x36 | 6x18 | 2415.46 | 2135.26 | 4550.72 | .63 | yes | imported | true | yes | 1.1732/hr * 3 nodes * 1.26 hr = | 4.45 | 3.888/hr * 3 nodes * 1.26 = | 14.7 |
| 108 | 3x36 | 12x9 | 2758.01 | 2370.92 | 5128.93 | .71 | yes | imported | true | yes | 1.1732/hr * 3 nodes * 1.42 hr = | 5.01 | 3.888/hr * 3 nodes * 1.42 hr = | 16.6 |
| 108 | 3x36 | 9x12 | 2454.11 | 2142.11 | 4596.22 | .638 | yes | imported | true | yes | 1.1732/hr * 3 nodes * 1.276 = | 4.49 | 3.888/hr * 3 nodes * 1.276 hr = | 14.88 |
| 180 | 5x36 | 10x18 | 2481.55 | 2225.34 | 4706.89 | .65 | no | copied | false | yes | 1.1732/hr * 5 nodes * 1.307 hr = | 7.66 | 3.888/hr * 5 nodes * 1.307 hr = | 25.4 |
| 180 | 5x36 | 10x18 | 2378.73 | 2378.73 | 4588.92 | .637 | no | copied | true | yes | 1.1732/hr * 5 nodes * 1.2747 hr = | 7.477 | 3.888/hr * 5 nodes * 1.2747 hr = | 24.77 |
| 180 | 5x36 | 10x18 | 1585.67 | 1394.52 | 2980.19 | .41 | yes | imported | true | yes | 1.1732/hr * 5 nodes * 2980.9 / 3600 = | 4.85 | 3.888/hr * 5 nodes * .82 hr = | 16.05 |
| 256 | 8x32 | 16x16 | 1289.59 | 1164.53 | 2454.12 | .34 | no | copied | true | yes | 1.1732/hr * 8 nodes * 2454.12 / 3600 = | $6.398 | 3.888/hr * 8 nodes * .6817 hr = | 21.66 |
| 256 | 8x32 | 16x16 | 1305.99 | 1165.30 | 2471.29 | .34 | yes | copied | true | yes | 1.1732/hr * 8 nodes * 2471.29 / 3600 = | 6.44 | 3.888/hr * 8 nodes * .686 hr = | 21.11 |
| 256 | 8x32 | 16x16 | 1564.90 | 1381.80 | 2946.70 | .40 | yes | imported | true | yes | 1.1732/hr * 8 nodes * 2946.7 / 3600 = | 7.68 | 3.888/hr * 8 nodes * .818 hr = | 25.45 |
| 288 | 8x36 | 16x18 | 1873.00 | 1699.24 | 3572.2 | .49 | no | copied | false | yes | 1.1732/hr * 8 nodes * 3572.2/3600 = | 9.313 | 3.888/hr * 8 nodes * .992 hr = | 30.8 |
| 288 | 8x36 | 16x18 | 1472.69 | 1302.84 | 2775.53 | .385 | yes | imported | true | yes | 1.1732/hr * 8 nodes * .771 = | 7.24 | 3.888/hr * 8 nodes * .771 = | 23.98 |
| 288 | 8x36 | 16x18 | 1976.35 | 1871.61 | 3847.96 | .53 | no | copied | true | yes | 1.1732/hr * 8 nodes * 1.069 = | 10.0 | 3.888/hr * 8 nodes * 1.069 = | 33.24 |
| 288 | 8x36 | 16x18 | 1197.19 | 1090.45 | 2287.64 | .31 | yes | copied | true | yes (16x18 matched 16x16) | 1.1732/hr * 8 nodes * .635 = | 5.96 | 3.888/hr * 8 nodes * .635 = | 19.76 |
| 288 | 8x36 | 18x16 | 1206.01 | 1095.76 | 2301.77 | .32 | yes | imported | true | yes | 1.1732/hr * 8 nodes * 2301.77 = | 6.00 | 3.888/hr * 8 nodes * .639 = | 19.88 |
| 360 | 10x36 | 18x20 | Unable to provision | | | | | | | | | | | |
Benchmark Timing for c5n.9xlarge#
Table 3. Timing Results for CMAQv5.3.3 2 Day CONUS2 Run on ParallelCluster with c5n.large head node and C5n.9xlarge Compute Nodes
| CPUs | NodesxCPU | COLROW | Day1 Timing (sec) | Day2 Timing (sec) | TotalTime | CPU Hours/day | SBATCHexclusive | Disable Simultaneous Multithreading (yaml) | with -march=native | InputData | Equation using Spot Pricing | SpotCost | Equation using On Demand Pricing | OnDemandCost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 1x18 | 3x6 | 14341.77 | 12881.59 | 27223.36 | 3.78 | yes | true | no | /fsx | 0.5971/hr * 1 node * 7.56 hr = | 4.51 | 1.944/hr * 1 node * 7.56 hr = | 14.69 |
| 18 | 1x18 | 3x6 | 12955.32 | 11399.07 | 24354.39 | 3.38 | yes | true | no | /shared | 0.5971/hr * 1 node * 6.76 hr = | 4.03 | 1.944/hr * 1 node * 6.76 = | 13.15 |
| 18 | 1x18 | 6x3 | 13297.84 | 11491.99 | 24789.83 | 3.44 | yes | true | no | /shared | 0.5971/hr * 1 node * 6.89 hr = | 4.11 | 1.944/hr * 1 node * 6.89 = | 13.39 |
| 36 | 2x18 | 6x6 | 6473.95 | 5599.76 | 12073.71 | 1.67 | yes | true | no | /shared | 0.5971/hr * 2 node * 3.35 hr = | 4.0 | 1.944/hr * 2 node * 3.35 hr = | 13.02 |
| 54 | 3x18 | 6x9 | 4356.33 | 3790.13 | 8146.46 | 1.13 | yes | true | no | /shared | 0.5971/hr * 3 node * 2.26 hr = | 4.05 | 1.944/hr * 3 node * 2.26 hr = | 13.2 |
| 54 | 3x18 | 9x6 | 4500.29 | 3876.76 | 8377.05 | 1.16 | yes | true | no | /shared | 0.5971/hr * 3 node * 2.33 hr = | 4.17 | 1.944/hr * 3 node * 2.33 = | 13.58 |
| 72 | 4x18 | 8x9 | 3382.01 | 2936.66 | 6318.67 | .8775 | yes | true | no | /shared | 0.5971/hr * 4 node * 1.755 hr = | 4.19 | 1.944/hr * 4 node * 1.755 hr = | 13.2 |
| 90 | 5x18 | 9x10 | 2878.55 | 2483.56 | 5362.11 | .745 | yes | true | no | /shared | 0.5971/hr * 5 node * 1.49 hr = | 4.45 | 1.944/hr * 5 node * 1.49 hr = | 14.44 |
| 108 | 6x18 | 9x12 | 2463.41 | 2161.07 | 4624.48 | .642 | yes | true | no | /shared | 0.5971/hr * 6 node * 1.28 hr = | 4.6 | 1.944/hr * 6 node * 1.28 hr = | 14.9 |
| 108 | 6x18 | 9x12 | 2713.95 | 2338.09 | 5052.04 | .702 | yes | true | no | /fsx linked | 0.5971/hr * 6 node * 1.40 hr = | 5.03 | 1.944/hr * 6 node * 1.40 hr = | |
| 108 | 6x18 | 9x12 | 2421.19 | 2144.16 | 4565.35 | .634 | yes | true | no | /fsx copied | 0.5971/hr * 6 node * 1.27 = | 4.54 | 1.944/hr * 6 node * 1.27 hr = | |
| 126 | 7x18 | 9x14 | 2144.86 | 1897.85 | 4042.71 | .56 | yes | true | no | /shared | 0.5971/hr * 7 node * 1.12 hr = | 4.69 | 1.944/hr * 7 node * 1.12 hr = | 15.24 |
| 144 | 8x18 | 12x12 | unable to provision | | | | | | | | | | | |
| 162 | 9x18 | 9x18 | unable to provision | | | | | | | | | | | |
| 180 | 10x18 | 10x18 | unable to provision | | | | | | | | | | | |
Benchmark Timing for hpc6a.48xlarge#
Table 4. Timing Results for CMAQv5.3.3 2 Day CONUS 2 Run on Parallel Cluster with c6a.xlarge head node and hpc6a.48xlarge Compute Nodes
| CPUs | NodesxCPU | COLROW | Day1 Timing (sec) | Day2 Timing (sec) | TotalTime | CPU Hours/day | SBATCHexclusive | Disable Simultaneous Multithreading (yaml) | with -march=native | With Pinning | InputData | Equation using Spot Pricing | SpotCost | Equation using On Demand Pricing | OnDemandCost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 96 | 1x96 | 12x8 | 2815.56 | 2368.43 | 5183.99 | .71 | yes | N/A | no | no | /fsx linked ? | ?/hr * 1 node * 1.44 = | ? | 2.88/hr * 1 node * 1.44 = | 4.147 |
| 96 | 1x96 | 12x8 | 2715.78 | 2318.15 | 5033.93 | .699 | yes | N/A | no | yes | /fsx linked ? | ?/hr * 1 node * 1.39 = | ? | 2.88/hr * 1 node * 1.39 = | 4.03 |
| 192 | 2x96 | 16x12 | 1586.15 | 1448.35 | 3034.50 | .421 | yes | N/A | no | no | /fsx linked? | ?/hr * 1 node * .842 = | ? | 2.88/hr * 2 node * .842 = | 4.84 |
| 192 | 2x96 | 16x12 | 1576.05 | 1447.76 | 3023.81 | .419 | yes | N/A | no | yes | /fsx linked? | ?/hr * 1 node * .839 = | ? | 2.88/hr * 2 node * .839 = | 4.83 |
| 288 | 3x96 | 16x18 | 1282.31 | 1189.40 | 2471.71 | .343 | yes | N/A | no | no | /fsx linked? | ?/hr * 1 node * .842 = | ? | 2.88/hr * 3 node * .686 = | 5.93 |
| 288 | 3x96 | 16x18 | 1377.44 | 1223.15 | 2600.59 | .361 | yes | N/A | no | yes | /fsx linked? | ?/hr * 1 node * .842 = | ? | 2.88/hr * 3 node * .722 = | 6.24 |
| 384 | 4x96 | 24x16 | 1211.88 | 1097.68 | 2309.56 | .321 | yes | N/A | no | no | /fsx linked? | ?/hr * 1 node * .642 = | ? | 2.88/hr * 4 node * .642 = | 7.39 |
| 384 | 4x96 | 24x16 | 1246.72 | 1095.40 | 2342.12 | .325 | yes | N/A | no | yes | /fsx linked? | ?/hr * 1 node * .650 = | ? | 2.88/hr * 4 node * .650 = | 7.49 |
| 480 | 5x96 | 24x20 | 1120.61 | 1010.33 | 2130.94 | .296 | yes | N/A | no | no | /fsx linked? | ?/hr * 1 node * .592 = | ? | 2.88/hr * 5 node * .592 = | 8.52 |
| 480 | 5x96 | 24x20 | 1114.46 | 1017.47 | 2131.93 | .296 | yes | N/A | no | yes | /fsx linked? | ?/hr * 1 node * .592 = | ? | 2.88/hr * 5 node * .592 = | 8.52 |
| 576 | 6x96 | 24x24 | 1041.13 | 952.11 | 1993.24 | .277 | yes | N/A | no | yes | /fsx linked? | ?/hr * 1 node * .553 = | ? | 2.88/hr * 6 node * .553 = | 9.57 |
| 576 | 6x96 | 24x24 | 1066.59 | 955.88 | 2022.47 | .281 | yes | N/A | no | yes | /fsx linked? | ?/hr * 1 node * .561 = | ? | 2.88/hr * 6 node * .561 = | 9.71 |
Benchmark Timing for c6a.48xlarge#
Table 5. Timing Results for CMAQv5.3.3 2 Day CONUS 2 Run on Parallel Cluster with c6a.xlarge head node and c6a.48xlarge Compute Nodes
| CPUs | NodesxCPU | COLROW | Day1 Timing (sec) | Day2 Timing (sec) | TotalTime | CPU Hours/day | SBATCHexclusive | Disable Simultaneous Multithreading (yaml) | with -march=native | With Pinning | InputData | Equation using Spot Pricing | SpotCost | Equation using On Demand Pricing | OnDemandCost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 96 | 1x96 | 12x8 | 2996.56 | 2556.50 | 5553.06 | .771 | yes | N/A | no | no | /fsx linked ? | ?/hr * 1 node * 1.54 = | ? | 7.344/hr * 1 node * 1.54 = | 11.33 |
| 96 | 1x96 | 12x8 | 2786.72 | 2374.83 | 5161.55 | .716 | yes | N/A | no | yes | /fsx linked ? | ?/hr * 1 node * 1.43 = | ? | 7.344/hr * 2 node * 1.43 = | 21.0 |
| 192 | 2x96 | 16x12 | 1643.19 | 1491.94 | 3135.13 | .435 | yes | N/A | no | yes | /fsx linked ? | ?/hr * 1 node * .87 = | ? | 7.344/hr * 2 node * .87 = | 12.8 |
| 192 | 3x64 | 16x12 | 1793.09 | 1586.95 | 3380.04 | .469 | yes | N/A | no | yes | /fsx linked ? | ?/hr * 1 node * .94 = | ? | 7.344/hr * 3 node * .94 = | 20.68 |
| 288 | 3x96 | 16x18 | 1287.99 | 1177.42 | 2465.41 | .342 | yes | N/A | no | yes | /fsx linked ? | ?/hr * 1 node * .684 = | ? | 7.344/hr * 3 node * .684 = | 15.09 |
| 288 | 3x96 | 16x18 | 1266.97 | 1201.90 | 2468.87 | .342 | yes | N/A | no | yes | /fsx linked ? | ?/hr * 1 node * .684 = | ? | 7.344/hr * 3 node * .684 = | 15.09 |
Benchmark Scaling Plots for CMAQv5.3.3 12US2 Benchmark#
Benchmark Scaling Plot for c5n.18xlarge#
Figure 1. Scaling per Node on C5n.18xlarge Compute Nodes (36 cpu/node)
Note, several timings were obtained using 8 nodes. The 288-CPU timings fully utilized the 36-PE nodes (8x36 = 288 CPUs), and different NPCOLxNPROW options were used: 16x18 and 18x16. The 256-CPU timings used an NPCOLxNPROW configuration of 16x16; this configuration doesn't fully utilize all of the CPUs/node, so the efficiency per node is lower and the cost is higher. It is best to select NPCOLxNPROW settings that fully utilize all of the CPUs requested in the SBATCH commands.
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=36
Figure 2. Scaling per CPU on c5n.18xlarge compute node
Note, poor performance was obtained for the runs using 180 processors when the SBATCH --exclusive option was not used. After this finding, the CMAQ run scripts were modified to always use this option. The benchmark runs that were done on c5n.9xlarge used the SBATCH --exclusive option.
Investigation of the difference in total run times between the 12x9 and the 9x12 and 6x18 NPCOLxNPROW configurations#
A comparison of the log files (sdiff run_cctmv5.3.3_Bench_2016_12US2.108.12x9pe.2day.pcluster.log run_cctmv5.3.3_Bench_2016_12US2.108.9x12pe.2day.pcluster.log) revealed that the CPU speed for the ParallelCluster run of the 12x9 benchmark case was slower than the CPU speed for the 9x12 benchmark case. See the following section for details: Comparison of log files for 12x9 versus 9x12 Benchmark runs.
The scaling efficiency using 5 nodes of 36 cpus/node = 180 cpus was 84%.
The scaling efficiency dropped to 68% when using 8 nodes of 36 cpus/node = 288 cpus.
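These efficiencies can be reproduced from the totals in Table 2 as (single-node total time) / (N x N-node total time); a quick check with bc, using the 1-node, fastest 5-node, and fastest 8-node totals from the table above:
# scaling efficiency = T(1 node) / (N * T(N nodes))
echo "12548.19 / (5 * 2980.19)" | bc -l    # ~0.84 on 5 nodes
echo "12548.19 / (8 * 2287.64)" | bc -l    # ~0.69 on 8 nodes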
Figure 3. Scaling per Node on C5n.9xlarge Compute Nodes (18 cpu/node)
Scaling is very good for the c5n.9xlarge compute nodes up to 7 nodes, the largest number of nodes that could be provisioned at the time this benchmark was performed.
Figure 4. Scaling per CPU on C5n.9xlarge Compute Node (18 cpu/node)
Scaling is also good when compared to the number of cpus used. Note that all benchmark runs performed using the c5n.9xlarge compute nodes fully utilized the number of cpus available on a node.
The scaling efficiency using 7 nodes of 18 cpus/node = 126 cpus was 86%.
Benchmark Scaling Plot for c5n.18xlarge and c5n.9xlarge#
Figure 5 shows the scaling per-node, as the configurations that were run were multiples of the number of cpus per node. CMAQ was not run on a single cpu, as this would have been costly and inefficient.
Figure 5. Scaling on C5n.9xlarge (18 cpu/node) and C5n.18xlarge Compute Nodes (36 cpu/node)
Total Time and Cost versus CPU Plot for c5n.18xlarge#
Figure 6 shows the timings for many of the configuration options listed in the table above for the c5n.18xlarge cluster. Running with hyperthreading disabled, SBATCH --exclusive set, and a placement group enabled resulted in the fastest timings.
Additional benchmark runs may be needed to determine the impact on performance of linking the input data on the Lustre file system versus copying the data to Lustre, and/or of using the /shared EBS volume for I/O.
Figure 6. Plot of Total Time and On Demand Cost versus CPUs for c5n.18xlarge
Total Time and Cost versus CPU Plot for c5n.9xlarge#
Figure 7 shows how the total run time and On-Demand cost vary as additional CPUs are used. Note that the run script and YAML settings for the c5n.9xlarge were optimized for running CMAQ on the cluster.
Figure 7. Plot of Total Time and On Demand Cost versus CPUs for c5n.9xlarge
Total Time and Cost versus CPU Plot for both c5n.18xlarge and c5n.9xlarge#
Figure 8. Plot of Total Time and On Demand Cost versus CPUs for both c5n.18xlarge and c5n.9xlarge
Total Time and Cost versus CPU Plot for hpc6a.48xlarge#
Figure 9 shows how the total run time and On-Demand cost vary as additional CPUs are used. Note that the run script and YAML settings for the hpc6a.48xlarge were optimized for running CMAQ on the cluster.
Figure 9. Plot of Total Time and On Demand Cost versus CPUs for hpc6a.48xlarge
Cost Information#
Cost information is available within the AWS Web Console for your account as you use resources, and there are also ways to forecast your costs using the pricing information available from AWS.
Cost Explorer#
Example screenshots of the AWS Cost Explorer graphs were obtained after running several of the CMAQ benchmarks, varying the number of nodes, the number of CPUs, and NPCOL/NPROW. These costs are from a two-day session of running CMAQ on the ParallelCluster and should only be used to understand the relative cost of the EC2 instances (head node and compute nodes) compared to the storage and network costs.
In Figure 10, the Cost Explorer display shows the cost of different EC2 instance types; note that c5n.18xlarge is the highest cost, as these are used as the compute nodes.
Figure 10. Cost by Instance Type - AWS Console
In Figure 11, the Cost Explorer displays a graph of the cost categorized by usage: Spot or On-Demand EC2 usage, NAT Gateway, or timed storage. Note: spot-c5n.18xlarge is the highest-cost resource, but other resources such as storage on the EBS volume and the network (NAT Gateway and subnets) also incur costs.
Figure 11. Cost by Usage Type - AWS Console
In Figure 12, the Cost Explorer display shows the cost by service, including EC2 instances, S3 buckets, and the FSx Lustre file system.
Figure 12. Cost by Service Type - AWS Console
Compute Node Cost Estimate#
Head node (c5n.large) compute cost = entire time that the ParallelCluster is running (creation to deletion) = 6 hours * $0.0324/hr = $0.1944 using Spot pricing, or 6 hours * $0.108/hr = $0.648 using On-Demand pricing.
Using 288 cpus on the ParallelCluster, it would take ~4.83 days to run a full year, using 8 c5n.18xlarge (36cpu/node) compute nodes.
Using 288 cpus on the ParallelCluster, it would take ~ 6.37 days to run a full year using 2 hpc6a.48xlarge (96cpu/node) compute nodes.
Using 126 cpus on the ParallelCluster, it would take ~8.92 days to run a full year, using 7 c5n.9xlarge (18cpu/node) compute nodes.
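The days-to-complete estimates follow from the 2-day benchmark wall times in the timing tables above; as a sketch of the arithmetic (using bc and the fastest 288-CPU c5n.18xlarge total of 2287.64 seconds from Table 2):
# annual wall-clock days = (2-day benchmark seconds / 2) * 365 days / 86400 seconds per day
echo "2287.64 / 2 * 365 / 86400" | bc -l    # ~4.83 days on 8 c5n.18xlarge nodes
The same formula applied to the hpc6a.48xlarge and c5n.9xlarge totals gives the other estimates.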
Table 8. Extrapolated Cost of compute nodes used for CMAQv5.3.3 Annual Simulation based on 2 day CONUS benchmark
| Benchmark Case | Compute Node | Number of PES | Number of Nodes | Pricing | Cost per node | Time to completion (hour) | Equation Extrapolate Cost for Annual Simulation | Annual Cost | Days to Complete Annual Simulation |
|---|---|---|---|---|---|---|---|---|---|
| 2 day 12US2 | c5n.18xlarge | 108 | 3 | SPOT | 1.1732/hour | 4550.72/3600 = 1.264 | 1.264/2 * 365 = 231 hours/node * 3 nodes = 692 hr * $1.1732/hr = | $811.9 | 9.61 |
| 2 day 12US2 | c5n.18xlarge | 108 | 3 | ONDEMAND | 3.888/hour | 4550.72/3600 = 1.264 | 1.264/2 * 365 = 231 hours/node * 3 nodes = 692 hr * $3.888/hr = | $2690.4 | 9.61 |
| 2 day 12US2 | c5n.18xlarge | 180 | 5 | SPOT | 1.1732/hour | 2980.19/3600 = .8278 | .8278/2 * 365 = 151 hours/node * 5 nodes = 755 hr * $1.1732/hr = | $886 | 6.29 |
| 2 day 12US2 | c5n.18xlarge | 180 | 5 | ONDEMAND | 3.888/hour | 2980.19/3600 = .8278 | .8278/2 * 365 = 151 hours/node * 5 nodes = 755 hr * $3.888/hr = | $2935.44 | 6.29 |
| 2 day 12US2 | c5n.9xlarge | 126 | 7 | SPOT | .5971/hour | 4042.71/3600 = 1.12 | 1.12/2 * 365 = 204.94 hours/node * 7 nodes = 1434.6 hr * $.5971/hr = | $856 | 8.52 |
| 2 day 12US2 | c5n.9xlarge | 126 | 7 | ONDEMAND | 1.944/hour | 4042.71/3600 = 1.12 | 1.12/2 * 365 = 204.94 hours/node * 7 nodes = 1434.6 hr * $1.944/hr = | $2788.8 | 8.52 |
| 2 day 12US2 | hpc6a.48xlarge | 96 | 1 | ONDEMAND | $2.88/hour | 5033.93/3600 = 1.40 | 1.40/2 * 365 = 255 hours/node * 1 nodes = 255 hr * $2.88/hr = | $734 | 10.6 |
| 2 day 12US2 | hpc6a.48xlarge | 192 | 2 | ONDEMAND | $2.88/hour | 3023.81/3600 = .839 | .839/2 * 365 = 153.29 hours/node * 2 nodes = 306 hr * $2.88/hr = | $883 | 6.4 |
Note
These cost estimates depend on the availability of that number of nodes for the instance type. If fewer nodes are available, it will take longer to complete the annual run, but the cost estimates should remain accurate, as the CONUS 12US2 domain benchmark scales well up to this number of nodes. For example, the cost of running an annual simulation on 3 c5n.18xlarge nodes using On-Demand pricing is $2690.4, while the cost on 5 c5n.18xlarge nodes using On-Demand pricing is $2935.44; if only 3 nodes are available, you would pay less but wait longer for the run to complete (9.61 days using 3 nodes versus 6.29 days using 5 nodes).
Storage Cost Estimate#
See also
Table 9. Lustre SSD File System Pricing for us-east-1 region
| Storage Type | Storage options | Pricing with data compression enabled* ($ per GB-month) | Pricing ($ per GB-month) |
|---|---|---|---|
| Persistent | 125 MB/s/TB | $0.073 | $0.145/month |
| Persistent | 250 MB/s/TB | $0.105 | $0.210/month |
| Persistent | 500 MB/s/TB | $0.170 | $0.340/month |
| Persistent | 1,000 MB/s/TB | $0.300 | $0.600/month |
| Scratch | 200 MB/s/TiB | $0.070 | $0.140/month |
Note, there is a difference between the tebibyte (TiB) and terabyte (TB) storage sizing units used by AWS.
See also
Quote from the above website: “One tebibyte is equal to 2^40 or 1,099,511,627,776 bytes. One terabyte is equal to 10^12 or 1,000,000,000,000 bytes. A tebibyte equals nearly 1.1 TB. That’s about a 10% difference between the size of a tebibyte and a terabyte, which is significant when talking about storage capacity.”
Lustre Scratch SSD at 200 MB/s/TiB is the storage pricing tier configured in the YAML file for the CMAQ ParallelCluster.
See also
Cost example: 0.14 USD per month / 730 hours in a month = 0.00019178 USD per hour
Note: 1.2 TiB is the minimum storage capacity that you can specify for the Lustre file system
1,200 GiB x 0.00019178 USD per hour x 24 hours x 5 days = 27.6 USD
Question: is 1.2 TiB enough for the output of a yearly CMAQ run?
For the output data, assume a 2-day CONUS run with all 35 layers and all 244 variables in the CONC output:
cd /fsx/data/output/output_CCTM_v532_gcc_2016_CONUS_16x8pe_full
du -sh
Size of output directory when CMAQ is run to output all 35 layers, all 244 variables in the CONC file, includes all other output files
173G .
So we need 86.5 GB per day
Storage requirement for an annual simulation if you assumed you would keep all data on lustre filesystem
86.5 GB * 365 days = 31,572.5 GB = 31.5 TB
Annual simulation local storage cost estimate#
Assuming it takes 5 days to complete the annual simulation, and after the annual simulation is completed, the data is moved to archive storage.
31,572.5 GB x 0.00019178 USD per hour x 24 hours x 5 days = $726.5 USD
To reduce storage requirements: after the CMAQ run is completed for each month, the post-processing scripts are run, and then the CMAQ output data for that month is moved from the Lustre file system to archive storage. The monthly data volume required to store 1 month of output on the Lustre file system is approximately 86.5 GB x 30 days = 2,595 GB, or about 2.6 TB.
2,595 GB x 0.00019178 USD per hour x 24 hours x 5 days = $60 USD
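A sketch of the storage-cost arithmetic used above, with bc (the rates and sizes are the ones quoted in this section):
# Lustre scratch rate: $0.140 per GB-month / 730 hours per month
echo "0.140 / 730" | bc -l                        # ~0.00019178 USD per GB-hour
# annual output volume at 86.5 GB/day
echo "86.5 * 365" | bc -l                         # ~31572.5 GB
# cost of holding the full year on Lustre for 5 days
echo "31572.5 * 0.00019178 * 24 * 5" | bc -l      # ~726 USD
# cost of holding one month (~2595 GB) on Lustre for 5 days
echo "2595 * 0.00019178 * 24 * 5" | bc -l         # ~60 USD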
Estimate for S3 Bucket cost for storing an annual simulation
See also
| S3 Standard - General purpose storage | Storage Pricing |
|---|---|
| First 50 TB / Month | $0.023 per GB |
| Next 450 TB / Month | $0.022 per GB |
| Over 500 TB / Month | $0.021 per GB |
Archive Storage cost estimate for annual simulation - assuming you want to save it for 1 year#
31.5 TB * 1024 GB/TB * .023 per GB * 12 months = $8,903
| S3 Glacier Flexible Retrieval (Formerly S3 Glacier) | Storage Pricing |
|---|---|
| Long-term archives with retrieval options from 1 minute to 12 hours | |
| All Storage / Month | $0.0036 per GB |
S3 Glacier Flexible Retrieval costs about 6.4 times less than S3 Standard.
31.5 TB * 1024 GB/TB * $.0036 per GB * 12 months = $1393.0 USD
A lower-cost option is S3 Glacier Deep Archive (for data accessed once or twice a year and restored within 12 hours).
31.5 TB * 1024 GB/TB * $.00099 per GB * 12 months = $383 USD
Recommended Workflow for extending to annual run#
Post-process monthly: save the model output and/or post-processed outputs to the S3 bucket at the end of each month.
Still need to determine size of post-processed output (combine output, etc).
86.5 GB * 31 days = 2,681.5 GB * 1 TB/1024 GB = 2.62 TB
Cost for lustre storage of a monthly simulation
2,681.5 GB x 0.00019178 USD per hour x 24 hours x 5 days = $61.7 USD
The goal is to develop a reproducible workflow that does the post-processing after every month of simulation and then copies what is required to the S3 bucket, so that only 1 month of output at a time is kept on the Lustre scratch file system. This workflow also helps preserve the data in case the cluster or the scratch file system gets pre-empted.
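A minimal sketch of such a monthly archive step is shown below; the month label, output path, combine invocation, and bucket name are placeholders, not scripts provided in this repository:
#!/bin/csh
# hypothetical end-of-month archive step
set MONTH = 2016_01                                    # placeholder month label
set OUTDIR = /fsx/data/output/output_CCTM_${MONTH}     # placeholder output directory
# 1. run the combine post-processing for the month (placeholder invocation)
./run_combine.csh
# 2. copy raw and post-processed output to an S3 bucket (placeholder bucket name)
aws s3 cp --recursive ${OUTDIR} s3://your-cmaq-archive-bucket/${MONTH}/
# 3. remove the month from the Lustre scratch file system to free space
rm -rf ${OUTDIR}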
Side by Side Comparison of the information in the log files for 12x9 pe run compared to 9x12 pe run.#
cd /shared/pcluster-cmaq/c5n.18xlarge_scripts_logs
sdiff run_cctmv5.3.3_Bench_2016_12US2.108.12x9pe.2day.pcluster.log run_cctmv5.3.3_Bench_2016_12US2.108.9x12pe.2day.pcluster.log | more
Output:
Start Model Run At Fri Feb 25 20:48:42 UTC 2022 | Start Model Run At Thu Feb 24 01:04:42 UTC 2022
information about processor including whether using hyperthre information about processor including whether using hyperthre
Architecture: x86_64 Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits vi Address sizes: 46 bits physical, 48 bits vi
CPU(s): 36 CPU(s): 36
On-line CPU(s) list: 0-35 On-line CPU(s) list: 0-35
Thread(s) per core: 1 Thread(s) per core: 1
Core(s) per socket: 18 Core(s) per socket: 18
Socket(s): 2 Socket(s): 2
NUMA node(s): 2 NUMA node(s): 2
Vendor ID: GenuineIntel Vendor ID: GenuineIntel
CPU family: 6 CPU family: 6
Model: 85 Model: 85
Model name: Intel(R) Xeon(R) Platinum 81 Model name: Intel(R) Xeon(R) Platinum 81
Stepping: 4 Stepping: 4
CPU MHz: 2887.020 | CPU MHz: 2999.996
BogoMIPS: 5999.98 | BogoMIPS: 5999.99
Hypervisor vendor: KVM Hypervisor vendor: KVM
Virtualization type: full Virtualization type: full
L1d cache: 1.1 MiB L1d cache: 1.1 MiB
L1i cache: 1.1 MiB L1i cache: 1.1 MiB
L2 cache: 36 MiB L2 cache: 36 MiB
L3 cache: 49.5 MiB L3 cache: 49.5 MiB
NUMA node0 CPU(s): 0-17 NUMA node0 CPU(s): 0-17
NUMA node1 CPU(s): 18-35 NUMA node1 CPU(s): 18-35
=========================================== ===========================================
|>--- ENVIRONMENT VARIABLE REPORT ---<| |>--- ENVIRONMENT VARIABLE REPORT ---<|
=========================================== ===========================================
|> Grid and High-Level Model Parameters: |> Grid and High-Level Model Parameters:
+========================================= +=========================================
--Env Variable-- | --Value-- --Env Variable-- | --Value--
------------------------------------------------------- -------------------------------------------------------
BLD | (default) BLD | (default)
OUTDIR | /fsx/data/output/output_CCTM_v533_gcc_20 | OUTDIR | /fsx/data/output/output_CCTM_v533_gcc_20
NEW_START | T NEW_START | T
ISAM_NEW_START | Y (default) ISAM_NEW_START | Y (default)
GRID_NAME | 12US2 GRID_NAME | 12US2
CTM_TSTEP | 10000 CTM_TSTEP | 10000
CTM_RUNLEN | 240000 CTM_RUNLEN | 240000
CTM_PROGNAME | DRIVER (default) CTM_PROGNAME | DRIVER (default)
CTM_STDATE | 2015356 CTM_STDATE | 2015356
CTM_STTIME | 0 CTM_STTIME | 0
NPCOL_NPROW | 12 9 | NPCOL_NPROW | 9 12
CTM_MAXSYNC | 300 CTM_MAXSYNC | 300
================================== ==================================
***** CMAQ TIMING REPORT ***** ***** CMAQ TIMING REPORT *****
================================== ==================================
Start Day: 2015-12-22 Start Day: 2015-12-22
End Day: 2015-12-23 End Day: 2015-12-23
Number of Simulation Days: 2 Number of Simulation Days: 2
Domain Name: 12US2 Domain Name: 12US2
Number of Grid Cells: 3409560 (ROW x COL x LAY) Number of Grid Cells: 3409560 (ROW x COL x LAY)
Number of Layers: 35 Number of Layers: 35
Number of Processes: 108 Number of Processes: 108
All times are in seconds. All times are in seconds.
Num Day Wall Time Num Day Wall Time
01 2015-12-22 2758.01 | 01 2015-12-22 2454.11
02 2015-12-23 2370.92 | 02 2015-12-23 2142.11
Total Time = 5128.93 | Total Time = 4596.22
Avg. Time = 2564.46 | Avg. Time = 2298.11
Developer Guide to install and run CMAQv5.3.3 on a Single VM or ParallelCluster#
CMAQv5.3.3 on Single Virtual Machine Advanced (optional)#
Run CMAQv5.3.3 on a single Virtual Machine (VM) using c6a.xlarge (4 CPUs) and Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-1031-aws x86_64), then upgrade to c6a.48xlarge.
Install Software and run CMAQv5.3.3 on c6a.2xlarge for the 2016_12US3 Benchmark#
Instructions are provided to build and install CMAQ on c6a.2xlarge compute node installed from Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-1031-aws x86_64) Image that contains modules for git, openmpi and gcc. The compute node does not have a SLURM scheduler on it, so jobs are run interactively from the command line.
Instructions to install data and CMAQ libraries and model are provided along with sample run scripts to run CMAQ on 4 processors on a single c6a.2xlarge instance.
This will provide users with experience using the AWS Console to create a Virtual Machine, select Operating System, select the size of the VM as c6a.2xlarge vcpus, 8 GiB memory, using an SSH private key to login and install and run CMAQ.
Using this method, the user needs to be careful to start and stop the Virtual Machine and only have it running while doing the initial installation and while running CMAQ. The c6a.2xlarge instance will incur charges as long as it is on, even if a job isn’t running on it.
This is different than the Parallel Cluster, where if CMAQ is not running in the queue, then the Compute nodes are down, and not incurring costs.
Build CMAQv5.3.3 on c6a.2xlarge EC2 instance#
Create a c6a.xlarge Virtual Machine#
Login to AWS Console
Select Get Started with EC2
Select Launch Instance
Application and OS (Operating System) Images: Select Ubuntu 22.04 LTS (HVM), SSD Volume Type. (The version of the OS determines what packages are available from apt-get, and that determines the version of software obtained, e.g. cdo version > 2.0 for Ubuntu 22.04 LTS, or cdo version < 2.0 for Ubuntu 18.04.)
Instance Type: Select c6a.2xlarge ($0.xxx/hr)
Key pair - SSH public key, select existing key or create a new one.
Network settings - select default settings
Configure storage - select 100 GiB gp3 Root volume
Select Launch instance
Login to the Virtual Machine#
Change the permissions on the public key using command
chmod 400 [your-key-name].pem
Login to the Virtual Machine using ssh to the IP address using the public key.
ssh -Y -i ./xxxxxxx_key.pem ubuntu@xx.xx.xx.xx
Check operating system version#
lsb_release -a
output
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.2 LTS
Release: 22.04
Codename: jammy
Install Environment Modules#
sudo apt-get upgrade
sudo apt-get install environment-modules
Logout and then log back in to activate modules command#
Verify module command works#
module list
Output:
No Modulefiles Currently Loaded.
module avail
Output:
--------------------------------------------------------------------------------------- /usr/share/modules/modulefiles ---------------------------------------------------------------------------------------
dot module-git module-info modules null use.own
Set up build environment#
Load the git module
module load module-git
If you do not see git available as a module, you may need to install it as follows:
sudo apt-get install git
Install Compilers and OpenMPI#
sudo apt-get update
sudo apt-get install gcc-9
sudo apt-get install gfortran-9
sudo apt-get install openmpi-bin openmpi-common libopenmpi-dev libgtk2.0-dev
sudo apt-get install tcsh
Change shell to use tcsh#
sudo usermod -s /usr/bin/tcsh ubuntu
Logout and log back in, then check the shell#
echo $SHELL
output
/usr/bin/tcsh
Check available versions of compiler#
dpkg --list | grep compiler
Choose gcc-9 and gfortran-9 as default compilers#
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 9
sudo update-alternatives --install /usr/bin/gfortran gfortran /usr/bin/gfortran-9 9
Check version of gcc#
gcc --version
output
gcc --version
gcc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
Check version of gfortran#
gfortran --version
Output
GNU Fortran (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
Check version of OpenMPI#
mpirun --version
output
mpirun (Open MPI) 4.1.2
Install Parallel Cluster CMAQ Repo#
cd /shared
git clone -b main https://github.com/CMASCenter/pcluster-cmaq.git
Install and build netcdf C, netcdf Fortran, I/O API, and CMAQ#
cd /shared/pcluster-cmaq/install
Install netcdf-C and netcdf-Fortran#
./gcc_netcdf_singlevm.csh |& tee ./gcc_netcdf_singlevm.log
If successful, you will see the following output, which at the bottom shows which versions of the netCDF libraries were installed.
+-------------------------------------------------------------+
| Congratulations! You have successfully installed the netCDF |
| Fortran libraries. |
| |
| You can use script "nf-config" to find out the relevant |
| compiler options to build your application. Enter |
| |
| nf-config --help |
| |
| for additional information. |
| |
| CAUTION: |
| |
| If you have not already run "make check", then we strongly |
| recommend you do so. It does not take very long. |
| |
| Before using netCDF to store important data, test your |
| build with "make check". |
| |
| NetCDF is tested nightly on many platforms at Unidata |
| but your platform is probably different in some ways. |
| |
| If any tests fail, please see the netCDF web site: |
| https://www.unidata.ucar.edu/software/netcdf/ |
| |
| NetCDF is developed and maintained at the Unidata Program |
| Center. Unidata provides a broad array of data and software |
| tools for use in geoscience education and research. |
| https://www.unidata.ucar.edu |
+-------------------------------------------------------------+
make[3]: Leaving directory '/shared/build/netcdf-fortran-4.5.4'
make[2]: Leaving directory '/shared/build/netcdf-fortran-4.5.4'
make[1]: Leaving directory '/shared/build/netcdf-fortran-4.5.4'
netCDF 4.8.1
netCDF-Fortran 4.5.3
Install I/O API#
./gcc_ioapi_singlevm.csh |& tee ./gcc_ioapi_singlevm.log
Find what operating system is on the system:
cat /etc/os-release
Output
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
Copy a file to set paths#
cd /shared/pcluster-cmaq/install
cp dot.cshrc.singlevm ~/.cshrc
Exit the VM and log back in to activate the updated shell, or use csh#
Create Custom Environment Module for Libraries#
There are two steps required to create your own custom module:
write a module file
add a line to your ~/.cshrc to update the MODULEPATH
Create a new custom module that will be loaded with:
module load ioapi-3.2/gcc-9.5-netcdf
Step 1: Create the module file for ioapi-3.2.
First, create a path to store the module file. The path must contain /Modules/modulefiles/ and should have the general form /<base-path>/Modules/modulefiles/<module-name>/<version>.
mkdir -p /shared/build/Modules/modulefiles/ioapi-3.2
Next, create the module file and save it in the directory above.
cd /shared/build/Modules/modulefiles/ioapi-3.2
vim gcc-9.5-netcdf
Contents of gcc-9.5-netcdf:
#%Module
proc ModulesHelp { } {
puts stderr "This module adds ioapi-3.2/gcc-9.5 to your path"
}
module-whatis "This module adds ioapi-3.2/gcc-9.5 to your path\n"
set basedir "/shared/build/ioapi-3.2/"
prepend-path PATH "${basedir}/Linux2_x86_64gfort"
prepend-path LD_LIBRARY_PATH "${basedir}/ioapi/fixed_src"
The example module file above sets two environment variables: it prepends entries to PATH and LD_LIBRARY_PATH.
Step 2. Create the module file for netcdf-4.8.1
mkdir -p /shared/build/Modules/modulefiles/netcdf-4.8.1
Next, create the module file and save it in the directory above.
cd /shared/build/Modules/modulefiles/netcdf-4.8.1
vim gcc-9.5
Contents of gcc-9.5
#%Module
proc ModulesHelp { } {
puts stderr "This module adds netcdf-4.8.1/gcc-9.5 to your path"
}
module-whatis "This module adds netcdf-4.8.1/gcc-9.5 to your path\n"
set basedir "/shared/build/netcdf"
prepend-path PATH "${basedir}/bin"
prepend-path LD_LIBRARY_PATH "${basedir}/lib"
module load mpi/openmpi-4.1.2
Step 3. Create the module file for mpi
mkdir -p /shared/build/Modules/modulefiles/mpi
Next, create the module file and save it in the directory above.
cd /shared/build/Modules/modulefiles/mpi
vim openmpi-4.1.2
Contents of openmpi-4.1.2
#%Module
proc ModulesHelp { } {
puts stderr "This module adds mpi/openmpi-4.1.2 to your path"
}
module-whatis "This module adds mpi/openmpi-4.1.2 to your path\n"
set basedir "/usr/lib/x86_64-linux-gnu/openmpi/"
prepend-path PATH "/usr/bin/"
prepend-path LD_LIBRARY_PATH "${basedir}/lib"
Step 4: Add the module path to MODULEPATH.
Now that the module file has been created, add the following line to your ~/.cshrc file so that it can be found:
module use --append /shared/build/Modules/modulefiles
Step 5: View the modules available after creation of the new module
The module avail command shows the paths to the module files on a given cluster.
module avail
Output
ioapi-3.2/gcc-9.5-netcdf mpi/openmpi-4.1.2 netcdf-4.8.1/gcc-9.5
Step 6: Load the new modules
module load ioapi-3.2/gcc-9.5-netcdf mpi/openmpi-4.1.2 netcdf-4.8.1/gcc-9.5
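To confirm the modules resolved the expected tools, a quick check (assuming the install paths used above):
module list      # should show the three modules just loaded
which mpirun     # expected: /usr/bin/mpirun
which ncdump     # expected: /shared/build/netcdf/bin/ncdump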
Find path for openmpi libraries#
ompi_info --path libdir
output
Libdir: /usr/lib/x86_64-linux-gnu/openmpi/lib
Find path for include files for openmpi#
ompi_info --path incdir
output
Incdir: /usr/lib/x86_64-linux-gnu/openmpi/include
Edit the config_cmaq_singlevm.csh script to specify the paths for OpenMPI#
Note, search for case gcc so that you edit the section of the file that is using the gcc compiler.
setenv MPI_INCL_DIR /usr/lib/x86_64-linux-gnu/openmpi/include #> MPI Include directory path
setenv MPI_LIB_DIR /usr/lib/x86_64-linux-gnu/openmpi/lib #> MPI Lib directory path
Install Python#
sudo apt-get install python3 python3-pip
Check Version
python3 --version
Python 3.10.6
python3 -m pip --version
pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)
Install jupyter notebook.#
pip install jupyterlab
Install and Build CMAQ#
cd /shared/pcluster-cmaq/install
./gcc_cmaq533_singlevm.csh |& tee ./gcc_cmaq533_singlevm.log
SKIP this step when using gcc-9: the -fallow-argument-mismatch flag is only needed for newer gfortran versions (gcc 10 and later) that treat Fortran argument mismatches as errors.
Add the following to the compile options: -fallow-argument-mismatch
cd /shared/build/openmpi_gcc/CMAQ_v54+/CCTM/scripts/BLD_CCTM_v54_gcc
vi Makefile.gcc
Output:
FSTD = -fallow-argument-mismatch -O3 -funroll-loops -finit-character=32 -Wtabs -Wsurprising -ftree-vectorize -ftree-loop-if-convert -finline-limit=512
Run make again#
make |& tee Make.log
Verify that the executable was successfully built.
ls /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/BLD_CCTM_v533_gcc/*.exe
Output
/shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/BLD_CCTM_v533_gcc/CCTM_v533.exe
Check to see what scripts are available#
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts
List the scripts available
ls -rlt *.csh*
Output
-rwxrwxr-x 1 ubuntu ubuntu 34318 Jul 19 17:47 run_cctm_Bench_2011_12SE1.csh
-rwxrwxr-x 1 ubuntu ubuntu 32649 Jul 19 17:47 bldit_cctm.csh
-rwxrwxr-x 1 ubuntu ubuntu 36130 Jul 19 17:47 run_cctm_2016_12US1.csh
-rwxrwxr-x 1 ubuntu ubuntu 36850 Jul 19 17:47 run_cctm_2015_HEMI.csh
-rwxrwxr-x 1 ubuntu ubuntu 34948 Jul 19 17:47 run_cctm_2014_12US1.csh
-rwxrwxr-x 1 ubuntu ubuntu 34262 Jul 19 17:47 run_cctm_2011_12US1.csh
-rwxrwxr-x 1 ubuntu ubuntu 35242 Jul 19 17:47 run_cctm_2010_4CALIF1.csh
-rwxrwxr-x 1 ubuntu ubuntu 49472 Jul 19 17:47 run_cctm_Bench_2016_12SE1.WRFCMAQ.csh
-rwxrwxr-x 1 ubuntu ubuntu 35799 Jul 19 18:43 run_cctm_Bench_2016_12SE1.csh
Download the Input data from the S3 Bucket#
Install aws command line#
see Install AWS CLI
cd /shared/build
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
Install unzip and unzip file#
sudo apt install zip unzip
/usr/bin/unzip awscliv2.zip
sudo ./aws/install
output
You can now run: /usr/local/bin/aws --version
Note, you will need to add this path to your .cshrc
Edit .cshrc#
vi ~/.cshrc
add the following to the path /usr/local/bin
Output:
# start .cshrc
umask 002
if ( ! $?LD_LIBRARY_PATH ) then
setenv LD_LIBRARY_PATH /shared/build/netcdf/lib
else
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/shared/build/netcdf/lib
endif
set path = ($path /shared/build/netcdf/bin /shared/build/ioapi-3.2/Linux2_x86_64gfort /opt/slurm/bin/ /usr/local/bin/ )
if ($?tcsh) then
source /usr/share/modules/init/tcsh
else
source /usr/share/modules/init/csh
endif
Install the input data using the s3 script#
need scriptable method to obtain 12SE1 benchmark
Note, this Virtual Machine does not have Slurm installed or configured.
Run CMAQ interactively using the following command:#
First check to see how many cpus you have available on the machine.#
lscpu
Output
CPU(s): 4
On-line CPU(s) list: 0-3
Verify that the run script is set to run on 4 cpus
@ NPCOL = 2; @ NPROW = 2
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts
./run_cctm_Bench_2016_12SE1.csh |& tee ./run_cctm_Bench_2016_12SE1.log
When the run has completed, record the timing of the two day benchmark.
tail -n 30 run_cctm_Bench_2016_12SE1.log
Output on 4 cores
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2016-07-01
End Day: 2016-07-01
Number of Simulation Days: 1
Domain Name: 2016_12SE1
Number of Grid Cells: 280000 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 4
All times are in seconds.
Num Day Wall Time
01 2016-07-01 2083.32
Total Time = 2083.32
Avg. Time = 2083.32
Install I/O API libraries that support HDF5#
This is required in order to:
Run CMAQ using the compressed netCDF-4 input files provided on the S3 bucket or
Convert the *.nc4 files to *.nc files (to uncompressed classic netCDF-3 input files)
First build HDF5 libraries, then build netCDF-C, netCDF-Fortran
cd /shared/pcluster-cmaq
./gcc11_install_hdf5.csh
Upgrade to run CMAQ on larger EC2 Instance#
Save the AMI and create a new VM using a larger c6a.8xlarge (with 32 processors)#
Requires access to the AWS Web Console. (Instructions for doing this from the AWS command line are not part of this tutorial; a possible CLI sketch is shown after the launch step below.)
Use the AWS Console to Stop the Image#
add screenshot
Use the AWS Console to Create a new AMI#
add screenshot
Check to see that the AMI has been created by examining the status. Wait for the status to change from Pending to Available.
Use the newly created AMI to launch a new Single VM using a larger EC2 instance.#
Launch a new instance using the AMI with the software loaded and request a spot instance for the c6a.8xlarge EC2 instance
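If you prefer the command line, the same stop / create-image / launch sequence can be approximated with the AWS CLI; this is an untested sketch, and the instance ID, AMI ID, key name, and security group below are placeholders that must be replaced with your own values:
# stop the running VM (placeholder instance ID)
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
# create an AMI from the stopped VM
aws ec2 create-image --instance-id i-0123456789abcdef0 --name "cmaq-singlevm-ami"
# once the AMI state changes from pending to available, launch a larger instance from it
aws ec2 run-instances --image-id ami-0123456789abcdef0 --instance-type c6a.8xlarge --key-name your-key --security-group-ids sg-0123456789abcdef0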
Load the modules#
Test running the listos domain on 32 processors#
Output
Processing Day/Time [YYYYDDD:HHMMSS]: 2017357:235600
Which is Equivalent to (UTC): 23:56:00 Saturday, Dec. 23, 2017
Time-Step Length (HHMMSS): 000400
VDIFF completed... 3.6949 seconds
COUPLE completed... 0.3336 seconds
HADV completed... 1.8413 seconds
ZADV completed... 0.5154 seconds
HDIFF completed... 0.4116 seconds
DECOUPLE completed... 0.0696 seconds
PHOT completed... 0.7443 seconds
CLDPROC completed... 2.4009 seconds
CHEM completed... 1.3362 seconds
AERO completed... 1.3210 seconds
Master Time Step
Processing completed... 12.6698 seconds
=--> Data Output completed... 0.9872 seconds
==============================================
|>--- PROGRAM COMPLETED SUCCESSFULLY ---<|
==============================================
Date and time 0:00:00 Dec. 24, 2017 (2017358:000000)
The elapsed time for this simulation was 3389.0 seconds.
315644.552u 1481.008s 56:29.98 9354.7% 0+0k 33221248+26871200io 9891pf+0w
CMAQ Processing of Day 20171223 Finished at Wed Jun 7 02:25:47 UTC 2023
\\\\\=====\\\\\=====\\\\\=====\\\\\=====/////=====/////=====/////=====/////
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2018-08-05
End Day: 2018-08-07
Number of Simulation Days: 3
Domain Name: 2018_12Listos
Number of Grid Cells: 21875 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 32
All times are in seconds.
Num Day Wall Time
01 2018-08-05 80.6
02 2018-08-06 72.7
03 2018-08-07 76.3
Total Time = 229.60
Avg. Time = 76.53
Run CMAQv5.4 for the full 12US1 Domain on c6a.48xlarge#
Download the full 12US1 Domain that is netCDF4 compressed and convert it to classic netCDF-3 compression.
Note: I first tried running this domain on the c6a.8xlarge on 32 processors. The model failed, with a signal 9 - likely not enough memory available to run the model.
I re-saved the AMI and launched a c6a.48xlarge with 192 vcpus, running as spot instance.
Spot Pricing cost for Linux in US East Region
c6a.48xlarge $6.4733 per Hour
Run utility to uncompress hdf5 *.nc4 files and save as classic *.nc files#
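One way to perform this conversion is with the nccopy utility from the HDF5-enabled netCDF build in the previous section; this is a sketch with placeholder file names and paths, not the exact script used for the 12US1 data:
# convert one compressed netCDF-4 (HDF5) file to uncompressed netCDF-3 classic format
nccopy -k classic input_file.nc4 input_file.nc
# convert every *.nc4 file in a directory (placeholder path), keeping the .nc extension
foreach f (/fsx/data/12US1/*.nc4)
  nccopy -k classic $f $f:r.nc
end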
You may also need to look at disabling hyperthreading (simultaneous multithreading) at runtime, since there is no ParallelCluster YAML option on a single VM.
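One possible approach, assuming a Linux kernel that exposes the SMT control file (this is not a step from the original tutorial):
# check the current threads per core
lscpu | grep "Thread(s) per core"
# turn off simultaneous multithreading for the running kernel (requires root; not persistent across reboots)
echo off | sudo tee /sys/devices/system/cpu/smt/control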
CMAQv5.3.3 Advanced Tutorial (optional)#
Learn how to upgrade the ParallelCluster, by first creating a cluster that uses c5n.4xlarge as the compute nodes, and then upgrading the cluster to use c5n.18xlarge as the compute nodes.
Learn how to install CMAQ software and underlying libraries, copy input data, and run CMAQ.
Notice
Skip this tutorial if you successfully completed the Intermediate Tutorial and wish to proceed to the post-processing and QA instructions. Note, you may wish to build the underlying libraries and CMAQ code if you want to create a ParallelCluster using a different family of compute nodes, such as the c6gn.16xlarge (AWS Graviton, Arm-based) compute nodes.
Advanced Tutorial (optional)
Use ParallelCluster without Software and Data pre-installed#
Step by step instructions to configuring and running a ParallelCluster for the CMAQ 12US2 benchmark with instructions to install the libraries and software.
Notice
Skip this tutorial if you successfully completed the Intermediate Tutorial, unless you need to build the CMAQ libraries and code and run on a different family of compute nodes, such as the c6gn.16xlarge (AWS Graviton, Arm-based) compute nodes.
Create CMAQ Cluster using SPOT pricing#
Use an existing yaml file from the git repo to create a ParallelCluster#
cd /your/local/machine/install/path/
Use a configuration file from the github repo that was cloned to your local machine#
git clone -b main https://github.com/CMASCenter/pcluster-cmaq.git pcluster-cmaq
cd pcluster-cmaq
Edit the c5n-4xlarge.yaml#
vi c5n-4xlarge.yaml
Note
the c5n-4xlarge.yaml is configured to use SPOT instance pricing for the compute nodes.
the c5n-4xlarge.yaml is configured to use the c5n.4xlarge as the compute node, with up to 10 compute nodes, specified by MaxCount: 10.
the c5n-4xlarge.yaml is configured to disable multithreading (this option restricts computing to physical CPUs rather than allowing the use of all virtual CPUs: 16 virtual CPUs are reduced to 8 CPUs).
given this YAML configuration, the maximum number of PEs that could be used to run CMAQ is 8 CPUs x 10 nodes = 80, so the maximum settings for NPCOL and NPROW are NPCOL = 8, NPROW = 10 or NPCOL = 10, NPROW = 8 in the CMAQ run script, as illustrated in the sketch below.
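A run script matched to this YAML configuration could request all 80 processors as follows; these are illustrative values, not the defaults in the repository run scripts:
#SBATCH --nodes=10               # MaxCount in c5n-4xlarge.yaml
#SBATCH --ntasks-per-node=8      # physical CPUs per c5n.4xlarge node
# in the CMAQ run script:
# @ NPCOL = 8; @ NPROW = 10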
Replace the key pair and subnet ID in the c5n-4xlarge.yaml file with the values created when you configured the demo cluster#
Region: us-east-1
Image:
Os: ubuntu2004
HeadNode:
InstanceType: c5n.large
Networking:
SubnetId: subnet-xx-xx-xx << replace
DisableSimultaneousMultithreading: true
Ssh:
KeyName: your_key << replace
Scheduling:
Scheduler: slurm
SlurmQueues:
- Name: queue1
CapacityType: SPOT
Networking:
SubnetIds:
- subnet-xx-xx-x x << replace
ComputeResources:
- Name: compute-resource-1
InstanceType: c5n.4xlarge
MinCount: 0
MaxCount: 10
DisableSimultaneousMultithreading: true
SharedStorage:
- MountDir: /shared
Name: ebs-shared
StorageType: Ebs
- MountDir: /fsx
Name: name2
StorageType: FsxLustre
FsxLustreSettings:
StorageCapacity: 1200
The Yaml file for the c5n-4xlarge contains the settings as shown in the following diagram.#
Figure 1. Diagram of YAML file used to configure a ParallelCluster with a c5n.large head node and c5n.4xlarge compute nodes using SPOT pricing
Create the c5n-4xlarge pcluster#
pcluster create-cluster --cluster-configuration c5n-4xlarge.yaml --cluster-name cmaq --region us-east-1
Check on status of cluster#
pcluster describe-cluster --region=us-east-1 --cluster-name cmaq
After 5-10 minutes, you see the following status: “clusterStatus”: “CREATE_COMPLETE”
Start the compute nodes#
pcluster update-compute-fleet --region us-east-1 --cluster-name cmaq --status START_REQUESTED
Login to cluster#
Note
Replace the your-key.pem with your Key Pair.
pcluster ssh -v -Y -i ~/your-key.pem --cluster-name cmaq
Show compute nodes#
scontrol show nodes
Output:
NodeName=queue1-dy-compute-resource-1-10 CoresPerSocket=1
CPUAlloc=0 CPUTot=8 CPULoad=N/A
AvailableFeatures=dynamic,c5n.4xlarge,compute-resource-1
ActiveFeatures=dynamic,c5n.4xlarge,compute-resource-1
Gres=(null)
NodeAddr=queue1-dy-compute-resource-1-10 NodeHostName=queue1-dy-compute-resource-1-10
RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=8 Boards=1
State=IDLE+CLOUD+POWERED_DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=queue1
BootTime=None SlurmdStartTime=None
LastBusyTime=Unknown
CfgTRES=cpu=8,mem=1M,billing=8
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Update the compute nodes#
Before building the software, verify that you can update the compute nodes from c5n.4xlarge to c5n.18xlarge.
Updating the compute node from a c5n.4xlarge (max 8 CPUs per compute node) to a c5n.18xlarge (max 36 CPUs per compute node) allows the benchmark case to be run on up to 360 CPUs (36 CPUs/node x 10 nodes).
Note
Provisioning 10 c5n.18xlarge in one region may be difficult. In practice, it is possible to obtain 8 c5n.18xlarge compute nodes, with 36 cpu/node x 8 nodes = 288 cpus.
Note
The c5n.18xlarge requires that the Elastic Fabric Adapter is enabled in the YAML file. Exit the pcluster and return to your local command line.
If you only modified the YAML file to update the compute node instance type, without making additional updates to the network and other settings, you would not achieve all of the benefits of using the c5n.18xlarge compute node in the ParallelCluster.
For this reason, a YAML file that contains these advanced options to support the c5n.18xlarge compute instance will be used to upgrade the ParallelCluster from c5n.4xlarge to c5n.18xlarge.
Exit the cluster#
exit
Stop the compute nodes#
pcluster update-compute-fleet --region us-east-1 --cluster-name cmaq --status STOP_REQUESTED
Verify that the compute nodes are stopped#
pcluster describe-cluster --region=us-east-1 --cluster-name cmaq
keep rechecking until you see the following status “computeFleetStatus”: “STOPPED”,
Examine the differences between YAML files#
The YAML file for the c5n.large head node and c5n.18xlarge compute nodes contains additional settings compared to the YAML file that used the c5n.4xlarge as the compute node.
Note
the c5n-18xlarge.yaml is configured to use SPOT instance pricing for the compute nodes.
the c5n-18xlarge.yaml is configured to use the c5n.18xlarge as the compute node, with up to 10 compute nodes, specified by MaxCount: 10.
the c5n-18xlarge.yaml is configured to disable multithreading (this option restricts computing to physical CPUs rather than allowing the use of all virtual CPUs: 72 virtual CPUs are reduced to 36 CPUs).
the c5n-18xlarge.yaml is configured to enable a placement group to allow low inter-node latency.
the c5n-18xlarge.yaml is configured to enable the Elastic Fabric Adapter.
Figure 2. Diagram of YAML file used to configure a ParallelCluster with a c5n.large head node and c5n.18xlarge compute nodes (36 CPUs per node)
Note
Notice that the c5n-18xlarge yaml configuration file contains a setting for PlacementGroup.
PlacementGroup:
Enabled: true
A placement group is used to get the lowest inter-node latency.
A placement group guarantees that your instances are on the same networking backbone.
Edit the YAML file for c5n.18xlarge#
You will need to edit the c5n-18xlarge.yaml to specify your KeyName and SubnetId (use the values generated in your new-hello-world.yaml). This YAML file specifies ubuntu2004 as the OS, c5n.large for the head node, and c5n.18xlarge as the compute nodes; it configures both a /shared EBS directory (for the software install) and a /fsx Lustre file system (for input and output data), and enables the Elastic Fabric Adapter.
vi c5n-18xlarge.yaml
Output:
Region: us-east-1
Image:
Os: ubuntu2004
HeadNode:
InstanceType: c5n.large
Networking:
SubnetId: subnet-018cfea3edf3c4765 <<< replace
DisableSimultaneousMultithreading: true
Ssh:
KeyName: centos <<< replace
Scheduling:
Scheduler: slurm
SlurmSettings:
ScaledownIdletime: 5
SlurmQueues:
- Name: queue1
CapacityType: SPOT
Networking:
SubnetIds:
- subnet-018cfea3edf3c4765 <<< replace
PlacementGroup:
Enabled: true
ComputeResources:
- Name: compute-resource-1
InstanceType: c5n.18xlarge
MinCount: 0
MaxCount: 10
DisableSimultaneousMultithreading: true
Efa: <<< Note new section that enables elastic fabric adapter
Enabled: true
GdrSupport: false
SharedStorage:
- MountDir: /shared
Name: ebs-shared
StorageType: Ebs
- MountDir: /fsx
Name: name2
StorageType: FsxLustre
FsxLustreSettings:
StorageCapacity: 1200
Update the cluster to use c5n.18xlarge compute nodes#
Use the pcluster command to update the cluster to use the c5n.18xlarge compute node
pcluster update-cluster --region us-east-1 --cluster-name cmaq --cluster-configuration c5n-18xlarge.yaml
Verify that the compute nodes have been updated#
pcluster describe-cluster --region=us-east-1 --cluster-name cmaq
Output:
{
"creationTime": "2022-02-23T17:39:42.953Z",
"headNode": {
"launchTime": "2022-02-23T17:48:03.000Z",
"instanceId": "xxx-xx-xx",
"publicIpAddress": "xx-xx-xx",
"instanceType": "c5n.large",
"state": "running",
"privateIpAddress": "xx-xx-xx"
},
"version": "3.1.1",
"clusterConfiguration": {
},
"tags": [
{
"value": "3.1.1",
"key": "parallelcluster:version"
}
],
"cloudFormationStackStatus": "UPDATE_IN_PROGRESS",
"clusterName": "cmaq",
"computeFleetStatus": "STOPPED",
"cloudformationStackArn":
"lastUpdatedTime": "2022-02-23T17:56:31.114Z",
"region": "us-east-1",
"clusterStatus": "UPDATE_IN_PROGRESS"
Wait 5 to 10 minutes for the update to be completed#
Keep rechecking the status until clusterStatus shows UPDATE_COMPLETE (the compute fleet remains STOPPED until it is restarted in the next step)
pcluster describe-cluster --region=us-east-1 --cluster-name cmaq
Output:
{
"creationTime": "2022-02-23T17:39:42.953Z",
"headNode": {
"launchTime": "2022-02-23T17:48:03.000Z",
"instanceId": "xx-xx-xxx",
"publicIpAddress": "xx-xx-xx",
"instanceType": "c5n.large",
"state": "running",
"privateIpAddress": "xx-xxx-xx"
},
"version": "3.1.1",
"clusterConfiguration": {
},
"tags": [
{
"value": "3.1.1",
"key": "parallelcluster:version"
}
],
"cloudFormationStackStatus": "UPDATE_COMPLETE",
"clusterName": "cmaq",
"computeFleetStatus": "STOPPED",
"cloudformationStackArn":
"lastUpdatedTime": "2022-02-23T17:56:31.114Z",
"region": "us-east-1",
"clusterStatus": "UPDATE_COMPLETE"
}
Wait until UPDATE_COMPLETE message is received, then proceed.
Re-start the compute nodes#
pcluster update-compute-fleet --region us-east-1 --cluster-name cmaq --status START_REQUESTED
Verify status of cluster#
pcluster describe-cluster --region=us-east-1 --cluster-name cmaq
Wait until you see
computeFleetStatus": "RUNNING",
Login to c5n.18xlarge cluster#
Note
Replace the your-key.pem with your Key Pair.
pcluster ssh -v -Y -i ~/your-key.pem --cluster-name cmaq
Check to make sure elastic network adapter (ENA) is enabled#
modinfo ena
lspci
Check what modules are available on the cluster#
module avail
Load the openmpi module#
module load openmpi/4.1.1
Load the Libfabric module#
module load libfabric-aws/1.13.0amzn1.0
Verify the gcc compiler version is greater than 8.0#
gcc --version
output:
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 Copyright (C) 2019 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See also
Install Input Data on ParallelCluster#
Verify the AWS CLI is available to obtain data from the AWS S3 Bucket#
Check to see if the aws command line interface (CLI) is installed
which aws
If it is installed, skip to the next step.
If it is not available please follow these instructions to install it.
See also
https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
cd /shared
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
Verify you can run the aws command#
aws --help
If not, you may need to log out and log back in.
Note
If you do not have credentials, skip this step. The data is on a public bucket, so you do not need credentials.
Set up your credentials for using s3 copy (you can skip this if you do not have credentials)
aws configure
Copy Input Data from S3 Bucket to lustre filesystem#
Verify that the /fsx directory exists; this is a Lustre file system, where the I/O is fastest.
ls /fsx
If you are unable to use the Lustre file system, the data can instead be installed on the /shared volume, provided you have resized that volume to be large enough to hold the input and output data.
Install the pcluster-cmaq scripts using the following commands:
cd /shared
git clone -b main https://github.com/CMASCenter/pcluster-cmaq.git pcluster-cmaq
Use the S3 script to copy the CONUS input data from the CMAS s3 bucket#
Data will be saved to the /fsx file system
/shared/pcluster-cmaq/s3_scripts/s3_copy_nosign_conus_cmas_opendata_to_fsx.csh
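The copy script above essentially wraps an unauthenticated recursive copy from the public bucket. A minimal sketch of that kind of command, using the bucket and destination paths that appear elsewhere in this chapter:
aws s3 cp --no-sign-request --recursive \
  s3://cmas-cmaq-conus2-benchmark/data/CMAQ_Modeling_Platform_2016/CONUS/12US2 \
  /fsx/data/CMAQ_Modeling_Platform_2016/CONUS/12US2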
Check that the resulting directory structure matches the run script.
Note
The CONUS 12US2 input data requires 44 GB of disk space
(If you use the YAML file to import the data to the Lustre file system rather than copying it, you avoid using this space.)
cd /fsx/data/CMAQ_Modeling_Platform_2016/CONUS/12US2/
du -sh
output:
44G .
The CMAQ ParallelCluster is configured to have 1.2 terabytes of space on the /fsx file system (the minimum size allowed for a Lustre /fsx volume), to allow the output of multiple runs to be stored.
For ParallelCluster: Import the Input data from a public S3 Bucket#
A second method is to import the data onto the Lustre file system by specifying the S3 bucket location in the YAML file, rather than using the aws s3 copy commands above.
See also
Example available in c5n-18xlarge.ebs_shared.fsx_import.yaml
cd /shared/pcluster-cmaq/
vi c5n-18xlarge.ebs_shared.fsx_import.yaml
Section of the YAML file that specifies the name of the S3 bucket:
- MountDir: /fsx
Name: name2
StorageType: FsxLustre
FsxLustreSettings:
StorageCapacity: 1200
ImportPath: s3://cmas-cmaq-conus2-benchmark/data/CMAQ_Modeling_Platform_2016/CONUS <<< specify name of S3 bucket
This requires that the specified S3 bucket is publicly available.
Install CMAQ software and libraries on ParallelCluster#
Login to updated cluster#
Note
Replace your-key.pem with the name of your key pair.
pcluster ssh -v -Y -i ~/your-key.pem --cluster-name cmaq
Change the shell to tcsh#
Note
This command depends on what OS you have installed on the ParallelCluster
sudo usermod -s /bin/tcsh ubuntu
or
sudo usermod -s /bin/tcsh centos
Log out and log back in to have the tcsh shell be active
exit
pcluster ssh -v -Y -i ~/your-key.pem --cluster-name cmaq
Check that the tcsh shell is the default#
echo $SHELL
The following instructions assume that you will be installing the software to a /shared/build directory
mkdir /shared/build
Install the pcluster-cmaq git repo to the /shared directory
cd /shared
Use a configuration file from the github repo that was cloned to your local machine#
git clone -b main https://github.com/CMASCenter/pcluster-cmaq.git pcluster-cmaq
cd pcluster-cmaq
Check to make sure the elastic network adapter (ENA) is enabled#
modinfo ena
lspci
Check what modules are available on the cluster#
module avail
Load the openmpi module#
module load openmpi/4.1.1
Load the Libfabric module#
module load libfabric-aws/1.13.2amzn1.0
Verify the gcc compiler version is greater than 8.0#
gcc --version
Output:
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Change directories to install and build the libraries and CMAQ#
cd /shared/pcluster-cmaq
Note: the software build process for CMAQ integration and continuous deployment needs improvement. Currently the Unidata UCAR netcdf-c download page is broken, and the location where the source code can be obtained may need to be updated from their website to the netCDF git repository. For this reason, this tutorial provides a snapshot image that was compiled on a c5n.xlarge head node and runs on the c5n.18xlarge compute node. A different snapshot image would need to be created to compile and run CMAQ on a c6gn.16xlarge Arm-based AWS Graviton2 processor.
An alternative is to keep a copy of the source code for netCDF-C and netCDF-Fortran and all of the other underlying libraries in an S3 bucket, and to use custom bootstrap actions to build the software as the ParallelCluster is provisioned.
The following link provides instructions on how to create a custom bootstrap action to pre-load software from an S3 bucket to the ParallelCluster at the time that the cluster is created.
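As a rough illustration only, such a bootstrap (post-install) script might look like the following; the bucket name and staging path are hypothetical, and the build scripts referenced are the same ones used later in this chapter:
#!/bin/bash
# Hypothetical post-install sketch: stage source tarballs and build scripts
# from your own S3 bucket, then build the libraries as the node is provisioned.
set -e
aws s3 cp --recursive s3://your-build-bucket/cmaq-src /shared/build-src   # hypothetical bucket
cd /shared/build-src
./gcc_netcdf_cluster.csh   # build the netCDF C and Fortran libraries
./gcc_ioapi_cluster.csh    # build the I/O API library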
Build netcdf C and netcdf F libraries - these scripts work for the gcc 8+ compiler#
Note: if this script fails, it is typically because a new version of netCDF-C or netCDF-Fortran has been released, so the old version is no longer available, or because the name or location of the download file has changed.
./gcc_netcdf_cluster.csh
A .cshrc script that sets LD_LIBRARY_PATH was copied to your home directory. Enter the shell again and check the environment variables that were set#
cat ~/.cshrc
If the .cshrc was not created use the following command to create it#
cp dot.cshrc.pcluster ~/.cshrc
Execute the shell to activate it#
csh
env
Verify that you see the following setting#
Output:
LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/shared/build/netcdf/lib:/shared/build/netcdf/lib
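For reference, the relevant line of the generated .cshrc looks roughly like the following; this is a sketch, assuming the /shared/build install prefix used by the build scripts, and the file copied by the script is authoritative:
# ~/.cshrc (sketch): make the OpenMPI and netCDF libraries visible at run time
setenv LD_LIBRARY_PATH /opt/amazon/openmpi/lib64:/shared/build/netcdf/lib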
Build I/O API library#
./gcc_ioapi_cluster.csh
Build CMAQ#
./gcc_cmaq_pcluster.csh
Check to confirm that the cmaq executable has been built
ls /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/BLD_CCTM_v533_gcc/*.exe
Run CMAQ#
Verify that you have an updated set of run scripts from the pcluster-cmaq repo#
To ensure you have the correct directory specified
cd /shared/pcluster-cmaq/run_scripts/cmaq533/
ls -lrt run*pcluster*
Compare with
ls -lrt /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/run*pcluster*
If they are not identical, then copy the set from the repo
cp /shared/pcluster-cmaq/run_scripts/cmaq533/run*pcluster* /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/
Verify that the input data is imported to /fsx from the S3 Bucket#
cd /fsx/12US2
You need to make this directory and then link it to the path created when the data is copied from the S3 Bucket.
This makes the paths consistent between the two methods of obtaining the input data.
mkdir -p /fsx/data/CONUS
cd /fsx/data/CONUS
ln -s /fsx/12US2 .
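A quick check that the link points where the run scripts expect (the second line is illustrative output):
ls -l /fsx/data/CONUS
# 12US2 -> /fsx/12US2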
Create the output directory#
mkdir -p /fsx/data/output
Run the CONUS Domain on 180 pes#
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/
sbatch run_cctm_2016_12US2.180pe.5x36.pcluster.csh
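For reference, the Slurm directives at the top of a 180 pe (5 x 36) run script look roughly like the following; this is a sketch, and the script in the repo is authoritative:
#!/bin/csh -f
#SBATCH --nodes=5                 # 5 x c5n.18xlarge compute nodes
#SBATCH --ntasks-per-node=36      # 36 MPI ranks per node (no hyperthreads)
#SBATCH -J CMAQ                   # job name shown by squeue (assumed here)
# Inside the script, the domain decomposition must match: NPCOL x NPROW = 10 x 18 = 180 PEs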
Note: it will take about 3-5 minutes for the compute nodes to start up. This is reflected in a status (ST) of CF (configuring).
Check the status in the queue#
squeue -u ubuntu
Output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 queue1 CMAQ ubuntu CF 3:00 5 queue1-dy-computeresource1-[1-5]
After 5 minutes the status will change once the compute nodes have been created and the job is running
squeue -u ubuntu
Output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute CMAQ ubuntu R 16:50 5 compute-dy-c5n18xlarge-[1-5]
The 180 pe job should take 60 minutes to run (30 minutes per day)
Check on the status of the cluster using CloudWatch#
(optional)
<a href="https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=cmaq-us-east-1">Cloudwatch Dashboard</a>
<a href="https://aws.amazon.com/blogs/compute/monitoring-dashboard-for-aws-parallelcluster/">Monitoring Dashboard for ParallelCluster</a>
Check the timings while the job is still running using the following command#
grep 'Processing completed' CTM_LOG_001*
Output:
Processing completed... 8.8 seconds
Processing completed... 7.4 seconds
When the job has completed, use tail to view the timing from the log file.#
tail run_cctmv5.3.3_Bench_2016_12US2.10x18pe.2day.pcluster.log
Output:
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2015-12-22
End Day: 2015-12-23
Number of Simulation Days: 2
Domain Name: 12US2
Number of Grid Cells: 3409560 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 180
All times are in seconds.
Num Day Wall Time
01 2015-12-22 2481.55
02 2015-12-23 2225.34
Total Time = 4706.89
Avg. Time = 2353.44
Submit a request for a 288 pe job (8 x 36 pe), using 8 nodes instead of 5 nodes#
sbatch run_cctm_2016_12US2.288pe.8x36.pcluster.csh
Check on the status in the queue#
squeue -u ubuntu
Note: it takes about 5 minutes for the compute nodes to be initialized. Once the job is running, the status (ST) will change from CF (configuring) to R (running).
Output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6 queue1 CMAQ ubuntu R 24:57 8 queue1-dy-computeresource1-[1-8]
Check the status of the run#
tail CTM_LOG_025.v533_gcc_2016_CONUS_16x18pe_20151222
Check whether the scheduler thinks there are cpus or vcpus#
sinfo -lN
Output:
Wed Jan 05 19:34:05 2022
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
queue1-dy-computeresource1-1 1 queue1* mixed 72 72:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-2 1 queue1* mixed 72 72:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-3 1 queue1* mixed 72 72:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-4 1 queue1* mixed 72 72:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-5 1 queue1* mixed 72 72:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-6 1 queue1* mixed 72 72:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-7 1 queue1* mixed 72 72:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-8 1 queue1* mixed 72 72:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-9 1 queue1* idle~ 72 72:1:1 1 0 1 dynamic, Scheduler health che
queue1-dy-computeresource1-10 1 queue1* idle~ 72 72:1:1 1 0 1 dynamic, Scheduler health che
Note: on a c5n.18xlarge, the number of virtual cpus is 72.
If the YAML contains the Compute Resources Setting of DisableSimultaneousMultithreading: false, then all 72 vcpus will be used
If DisableSimultaneousMultithreading: true, then the number of cpus is 36 and there are no virtual cpus.
Edit the run script to use#
#SBATCH --exclusive
Edit the yaml file to use DisableSimultaneousMultithreading: true#
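One way to make this change and push it to the running cluster, sketched under the assumption that your YAML file is the c5n-18xlarge.yaml used earlier and currently has the setting set to false (the compute fleet must be stopped before an update, as shown above):
# Flip the multithreading setting in the cluster configuration
sed -i 's/DisableSimultaneousMultithreading: false/DisableSimultaneousMultithreading: true/' c5n-18xlarge.yaml
# Stop the compute fleet, apply the update, then restart the fleet
pcluster update-compute-fleet --region us-east-1 --cluster-name cmaq --status STOP_REQUESTED
pcluster update-cluster --region us-east-1 --cluster-name cmaq --cluster-configuration c5n-18xlarge.yaml
pcluster update-compute-fleet --region us-east-1 --cluster-name cmaq --status START_REQUESTED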
Confirm that there are only 36 cpus available to the slurm scheduler#
sinfo -lN
Output:
Wed Jan 05 20:54:01 2022
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
queue1-dy-computeresource1-1 1 queue1* idle~ 36 36:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-2 1 queue1* idle~ 36 36:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-3 1 queue1* idle~ 36 36:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-4 1 queue1* idle~ 36 36:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-5 1 queue1* idle~ 36 36:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-6 1 queue1* idle~ 36 36:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-7 1 queue1* idle~ 36 36:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-8 1 queue1* idle~ 36 36:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-9 1 queue1* idle~ 36 36:1:1 1 0 1 dynamic, none
queue1-dy-computeresource1-10 1 queue1* idle~ 36 36:1:1 1 0 1 dynamic, none
Re-run the CMAQ CONUS Case#
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/
Submit a request for a 288 pe job (8 x 36 pe), i.e., 8 nodes, with full output#
sbatch run_cctm_2016_12US2.288pe.full.pcluster.csh
squeue -u ubuntu
Output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7 queue1 CMAQ ubuntu CF 3:06 8 queue1-dy-computeresource1-[1-8]
Note: it takes about 5 minutes for the compute nodes to be initialized. Once the job is running, the status (ST) will change from CF (configuring) to R (running).
squeue -u ubuntu
Output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7 queue1 CMAQ ubuntu R 24:57 8 queue1-dy-computeresource1-[1-8]
Check the status of the run#
tail CTM_LOG_025.v533_gcc_2016_CONUS_16x18pe_full_20151222
Once you have submitted a few benchmark runs and they have completed successfully, proceed to the next chapter.
Post-process and QA#
Post-process CMAQ and Install R#
Post-processing the CMAQ run: instructions to install R and the packages needed for QA of differences in output between two CMAQ runs.
Scripts to run combine and post processing#
Build the POST processing routines#
Copy the bldit script from the repo, as it was corrected to use CMAQv533 rather than CMAQv532
cd /shared/build/openmpi_gcc/CMAQ_v533/POST/combine/scripts
cp /shared/pcluster-cmaq/run_scripts/bldit_combine.csh .
Run the bldit script for combine.
./bldit_combine.csh gcc |& tee ./bldit_combine.gcc.log
Copy the bldit script from the repo, as it was corrected to use CMAQv533 rather than CMAQv532
cd /shared/build/openmpi_gcc/CMAQ_v533/POST/calc_tmetric/scripts
cp /shared/pcluster-cmaq/run_scripts/bldit_calc_tmetric.csh .
Run the bldit script for calc_tmetric
./bldit_calc_tmetric.csh gcc |& tee ./bldit_calc_tmetric.gcc.log
Copy the bldit script from the repo
cd /shared/build/openmpi_gcc/CMAQ_v533/POST/hr2day/scripts
cp /shared/pcluster-cmaq/run_scripts/bldit_hr2day.csh .
Run the bldit script
./bldit_hr2day.csh gcc |& tee ./bldit_hr2day.gcc.log
Copy the bldit script from the repo and run
cd /shared/build/openmpi_gcc/CMAQ_v533/POST/bldoverlay/scripts
cp /shared/pcluster-cmaq/run_scripts/bldit_bldoverlay.csh .
./bldit_bldoverlay.csh gcc |& tee ./bldit_bldoverlay.gcc.log
Scripts to post-process CMAQ output#
Instructions on how to Post-process CMAQ using the utilities under the POST directory
Note
The post-processing analysis is run on the head node.
Verify that the compute nodes are no longer running if you have completed all of the benchmark runs
squeue
You should see that no jobs are running.
Show compute nodes
scontrol show nodes
Edit, Build and Run the POST processing routines#
setenv DIR /shared/build/openmpi_gcc/CMAQ_v533/
cd $DIR/POST/combine/scripts
sed -i 's/v532/v533/g' bldit_combine.csh
./bldit_combine.csh gcc |& tee ./bldit_combine.gcc.log
cp run_combine.csh run_combine_conus.csh
sed -i 's/v532/v533/g' run_combine_conus.csh
sed -i 's/Bench_2016_12SE1/2016_CONUS_16x18pe/g' run_combine_conus.csh
sed -i 's/intel/gcc/g' run_combine_conus.csh
sed -i 's/2016-07-01/2015-12-22/g' run_combine_conus.csh
sed -i 's/2016-07-14/2015-12-23/g' run_combine_conus.csh
setenv CMAQ_DATA /fsx/data
./run_combine_conus.csh
cd $DIR/POST/calc_tmetric/scripts
sed -i 's/v532/v533/g' bldit_calc_tmetric.csh
./bldit_calc_tmetric.csh gcc |& tee ./bldit_calc_tmetric.gcc.log
cp run_calc_tmetric.csh run_calc_tmetric_conus.csh
sed -i 's/v532/v533/g' run_calc_tmetric_conus.csh
sed -i 's/Bench_2016_12SE1/2016_CONUS_16x18pe/g' run_calc_tmetric_conus.csh
sed -i 's/intel/gcc/g' run_calc_tmetric_conus.csh
sed -i 's/201607/201512/g' run_calc_tmetric_conus.csh
setenv CMAQ_DATA /fsx/data
./run_calc_tmetric_conus.csh
cd $DIR/POST/hr2day/scripts
sed -i 's/v532/v533/g' bldit_hr2day.csh
./bldit_hr2day.csh gcc |& tee ./bldit_hr2day.gcc.log
cp run_hr2day.csh run_hr2day_conus.csh
sed -i 's/v532/v533/g' run_hr2day_conus.csh
sed -i 's/Bench_2016_12SE1/2016_CONUS_16x18pe/g' run_hr2day_conus.csh
sed -i 's/intel/gcc/g' run_hr2day_conus.csh
sed -i 's/2016182/2015356/g' run_hr2day_conus.csh
sed -i 's/2016195/2015357/g' run_hr2day_conus.csh
setenv CMAQ_DATA /fsx/data
./run_hr2day_conus.csh
cd $DIR/POST/bldoverlay/scripts
sed -i 's/v532/v533/g' bldit_bldoverlay.csh
./bldit_bldoverlay.csh gcc |& tee ./bldit_bldoverlay.gcc.log
cp run_bldoverlay.csh run_bldoverlay_conus.csh
sed -i 's/v532/v533/g' run_bldoverlay_conus.csh
sed -i 's/Bench_2016_12SE1/2016_CONUS_16x18pe/g' run_bldoverlay_conus.csh
sed -i 's/intel/gcc/g' run_bldoverlay_conus.csh
sed -i 's/2016-07-01/2015-12-22/g' run_bldoverlay_conus.csh
sed -i 's/2016-07-02/2015-12-23/g' run_bldoverlay_conus.csh
setenv CMAQ_DATA /fsx/data
./run_bldoverlay_conus.csh
Install R, Rscripts and Packages#
First check to see if R is already installed.
R --version
If not, instructions for installing R on Ubuntu 20.04 are available in the link below.
See also
sudo apt install build-essential
See also
Install geospatial dependencies
be sure to have an updated system
sudo apt-get update && sudo apt-get upgrade -y
install PROJ
sudo apt-get install libproj-dev proj-data proj-bin unzip -y
optionally, install (selected) datum grid files
sudo apt-get install proj-data
install GEOS
sudo apt-get install libgeos-dev -y
install GDAL
sudo apt-get install libgdal-dev python3-gdal gdal-bin -y
install PDAL (optional)
sudo apt-get install libpdal-dev pdal libpdal-plugin-python -y
recommended to give Python3 precedence over Python2 (which is end-of-life since 2019)
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 1
Install further compilation dependencies (Ubuntu 20.04)
sudo apt-get install \
build-essential \
flex make bison gcc libgcc1 g++ ccache \
python3 python3-dev \
python3-opengl python3-wxgtk4.0 \
python3-dateutil libgsl-dev python3-numpy \
wx3.0-headers wx-common libwxgtk3.0-gtk3-dev \
libwxbase3.0-dev \
libncurses5-dev \
libbz2-dev \
zlib1g-dev gettext \
libtiff5-dev libpnglite-dev \
libcairo2 libcairo2-dev \
sqlite3 libsqlite3-dev \
libpq-dev \
libreadline6-dev libfreetype6-dev \
libfftw3-3 libfftw3-dev \
libboost-thread-dev libboost-program-options-dev libpdal-dev\
subversion libzstd-dev \
checkinstall \
libglu1-mesa-dev libxmu-dev \
ghostscript wget -y
For NVIZ on Ubuntu 20.04:
sudo apt-get install \
ffmpeg libavutil-dev ffmpeg2theora \
libffmpegthumbnailer-dev \
libavcodec-dev \
libxmu-dev \
libavformat-dev libswscale-dev
The ncdf4 package requires the netCDF library to be version 4 or above AND built with HDF5 support (i.e., the netCDF library must be compiled with the --enable-netcdf-4 flag). Building netCDF with HDF5 support requires curl.
sudo apt-get install curl
sudo apt-get install libcurl4-openssl-dev
cd /shared/pcluster-cmaq
Install libraries with hdf5 support
Load modules
module load openmpi/4.1.1
module load libfabric-aws/1.13.2amzn1.0
./gcc_install_hdf5.pcluster.csh
Install ncdf4 package from source:
cd /shared/pcluster-cmaq/qa_scripts/R_packages
sudo R CMD INSTALL ncdf4_1.13.tar.gz --configure-args="--with-nc-config=/shared/build-hdf5/install/bin/nc-config"
Install packages used in the R scripts
sudo -i R
install.packages("rgdal")
install.packages("M3")
install.packages("fields")
install.packages("ggplot2")
install.packages("patchwork")
To view the plots, install imagemagick
sudo apt-get install imagemagick
Install X11
sudo apt install x11-apps
Enable X11 forwarding
sudo vi /etc/ssh/sshd_config
add line
X11Forwarding yes
Verify that it was added
sudo cat /etc/ssh/sshd_config | grep -i X11Forwarding
Restart ssh
sudo service ssh restart
Exit the cluster
exit
Re-login to the cluster
pcluster ssh -v -Y -i ~/your-key.pem --cluster-name cmaq
Test display
display xclock
See also
Note: the examples appear to use the older config (CLI v2) format, which would need to be converted to the YAML format to try this out.
The bug report says that you can use a custom post-installation script to re-enable X11 forwarding.
See also
QA CMAQ#
Quality Assurance: Comparing the output of two CMAQ runs.
Quality Assurance#
Instructions on how to verify a successful CMAQ run on ParallelCluster.
Run m3diff to compare the output data for two runs that have different values for NPCOL#
cd /fsx/data/output
ls */*ACONC*
setenv AFILE output_CCTM_v533_gcc_2016_CONUS_10x18pe_full/CCTM_ACONC_v533_gcc_2016_CONUS_10x18pe_full_20151222.nc
setenv BFILE output_CCTM_v533_gcc_2016_CONUS_16x18pe_full/CCTM_ACONC_v533_gcc_2016_CONUS_16x18pe_full_20151222.nc
m3diff
hit return several times to accept the default options
grep A:B REPORT
Should see all zeros.
CMAQ was recompiled using the -march=native compiler option for the gcc compiler, but differences in the answers are still seen. The answers are the same (the differences are all zeros) only if the domain decomposition uses the same NPCOL; here NPCOL differs (10 vs 16).
This behavior is different from what was observed when removing the -march=native compiler option for gcc on the AMD Cycle Cloud HBv3 processor. On Cycle Cloud, if CMAQ is compiled with -march=native removed from the compiler options, then the answers match even if NPCOL differs.
NPCOL = 10; @ NPROW = 18
NPCOL = 16; @ NPROW = 18
grep A:B REPORT
output
A:B 4.54485E-07@(316, 27, 1) -3.09199E-07@(318, 25, 1) 1.42188E-11 2.71295E-09
A:B 4.73112E-07@(274,169, 1) -2.36556E-07@(200,113, 1) 3.53046E-11 3.63506E-09
A:B 7.37607E-07@(226,151, 1) -2.98955E-07@(274,170, 1) 3.68974E-11 5.29013E-09
A:B 3.15718E-07@(227,150, 1) -2.07219E-07@(273,170, 1) 2.52149E-11 3.60005E-09
A:B 2.65893E-07@(299,154, 1) -2.90573E-07@(201,117, 1) 1.78237E-12 4.15726E-09
A:B 3.11527E-07@(300,156, 1) -7.43195E-07@(202,118, 1) -9.04127E-12 6.38413E-09
A:B 4.59142E-07@(306,160, 1) -7.46921E-07@(203,119, 1) -2.57731E-11 8.06486E-09
A:B 5.25266E-07@(316,189, 1) -5.90459E-07@(291,151, 1) -2.67232E-11 9.36312E-09
A:B 5.31785E-07@(294,156, 1) -6.33299E-07@(339,201, 1) 3.01644E-11 1.12862E-08
A:B 1.01421E-06@(297,168, 1) -5.08502E-07@(317,190, 1) 9.97206E-11 1.35965E-08
A:B 1.28523E-06@(297,168, 1) -2.96347E-06@(295,160, 1) 1.57728E-10 1.88143E-08
A:B 1.69873E-06@(298,169, 1) -6.47269E-07@(343,205, 1) 1.99673E-10 1.96824E-08
A:B 2.10665E-06@(298,170, 1) -8.53091E-07@(290,133, 1) 2.75009E-10 2.38824E-08
A:B 2.77534E-06@(298,166, 1) -1.38395E-06@(339,201, 1) 4.32676E-10 3.19499E-08
A:B 4.05498E-06@(298,166, 1) -2.29478E-06@(292,134, 1) 5.94668E-10 4.56470E-08
A:B 1.64844E-06@(380,195, 1) -1.24970E-05@(312,119, 1) 2.99392E-10 6.27748E-08
A:B 2.40747E-06@(350,207, 1) -2.38372E-06@(313,120, 1) -1.23841E-11 4.06153E-08
A:B 2.54810E-06@(353,207, 1) -1.68476E-06@(258,179, 1) 4.69896E-10 4.00601E-08
A:B 2.92342E-06@(259,180, 1) -1.84122E-06@(258,180, 1) 3.00556E-10 3.75263E-08
A:B 4.37256E-06@(259,180, 1) -1.51433E-06@(258,180, 1) 3.44610E-10 4.03537E-08
A:B 5.51227E-06@(313,160, 1) -1.60793E-06@(312,160, 1) 6.49188E-10 4.60905E-08
A:B 5.58607E-06@(259,182, 1) -6.47921E-06@(278,186, 1) 3.40245E-11 4.89799E-08
A:B 3.61912E-06@(259,183, 1) -4.28502E-06@(278,187, 1) 2.10923E-10 4.86613E-08
A:B 2.02795E-06@(278,185, 1) -3.63495E-06@(278,187, 1) 5.26566E-10 5.32271E-08
A:B 1.25729E-07@(225,183, 1) -8.38190E-08@(200,114, 1) 2.04043E-12 7.34096E-10
A:B 9.66247E-08@(225,151, 1) -4.09782E-07@(225,182, 1) -6.33767E-12 1.73157E-09
A:B 2.10712E-07@(225,151, 1) -2.71946E-07@(200,114, 1) -5.41618E-12 1.65727E-09
A:B 5.45755E-07@(225,182, 1) -1.04494E-06@(200,115, 1) -1.47753E-11 4.57864E-09
A:B 4.30271E-07@(200,114, 1) -7.39470E-07@(200,116, 1) -3.24581E-11 5.33182E-09
A:B 7.71135E-07@(225,181, 1) -7.92556E-07@(201,117, 1) -2.74377E-11 6.31589E-09
A:B 6.33299E-07@(225,182, 1) -6.53090E-07@(202,118, 1) -2.86715E-11 4.42746E-09
A:B 6.25849E-07@(225,182, 1) -2.21189E-07@(225,184, 1) -5.32567E-12 2.66906E-09
A:B 3.64147E-07@(306,158, 1) -3.12924E-07@(175, 2, 1) 3.15538E-12 2.74893E-09
Compare CMAQv533 run with -march=native compiler flag removed.
more REPORT.6x12pe_vs_9x12pe
FILE A: AFILE (output_CCTM_v533_gcc_2016_CONUS_6x12pe/CCTM_ACONC_v533_gcc_2016_CONUS_6x12pe_20151222.nc)
FILE B: BFILE (output_CCTM_v533_gcc_2016_CONUS_9x12pe/CCTM_ACONC_v533_gcc_2016_CONUS_9x12pe_20151222.nc)
-----------------------------------------------------------
Date and time 2015356:000000 (0:00:00 Dec. 22, 2015)
A:AFILE/NO2 vs B:BFILE/NO2 vs (A - B)
MAX @( C, R, L) Min @( C, R, L) Mean Sigma
A 5.19842E-02@(127, 62, 1) 1.56425E-05@(258,239, 1) 2.27752E-03 3.47514E-03
B 5.19842E-02@(127, 62, 1) 1.56425E-05@(258,239, 1) 2.27752E-03 3.47514E-03
A:B 2.27243E-07@(264,163, 1) -5.42961E-07@(264,165, 1) 9.77191E-12 2.54661E-09
Date and time 2015356:010000 (1:00:00 Dec. 22, 2015)
A:AFILE/NO2 vs B:BFILE/NO2 vs (A - B)
MAX @( C, R, L) Min @( C, R, L) Mean Sigma
A 6.55882E-02@(128, 62, 1) 1.29276E-05@(260,245, 1) 2.56435E-03 4.35617E-03
B 6.55882E-02@(128, 62, 1) 1.29276E-05@(260,245, 1) 2.56435E-03 4.35617E-03
A:B 2.76603E-07@(197,102, 1) -2.45869E-07@(264,163, 1) 6.01613E-12 1.72038E-09
Date and time 2015356:020000 (2:00:00 Dec. 22, 2015)
A:AFILE/NO2 vs B:BFILE/NO2 vs (A - B)
MAX @( C, R, L) Min @( C, R, L) Mean Sigma
A 6.86494E-02@(128, 62, 1) 1.03682E-05@(262,243, 1) 2.62483E-03 4.58060E-03
B 6.86494E-02@(128, 62, 1) 1.03682E-05@(262,243, 1) 2.62483E-03 4.58060E-03
A:B 3.27826E-07@(197,102, 1) -3.79980E-07@(264,157, 1) 7.99431E-12 2.56835E-09
Date and time 2015356:030000 (3:00:00 Dec. 22, 2015)
A:AFILE/NO2 vs B:BFILE/NO2 vs (A - B)
MAX @( C, R, L) Min @( C, R, L) Mean Sigma
A 6.58664E-02@( 48, 83, 1) 8.24041E-06@(265,241, 1) 2.57739E-03 4.54646E-03
B 6.58664E-02@( 48, 83, 1) 8.24041E-06@(265,241, 1) 2.57739E-03 4.54646E-03
A:B 5.47618E-07@(264,156, 1) -3.96743E-07@(264,160, 1) 9.99427E-12 3.22602E-09
Reconfirmed that, with the -march=native flag removed, the answers still match if NPCOL is the same.
more REPORT_6x12pe_6x18pe
FILE A: AFILE (output_CCTM_v533_gcc_2016_CONUS_6x12pe/CCTM_ACONC_v533_gcc_2016_CONUS_6x12pe_20151222.nc)
FILE B: BFILE (output_CCTM_v533_gcc_2016_CONUS_6x18pe/CCTM_ACONC_v533_gcc_2016_CONUS_6x18pe_20151222.nc)
-----------------------------------------------------------
Date and time 2015356:000000 (0:00:00 Dec. 22, 2015)
A:AFILE/NO2 vs B:BFILE/NO2 vs (A - B)
MAX @( C, R, L) Min @( C, R, L) Mean Sigma
A 5.19842E-02@(127, 62, 1) 1.56425E-05@(258,239, 1) 2.27752E-03 3.47514E-03
B 5.19842E-02@(127, 62, 1) 1.56425E-05@(258,239, 1) 2.27752E-03 3.47514E-03
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
Date and time 2015356:010000 (1:00:00 Dec. 22, 2015)
A:AFILE/NO2 vs B:BFILE/NO2 vs (A - B)
MAX @( C, R, L) Min @( C, R, L) Mean Sigma
A 6.55882E-02@(128, 62, 1) 1.29276E-05@(260,245, 1) 2.56435E-03 4.35617E-03
B 6.55882E-02@(128, 62, 1) 1.29276E-05@(260,245, 1) 2.56435E-03 4.35617E-03
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
Date and time 2015356:020000 (2:00:00 Dec. 22, 2015)
A:AFILE/NO2 vs B:BFILE/NO2 vs (A - B)
MAX @( C, R, L) Min @( C, R, L) Mean Sigma
A 6.86494E-02@(128, 62, 1) 1.03682E-05@(262,243, 1) 2.62483E-03 4.58060E-03
B 6.86494E-02@(128, 62, 1) 1.03682E-05@(262,243, 1) 2.62483E-03 4.58060E-03
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
Date and time 2015356:030000 (3:00:00 Dec. 22, 2015)
A:AFILE/NO2 vs B:BFILE/NO2 vs (A - B)
MAX @( C, R, L) Min @( C, R, L) Mean Sigma
A 6.58664E-02@( 48, 83, 1) 8.24041E-06@(265,241, 1) 2.57739E-03 4.54646E-03
B 6.58664E-02@( 48, 83, 1) 8.24041E-06@(265,241, 1) 2.57739E-03 4.54646E-03
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
Use m3diff to compare two runs that have the same NPCOL#
setenv AFILE /fsx/data/output/output_CCTM_v533_gcc_2016_CONUS_16x16pe/CCTM_ACONC_v533_gcc_2016_CONUS_16x16pe_20151222.nc
setenv BFILE /fsx/data/output/output_CCTM_v533_gcc_2016_CONUS_16x18pe/CCTM_ACONC_v533_gcc_2016_CONUS_16x18pe_20151222.nc
m3diff
grep A:B REPORT
NPCOL = 16; @ NPROW = 16
NPCOL = 16; @ NPROW = 18
NPCOL was the same for both runs
Resulted in zero differences in the output
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
A:B 0.00000E+00@( 1, 0, 0) 0.00000E+00@( 1, 0, 0) 0.00000E+00 0.00000E+00
Run an R script to create the box plots and spatial plots comparing the output of two runs#
Examine the script to create the box plots and spatial plots and edit to use the output that you have generated in your runs.
First check what output is available on your ParallelCluster
If your I/O directory is /fsx
ls -rlt /fsx/data/output/*/*ACONC*
If your I/O directory is /shared/data
ls -lrt /shared/data/output/*/*ACONC*
Then edit the script to use the output filenames available.
vi compare_EQUATES_benchmark_output_CMAS_pcluster.r
#Directory, file name, and label for first model simulation (sim1)
sim1.label <- "CMAQ 16x16pe"
sim1.dir <- "/fsx/data/output/output_CCTM_v533_gcc_2016_CONUS_16x16pe/"
sim1.file <- paste0(sim1.dir,"CCTM_ACONC_v533_gcc_2016_CONUS_16x16pe_20151222.nc")
#Directory, file name, and label for second model simulation (sim2)
sim2.label <- "CMAQ 16x18pe"
sim2.dir <- "/fsx/data/output/output_CCTM_v533_gcc_2016_CONUS_16x18pe"
sim2.file <- paste0(sim2.dir,"CCTM_ACONC_v533_gcc_2016_CONUS_16x18pe_20151222.nc")
Run the R script
cd /shared/pcluster-cmaq/qa_scripts
Rscript compare_EQUATES_benchmark_output_CMAS_pcluster.r
Note: your plots will be created based on the setting of the output directory in the script
An example set of scripts is available, but these instructions can be modified to use the output generated by the script above.
To view the PDF plots use the command:
cd /shared/pcluster-cmaq/qa_scripts/qa_plots
gio open O3_MAPS_CMAQ*.pdf
To convert the PDF to a jpeg image use the script convert.csh.
cd /shared/pcluster-cmaq/qa_scripts/qa_plots
First examine what the convert.csh script is doing
more convert.csh
output:
#!/bin/csh
foreach name (`ls *.pdf`)
set name2=`basename $name .pdf`
echo $name
echo $name2
pdftoppm -jpeg -r 600 $name $name2
end
Run the convert script.
./convert.csh
When NPCOL is fixed, we are seeing no difference in the answers.
Example comparison using: 6x6 compared to 6x9
cd /shared/pcluster-cmaq/docs/qa_plots/box_plots/6x6_vs_6x9/
Use display to view the plots
display O3_BOXPLOT_CMAQv533-GCC-6x6pe_vs_CMAQv533-GCC-6x9pe.jpeg
They are also displayed in the following plots:
Box Plot for ANO3J when NPCOL is identical
The box plot shows no difference between ACONC output for CMAQv5.3.3 runs using different PE configurations as long as NPCOL is fixed (this is true for all species that were plotted: AOTHRJ, CO, NH3, NO2, O3, OH, SO2).
Example of plots created when NPCOL is different between simulation 1 and simulation 2.
The box plot shows a difference between ACONC output for CMAQv5.3.3 runs using different PE configurations when NPCOL is different.
ANO3J
AOTHRJ
CO
NH3
NO2
O3
OH
SO2
Example of Spatial Plots for when NPCOL is different
Note: the differences are small, but they grow with time. There is one plot for each of the 24 hours. The plot that contains the most differences will be in the bottom right of the panel for each species. You will need to zoom in to see the differences, as most of the grid cells show no difference and are displayed as grey. For the NO2 plot, the largest differences are over the state of Pennsylvania on 12/22/2015 at hour 23:00, with a maximum difference magnitude of about +/- 4.0E-6.
cd ../spatial_plots/12x9_vs_8x9
display ANO3J_MAPS_CMAQv533-GCC-12x9pe_vs_CMAQv533-GCC-8x9pe-1.jpg
ANO3J
AOTHRJ
CO
NH3
NO2
O3
OH
SO2
Compare Timing of CMAQ Routines#
Compare the timing of CMAQ Routines for two different run configurations.
Parse timings from the log file#
Compare the timings for the CONUS ParallelCluster Runs#
Note
ParallelCluster Configurations can impact the model run times.
It is up to the user which model run configurations are used to run CMAQ on the ParallelCluster. The following configurations may impact the run time of the model.
Using different PE configurations, using DisableSimultaneousMultithreading: true in yaml file, using 36 cpus - no virtual cpus
NPCOL x NPROW, CPUs, SBATCH command (a quick sanity-check sketch follows this list)
[ ] 10x18, 180, #SBATCH --nodes=5, #SBATCH --ntasks-per-node=36
[ ] 16x16, 256, #SBATCH --nodes=8, #SBATCH --ntasks-per-node=32
[ ] 16x18, 288, #SBATCH --nodes=8, #SBATCH --ntasks-per-node=36
Using different compute nodes
[ ] c5n.18xlarge (72 virtual cpus, 36 cpus) - with Elastic Fabric Adapter
[ ] c5n.9xlarge (36 virtual cpus, 18 cpus) - with Elastic Fabric Adapter
[ ] c5n.4xlarge (16 virtual cpus, 4 cpus) - without Elastic Fabric Adapter
With and without SBATCH –exclusive option
With and without Elastic Fabric and Elastic Network Adapter turned on
With and without network placement turned on
Using different local storage options and copying versus importing data to lustre
[ ] input data imported from S3 bucket to lustre
[ ] input data copied from S3 bucket to lustre
[ ] input data copied from S3 bucket to an EBS volume
Using different yaml settings for slurm
[ ] DisableSimultaneousMultithreading: true
[ ] DisableSimultaneousMultithreading: false
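The sanity check referenced in the configuration list above: the total number of MPI ranks requested from Slurm must match the domain decomposition, i.e., NPCOL x NPROW = nodes x ntasks-per-node. A minimal bash sketch, using the 288 pe example values:
# NPCOL x NPROW must equal nodes x ntasks-per-node
NPCOL=16; NPROW=18; NODES=8; TASKS=36
if [ $((NPCOL * NPROW)) -eq $((NODES * TASKS)) ]; then
  echo "OK: $((NPCOL * NPROW)) PEs"
else
  echo "Mismatch: $((NPCOL * NPROW)) PEs requested vs $((NODES * TASKS)) Slurm tasks"
fi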
Edit the R script#
First check to see what log files are available:
ls -lrt /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/*.log
Modify the name of the log file to match what is available on your system.
cd /shared/pcluster-cmaq/qa_scripts
vi parse_timing_pcluster.r
Edit the following section of the script to specify the log file names available on your ParallelCluster
sens.dir <- '/shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/'
base.dir <- '/shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts/'
files <- dir(sens.dir, pattern ='run_cctmv5.3.3_Bench_2016_12US2.108.12x9pe.2day.pcluster.log' )
b.files <- dir(base.dir,pattern='run_cctmv5.3.3_Bench_2016_12US2.108.6x18pe.2day.pcluster.log')
#Compilers <- c('intel','gcc','pgi')
Compilers <- c('gcc')
# name of the base case timing. I am using the current master branch from the CMAQ_Dev repository.
# The project directory name is used for the sensitivity case.
base.name <- '12x9pe'
sens.name <- '6x18pe'
Run parse_timing.r script to examine timings of each science process in CMAQ#
Rscript parse_timing.r
Timing Plot Comparing GCC run on 16 x 8 pe versus 8 x 16 pe
Timing Plot Comparing GCC run on 8 x 8 pe versus 8 x 16 pe
Timing Plot Comparing GCC run on 9 x 8 pe versus 8 x 9 pe
Copy Output to S3 Bucket#
Copy output from ParallelCluster to an S3 Bucket
Copy Output Data and Run script logs to S3 Bucket#
Note
You need permissions to copy to a S3 Bucket.
See also
Be sure you enter your access credentials on the parallel cluster by running:
aws configure
Currently, the bucket listed below has ACL turned off
See also
See example of sharing bucket across accounts.
Copy scripts and logs to /fsx#
The CTM_LOG files don't contain any information about the compute nodes that the jobs were run on. It is important to keep a record of the NPCOL and NPROW settings and of the number of nodes and tasks used, as specified in the run script (for example, #SBATCH --nodes=16 and #SBATCH --ntasks-per-node=8). It is also important to know which volume was used to read and write the input and output data, so it is recommended to save a copy of the standard out and error logs and a copy of the run scripts to the OUTPUT directory for each benchmark.
cd /shared/build/openmpi_gcc/CMAQ_v533/CCTM/scripts
cp run*.log /fsx/data/output
cp run*.csh /fsx/data/output
Examine the output files#
Note
The following commands will vary depending on what APPL or domain decomposition was run
cd /fsx/data/output/output_CCTM_v533_gcc_2016_CONUS_16x18pe
ls -lht
output:
total 173G
drwxrwxr-x 2 ubuntu ubuntu 145K Jan 5 23:53 LOGS
-rw-rw-r-- 1 ubuntu ubuntu 3.2G Jan 5 23:53 CCTM_CGRID_v533_gcc_2016_CONUS_16x18pe_20151223.nc
-rw-rw-r-- 1 ubuntu ubuntu 2.2G Jan 5 23:52 CCTM_ACONC_v533_gcc_2016_CONUS_16x18pe_20151223.nc
-rw-rw-r-- 1 ubuntu ubuntu 78G Jan 5 23:52 CCTM_CONC_v533_gcc_2016_CONUS_16x18pe_20151223.nc
-rw-rw-r-- 1 ubuntu ubuntu 348M Jan 5 23:52 CCTM_APMDIAG_v533_gcc_2016_CONUS_16x18pe_20151223.nc
-rw-rw-r-- 1 ubuntu ubuntu 1.5G Jan 5 23:52 CCTM_WETDEP1_v533_gcc_2016_CONUS_16x18pe_20151223.nc
-rw-rw-r-- 1 ubuntu ubuntu 1.7G Jan 5 23:52 CCTM_DRYDEP_v533_gcc_2016_CONUS_16x18pe_20151223.nc
-rw-rw-r-- 1 ubuntu ubuntu 3.6K Jan 5 23:22 CCTM_v533_gcc_2016_CONUS_16x18pe_20151223.cfg
-rw-rw-r-- 1 ubuntu ubuntu 3.2G Jan 5 23:22 CCTM_CGRID_v533_gcc_2016_CONUS_16x18pe_20151222.nc
-rw-rw-r-- 1 ubuntu ubuntu 2.2G Jan 5 23:21 CCTM_ACONC_v533_gcc_2016_CONUS_16x18pe_20151222.nc
-rw-rw-r-- 1 ubuntu ubuntu 78G Jan 5 23:21 CCTM_CONC_v533_gcc_2016_CONUS_16x18pe_20151222.nc
-rw-rw-r-- 1 ubuntu ubuntu 348M Jan 5 23:21 CCTM_APMDIAG_v533_gcc_2016_CONUS_16x18pe_20151222.nc
-rw-rw-r-- 1 ubuntu ubuntu 1.5G Jan 5 23:21 CCTM_WETDEP1_v533_gcc_2016_CONUS_16x18pe_20151222.nc
-rw-rw-r-- 1 ubuntu ubuntu 1.7G Jan 5 23:21 CCTM_DRYDEP_v533_gcc_2016_CONUS_16x18pe_20151222.nc
-rw-rw-r-- 1 ubuntu ubuntu 3.6K Jan 5 22:49 CCTM_v533_gcc_2016_CONUS_16x18pe_20151222.cfg
Check disk space
du -sh
173G .
Copy the output to an S3 Bucket#
Examine the example script
cd /shared/pcluster-cmaq/s3_scripts
cat s3_upload.c5n.18xlarge.csh
output:
#!/bin/csh -f
# Script to upload output data to S3 bucket
# NOTE: a new bucket needs to be created to store each set of cluster runs
aws s3 mb s3://c5n-head-c5n.18xlarge-compute-conus-output
aws s3 cp --recursive /fsx/data/output/ s3://c5n-head-c5n.18xlarge-compute-conus-output
aws s3 cp --recursive /fsx/data/POST s3://c5n-head-c5n.18xlarge-compute-conus-output
If you do not have permissions to write to the s3 bucket, you may need to ask the administrator of your account to add S3 Bucket writing permissions.
Run the script to copy all of the CMAQ output and logs to the S3 bucket.
./s3_upload.c5n.18xlarge.csh
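To spot-check that the upload completed, list the bucket contents; this assumes the bucket name used in the script above and that your credentials allow listing:
aws s3 ls --recursive s3://c5n-head-c5n.18xlarge-compute-conus-output/ | head
aws s3 ls --summarize --human-readable --recursive s3://c5n-head-c5n.18xlarge-compute-conus-output/ | tail -2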
Logout and Delete ParallelCluster#
Logout and delete the ParallelCluster when you are done to avoid incurring costs.
Logout of cluster when you are done#
To avoid incurring costs for the Lustre file system and the head node, it is best to delete the cluster after you have copied the output data to the S3 bucket.
If you are logged into the Parallel Cluster then use the following command
exit
Delete Cluster#
Run the following command on your local computer.
pcluster delete-cluster --region=us-east-1 --cluster-name cmaq
Verify that the cluster was deleted#
pcluster describe-cluster --region=us-east-1 --cluster-name cmaq
Output:
"lastUpdatedTime": "2022-02-25T20:17:19.263Z",
"region": "us-east-1",
"clusterStatus": "DELETE_IN_PROGRESS"
Verify that you see the following output
pcluster describe-cluster --region=us-east-1 --cluster-name cmaq
Output:
pcluster describe-cluster --region=us-east-1 --cluster-name cmaq
{
"message": "Cluster 'cmaq' does not exist or belongs to an incompatible ParallelCluster major version."
}
Additional Resources#
For a tutorial that explains cloud terminology as well as how to obtain single EC2 instances for running GEOS-CHEM on a single node, please see the Beginner Tutorial provided by GEOS-Chem as well as the resources in this chapter.
FAQ#
Q. Can you update a cluster with a Snapshot ID, ie. update a cluster to use the /shared/build pre-installed software?
A. No. An existing cluster cannot be updated with a Snapshot ID; the solution is to delete the cluster and re-create it. See the error below:
pcluster update-cluster --region us-east-1 --cluster-name cmaq --cluster-configuration c5n-18xlarge.ebs_unencrypted.fsx_import.yaml
Output:
{
"message": "Update failure",
"updateValidationErrors": [
{
"parameter": "SharedStorage[ebs-shared].EbsSettings.SnapshotId",
"requestedValue": "snap-065979e115804972e",
"message": "Update actions are not currently supported for the 'SnapshotId' parameter. Remove the parameter 'SnapshotId'. If you need this change, please consider creating a new cluster instead of updating the existing one."
}
],
"changeSet": [
{
"parameter": "SharedStorage[ebs-shared].EbsSettings.SnapshotId",
"requestedValue": "snap-065979e115804972e"
}
]
}
Q. How do you figure out why a job isn’t successfully running in the slurm queue?
A. Check the logs available in the following file:
vi /var/log/parallelcluster/slurm_resume.log
Output:
2022-03-23 21:04:23,600 - [slurm_plugin.instance_manager:_launch_ec2_instances] - ERROR - Failed RunInstances request: 0c6422af-c300-4fe6-b942-2b7923f7b362
2022-03-23 21:04:23,600 - [slurm_plugin.instance_manager:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x3) ['queue1-dy-compute-resource-1-4', 'queue1-dy-compute-resource-1-5', 'queue1-dy-compute-resource-1-6']: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 1): We currently do not have sufficient c5n.18xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get c5n.18xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.
Q. How do I determine what node(s) the job is running on?
A. echo $SLURM_JOB_NODELIST
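If the job is still in the queue, you can also ask Slurm directly; the job ID below is just the example number from earlier in this tutorial:
squeue -u ubuntu -o "%.8i %.10P %.20j %.2t %.10M %.6D %N"   # %N prints the node list
scontrol show job 7 | grep -i nodelist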
Q. I see other tutorials that use a configure file instead of a yaml file to create the cluster. Can I use this instead?
A. No, you must convert the text-based config file to a YAML file to use it with the ParallelCluster CLI v3 used in this tutorial. An example of this type of tutorial is Fire Dynamics Simulation CFD workflow using AWS ParallelCluster, Elastic Fabric Adapter, Amazon FSx for Lustre and NICE DCV (https://aws.amazon.com/blogs/compute/fire-dynamics-simulation-cfd-workflow-using-aws-parallelcluster-elastic-fabric-adapter-amazon-fsx-for-lustre-and-nice-dcv/). You can try to use the v2 to v3 converter; see more: moving from v2 to v3.
Q. If I find an issue, or need help with this CMAQ ParallelCluster Tutorial what do I do?
A. Please file an issue using github.
Submit Github Issue for help with documentation
Please indicate the issue you are having, and include a link to the Read the Docs section that you are referring to. The tutorial documentation has an edit icon in the upper right corner of each page. You can click on it, and GitHub will fork the repo and allow you to edit the page. After you have made the edits, you can submit a pull request, and then include the link to the pull request in the GitHub issue.
Free Training#
Another workshop to learn the AWS CLI 3.0#
Youtube video#
Intro to AWS for HPC People - HPC Tech Shorts#
Intro to AWS for HPC People - Tech Short Foundations Level 1
Benchmarking#
Benchmarks optimized for HPC high memory
Help Resources for CMAQ#
Computing on the Cloud References#
AWS High Performance Computing (HPC) Lens for the AWS Well-Architected Framework#
AWS High Performance Computing (HPC) Lens for the AWS Well-Architected Framework
HPC on AWS - WRF (uses CfnCluster, an older version of ParallelCluster)#
WRF on Parallel Cluster#
A Scientist Guide to Cloud-HPC: Example with AWS ParallelCluster, Slurm, Spack, and WRF
Advancing Large Scale Weather and Climate Modeling Data in the Cloud#
AWS Well-Architected Framework#
Cost Comparison: on-premises and cloud#
WRF Performance on Google Cloud
Comparing on-premise and cloud costs for HPC: https://journal.fluidnumerics.com/comparing-on-premise-and-cloud-costs-for-high-performance-computing
Resources from AWS for diagnosing issues with running the Parallel Cluster#
Parallel Cluster FAQ (somewhat outdated..)
Tool to convert v2 config files to v3 yaml files for Parallel Cluster
Instructions for creating a fault-tolerant ParallelCluster using the Lustre file system
Issues#
For AWS Parallel Cluster you can create a GitHub issue for feedback or issues: Github Issues There is also an active community driven Q&A site that may be helpful: AWS re:Post a community-driven Q&A site
Tips to managing the parallel cluster#
The head node can be stopped from the AWS Console after the compute nodes of the cluster are stopped, as long as it is restarted before issuing the command to restart the cluster.
The pcluster Slurm queue system creates and deletes the compute nodes, which reduces manual cleanup for the cluster.
The compute nodes are terminated after they have been idle for a period of time. The YAML setting used for this is: SlurmSettings: ScaledownIdletime: 5
The default idle time is 10 minutes and can be reduced by specifying a shorter idle time in the YAML file. It is important to verify that the compute nodes are deleted after a job is finished, to avoid incurring unexpected costs (a quick check is sketched after these tips).
Copy or back up the outputs and logs to an S3 bucket for follow-up analysis.
After copying the output and log files to the S3 bucket, the cluster can be deleted.
Once the cluster is deleted, all of the volumes, the head node, and the compute nodes will be terminated, and costs will only be incurred by the S3 bucket storage.
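A couple of quick ways to confirm that idle compute nodes have actually scaled down, assuming the cluster name and region used throughout this tutorial:
# From inside the cluster: dynamic nodes should return to the idle~ (powered down) state
sinfo -lN
# From your local machine: check the compute fleet status
pcluster describe-compute-fleet --region us-east-1 --cluster-name cmaq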
Instructions on how to create Parallel Cluster Amazon Machine Image (AMI) from the command line#
Tutorial How-to Create AMI from Command Line
We also need additional protections if we make these AMIs public.
ParallelCluster Update#
Not all settings in the YAML file can be updated.
It is important to know what the update policy is for each setting.
Example Update policy:
If this setting is changed, the update is not allowed. After changing this setting, the cluster can’t be updated. Either the change must be reverted or the cluster must be deleted (using pcluster delete-cluster), and then a new cluster created (using pcluster create-cluster) in the old cluster’s place.
see more information
Use Elastic Fabric Adapter/Elastic Network Adapter for better performance#
“In order to make the most of the available network bandwidth, you need to be using the latest Elastic Network Adapter (ENA) drivers (available in the latest Amazon Linux, Red Hat 7.6, and Ubuntu AMIs, and in the upstream Linux kernel) and you need to make use of multiple traffic flows. Flows within a Placement Group can reach 10 Gbps; the rest can reach 5 Gbps. When using multiple flows on the high-end instances, you can transfer 100 Gbps between EC2 instances in the same region (within or across AZs), S3 buckets, and AWS services such as Amazon Relational Database Service (RDS), Amazon ElastiCache, and Amazon EMR.”
The above was quoted from the following link:
Elastic Fabric Adapter for HPC systems
“EFA is currently available on c5n.18xlarge, c5n.9xlarge, c5n.metal, i3en.24xlarge, i3en.metal, inf1.24xlarge, m5dn.24xlarge, m5n.24xlarge, r5dn.24xlarge, r5n.24xlarge, p3dn.24xlarge, p4d, m6i.32xlarge, m6i.metal, c6i.32xlarge, c6i.metal, r6i.32xlarge, and r6i.metal instances.”
What are the differences between an EFA ENI and an ENA ENI?
“An ENA ENI provides traditional IP networking features necessary to support VPC networking. An EFA ENI provides all the functionality of an ENA ENI, plus hardware support for applications to communicate directly with the EFA ENI without involving the instance kernel (OS-bypass communication) using an extended programming interface. Due to the advanced capabilities of the EFA ENI, EFA ENIs can only be attached at launch or to stopped instances.”
Q: What are the pre-requisites to enabling EFA on an instance?
“EFA support can be enabled either at the launch of the instance or added to a stopped instance. EFA devices cannot be attached to a running instance.”
Elastic Fabric Adapter for Tightly Coupled Workloads
Quoted from the above link.
“An EFA can still handle IP traffic, but also supports an important access model commonly called OS bypass. This model allows the application (most commonly through some user-space middleware) access the network interface without having to get the operating system involved with each message. Doing so reduces overhead and allows the application to run more efficiently. Here’s what this looks like (source):”
“The MPI Implementation and libfabric layers of this cake play crucial roles:”
“MPI – Short for Message Passing Interface, MPI is a long-established communication protocol that is designed to support parallel programming. It provides functions that allow processes running on a tightly-coupled set of computers to communicate in a language-independent way.”
“libfabric – This library fits in between several different types of network fabric providers (including EFA) and higher-level libraries such as MPI. EFA supports the standard RDM (reliable datagram) and DGRM (unreliable datagram) endpoint types; to learn more, check out the libfabric Programmer’s Manual. EFA also supports a new protocol that we call Scalable Reliable Datagram; this protocol was designed to work within the AWS network and is implemented as part of our Nitro chip.”
“Working together, these two layers (and others that can be slotted in instead of MPI), allow you to bring your existing HPC code to AWS and run it with little or no change.
“You can use EFA today on c5n.18xlarge and p3dn.24xlarge instances in all AWS regions where those instances are available. The instances can use EFA to communicate within a VPC subnet, and the security group must have ingress and egress rules that allow all traffic within the security group to flow. Each instance can have a single EFA, which can be attached when an instance is started or while it is stopped.”
“You will also need the following software components:”
“EFA Kernel Module – The EFA Driver is in the Amazon GitHub repo; read Getting Started with EFA to learn how to create an EFA-enabled AMI for Amazon Linux, Amazon Linux 2, and other popular Linux distributions.”
“Libfabric Network Stack – You will need to use an AWS-custom version for now; again, the Getting Started document contains installation information. We are working to get our changes into the next release (1.8) of libfabric.”
Note: the ParallelCluster deployment takes care of setting this up for you.
VPC Management#
There is a limit on the number of VPCs allowed per account; the default limit is 5.
What is the difference between a private and a public VPC? (What setting is used in the YAML file, and why is one preferred over the other?)
Note, there is a default VPC, that is used to create EC2 instances, that should not be deleted.
Q1. is there a separate default VPC for each region?
Q2. Each time you run a configure cluster command, does the ParallelCluster create a new VPC?
Q3. Why don’t the VPC and subnet IDs get deleted when the ParallelClusters are deleted?
Deleting VPCs#
If pcluster configure created a new VPC, you can delete that VPC by deleting the AWS CloudFormation stack it created. The name will start with “parallelclusternetworking-” and contain the creation time in a “YYYYMMDDHHMMSS” format. You can list the stacks using the list-stacks command. The following instructions are available here:
Instructions for Cleaning Up VPCs
$ aws --region us-east-2 cloudformation list-stacks \
--stack-status-filter "CREATE_COMPLETE" \
--query "StackSummaries[].StackName" | \
grep -e "parallelclusternetworking-""parallelclusternetworking-pubpriv-20191029205804"
The stack can be deleted using the delete-stack command.
$ aws --region us-west-2 cloudformation delete-stack \
--stack-name parallelclusternetworking-pubpriv-20191029205804
Note: I can see why you wouldn’t want to delete the VPC, if you want to reuse the yaml file that contains the SubnetID that is tied to that VPC.
I was able to use the Amazon Website to find the SubnetID, and then identify the VPC that it is part of.
I currently have the following VPCs
| Name | VPC ID | State | IPv4 CIDR | IPv6 CIDR (Network border group) | IPv6 pool | DHCP options set | Main route table | Main network ACL | Tenancy | Default VPC | Owner ID |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ParallelClusterVPC-20211210200003 | vpc-0445c3fa089b004d8 | Available | 10.0.0.0/16 | – | – | dopt-eaeaf888 | rtb-048c503f3e6b9acc3 | acl-0fecfa7ff42e04ead | Default | No | xxxx |
| ParallelClusterVPC-20211021183813 | vpc-00e3f4e34aaf80f06 | Available | 10.0.0.0/16 | – | – | dopt-eaeaf888 | rtb-0a5b7ac9873486bcb | acl-0852d06b1170db68c | Default | No | xxxx |
| - | vpc-3cfc5759 | Available | 172.31.0.0/16 | – | – | dopt-eaeaf888 | rtb-99cd64fc | acl-bb9b39de | Default | Yes | 440858712842 |
| ParallelClusterVPC-20210419174552 | vpc-0ab948b66554c71ea | Available | 10.0.0.0/16 | – | – | dopt-eaeaf888 | rtb-03fd47f05eac5379f | acl-079fe1be7ff972858 | Default | No | xxxx |
| ParallelClusterVPC-20211021174405 | vpc-0f34a572da1515e49 | Available | 10.0.0.0/16 | – | – | dopt-eaeaf888 | rtb-0b6310d9ea70a699e | acl-01fa1529b65545e91 | Default | No | xxxx |
This is the subnet id that I am currently using in the yaml files: subnet-018cfea3edf3c4765
I currently have 11 subnet IDs - how many are no longer being used?
| Field | Value |
|---|---|
| Name | parallelcluster:public-subnet |
| Subnet ID | subnet-018cfea3edf3c4765 |
| State | Available |
| VPC | vpc-0445c3fa089b004d8 (ParallelClusterVPC-20211210200003) |
| IPv4 CIDR | 10.0.0.0/20 |
| IPv6 CIDR | – |
| Available IPv4 addresses | 4091 |
| Availability Zone | us-east-1a |
| Availability Zone ID | use1-az6 |
| Network border group | us-east-1 |
| Route table | rtb-034bcab9e4b8c4023 (parallelcluster:route-table-public) |
| Network ACL | acl-0fecfa7ff42e04ead |
| Default subnet | No |
| Auto-assign public IPv4 address | Yes |
| Auto-assign customer-owned IPv4 address | No |
| Customer-owned IPv4 pool | - |
| Auto-assign IPv6 address | No |
| Owner ID | xx |
Future Work#
AWS ParallelCluster
Create yaml and software install scripts for intel compiler
Benchmark 2 day case using intel compiler version of CMAQ and compare to GCC timings
Repeat the benchmark runs using c6gn.16xlarge compute nodes (Arm-based AWS Graviton2) and compare to Azure Cycle Cloud HBv3 compute nodes.
Create script for installing all software and R packages as a custom bootstrap as the ParallelCluster is created.
Create a method to automatically checkpoint and save a job prior to it being preempted when running on spot instances.
Set up an additional slurm queue that uses a smaller compute node to do the post-processing and learn how to submit the post processing jobs to this queue, rather than running them on the head node.
Install software using SPACK
Install netCDF-4 compressed version of I/O API Library and set up environment module to compile and run CMAQ for 2018_12US1 data that is nc4 compressed
Documentation
Create instructions on how to create a ParallelCluster using encrypted ebs volume and snapshot.
Contribute to this Tutorial#
The community is encouraged to contribute to this documentation. It is open source, created by the CMAS Center, under contract to EPA, for the benefit of the CMAS Community.
Contribute to Pcluster-cmaq Documentation#
Please take note of any issues and submit to Github Issue
Note
At the top of each page of the documentation, there is also a pencil icon that you can click. It will create a fork of the project on your GitHub account where you can make edits and then submit a pull request.
If you are able to create a pull request, please include the following in your issue:
pull request number
If you are not able to create a pull request, please include the following in your issue:
section number
description of the issue encountered
recommended fix, if available