Documentation for CMAQ on AWS ParallelCluster
Warning
This documentation is under continuous development
Overview
This document provides tutorials and information on using AWS ParallelCluster on Amazon Web Services (AWS). The tutorials are aimed at users with cloud computing experience who are already familiar with AWS. For those with no cloud computing experience, we recommend reviewing the Additional Resources listed in chapter 15 of this document.
Format of this documentation
This document provides three hands-on tutorials that are designed to be read in order. The Introductory Tutorial walks you through creating a demo ParallelCluster. You will learn how to set up your AWS Identity and Access Management (IAM) roles, configure and create a demo cluster, and exit and delete the cluster. The Intermediate Tutorial steps you through running a CMAQ test case on ParallelCluster using pre-loaded software and input data. The Advanced Tutorial explains how to scale the ParallelCluster for larger compute jobs and how to install CMAQ and its required libraries from scratch on the cloud. The remaining sections provide instructions on post-processing CMAQ output, comparing output and runtimes from multiple simulations, and copying output from ParallelCluster to an Amazon Simple Storage Service (S3) bucket.
Why might I need to use ParallelCluster?
AWS ParallelCluster can be configured as the equivalent of a High Performance Computing (HPC) environment: it supports job schedulers such as Slurm, multi-node runs of code compiled with Message Passing Interface (MPI), and reading and writing output on a high-performance, low-latency shared disk. The advantage of using the AWS ParallelCluster command line interface is that the compute nodes can be easily scaled up or down to match the compute requirements of a given simulation. In addition, the user can reduce costs by using Spot instances rather than On-Demand instances for the compute nodes. ParallelCluster also supports submitting multiple jobs to the job submission queue.
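As a rough illustration of how these choices appear in practice, the sketch below shows a minimal ParallelCluster 3 configuration fragment with a Slurm queue, Spot compute nodes that scale between zero and a maximum count, and a shared Lustre file system. It is not the configuration used in the tutorials. The region, operating system, head node instance type, subnet IDs, key name, queue and resource names, node count, and storage capacity are all placeholder assumptions that must be replaced with values from your own account.

```yaml
# Hypothetical example values throughout; replace with your own settings.
Region: us-east-1
Image:
  Os: ubuntu2004
HeadNode:
  InstanceType: c5n.large            # assumed head node size
  Networking:
    SubnetId: subnet-xxxxxxxx        # placeholder subnet ID
  Ssh:
    KeyName: your-key-name           # placeholder EC2 key pair name
Scheduling:
  Scheduler: slurm                   # Slurm job scheduler
  SlurmQueues:
    - Name: queue1
      CapacityType: SPOT             # use ONDEMAND for On-Demand pricing
      Networking:
        SubnetIds:
          - subnet-xxxxxxxx          # placeholder subnet ID
      ComputeResources:
        - Name: compute-resource-1
          InstanceType: c5n.18xlarge
          MinCount: 0                # no compute nodes while the queue is idle
          MaxCount: 10               # assumed upper limit on compute nodes
SharedStorage:
  - MountDir: /fsx
    Name: fsx-lustre
    StorageType: FsxLustre           # shared high-performance file system
    FsxLustreSettings:
      StorageCapacity: 1200          # GiB, assumed size
```

With `MinCount: 0`, compute nodes are launched only when jobs are waiting in the Slurm queue, and `MaxCount` caps how far the cluster scales up. Switching `CapacityType` between `SPOT` and `ONDEMAND` changes the pricing model for that queue.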
Our goal is to make this user guide to running CMAQ on ParallelCluster as helpful and user-friendly as possible. Any feedback is both welcome and appreciated.
Additional information on AWS ParallelCluster:
AWS ParallelCluster documentation
AWS ParallelCluster training video
Contents:
- 1. Introductory Tutorial
- 2. System Requirements
- 3. Intermediate Tutorial
- 3.1. Use ParallelCluster pre-installed with software and data.
- 3.2. Create CMAQ ParallelCluster with software/data pre-installed
- 3.3. Log into the new cluster
- 3.4. Change shell to use tcsh
- 3.5. Verify Software
- 3.6. Verify Input Data
- 3.7. Examine CMAQ Run Scripts
- 3.8. Submit Job to Slurm Queue
- 3.9. Submit a minimum of 2 benchmark runs
- 4. Advanced Tutorial (optional)
- 4.1. Use ParallelCluster without Software and Data pre-installed
- 4.2. Create the c5n.4xlarge pcluster
- 4.3. Update the compute nodes
- 4.4. Create the c5n.18xlarge cluster
- 4.5. Log in to c5n.18xlarge cluster
- 4.6. Install Input Data on ParallelCluster
- 4.7. Install CMAQ software and libraries on ParallelCluster
- 4.8. Run CMAQ
- 5. Benchmark on HPC6a-48xlarge with EBS and Lustre
- 5.1. Use ParallelCluster pre-installed with CMAQv5.3.3 software and 12US2 Benchmark data.
- 5.2. Create CMAQ ParallelCluster with software/data pre-installed
- 5.3. Log into the new cluster
- 5.4. Resize the EBS Volume
- 5.5. Change shell to use tcsh
- 5.6. Verify Software
- 5.7. Verify Input Data
- 5.8. Examine CMAQ Run Scripts
- 5.9. To run on the EBS Volume a code modification is required.
- 5.10. Build the code by running the makefile
- 5.11. Submit Job to Slurm Queue to run CMAQ on Lustre
- 5.12. Submit a run script to run on the EBS volume
- 5.13. Modify YAML and then Update ParallelCluster.
- 5.14. Submit a minimum of 2 benchmark runs
- 5.15. Upgrade pcluster version to try Persistent 2 Lustre Filesystem
- 5.16. Query the stack formation log messages
- 6. CMAQv5.4 Benchmark of 12US1 Domain 2 Day case.
- 6.1. Create a c6a.xlarge Virtual Machine
- 6.2. Log in to the Virtual Machine
- 6.3. Change the group and ownership of the shared directory
- 6.4. Check operating system version
- 6.5. Set up build environment
- 6.6. Create Environment Module for Libraries
- 6.7. Install and Build CMAQ
- 6.8. Copy the run scripts from the repo to the run directory
- 6.9. Download the Input data from the S3 Bucket
- 6.10. Install unzip and unzip file
- 6.11. Run CMAQ interactively using the following command:
- 7. Scripts to run combine and post processing
- 8. Scripts to post-process CMAQ output
- 9. Install R, Rscripts and Packages
- 10. QA CMAQ
- 11. Compare Timing of CMAQ Routines
- 12. Copy Output to S3 Bucket
- 13. Log out and Delete ParallelCluster
- 14. Performance Optimization
- 14.1. Right-sizing Compute Nodes for the ParallelCluster Configuration
- 14.2. An explanation of why a scaling analysis is required for Multinode or Parallel MPI Codes
- 14.3. Slurm Compute Node Provisioning
- 14.4. Spot versus On-Demand Pricing
- 14.5. Benchmark Timings
- 14.6. Benchmark Scaling Plots
- 14.7. Cost Information
- 14.8. Recommended Workflow for extending to annual run
- 14.9. Side by Side Comparison of the information in the log files for 12x9 pe run compared to 9x12 pe run.
- 15. Additional Resources
- 15.1. FAQ
- 15.2. Free Training
- 15.3. Another workshop to learn the AWS CLI 3.0
- 15.4. YouTube video
- 15.5. Intro to AWS for HPC People - HPC Tech Shorts
- 15.6. Help Resources for CMAQ
- 15.7. Computing on the Cloud References
- 15.8. Resources from AWS for diagnosing issues with running ParallelCluster
- 15.9. Instructions on how to create a ParallelCluster Amazon Machine Image (AMI) from the command line
- 15.10. ParallelCluster Update
- 15.11. Use Elastic Fabric Adapter/Elastic Network Adapter for better performance
- 15.12. VPC Management
- 15.13. Using Cost Allocation Tags with ParallelCluster
- 16. Future Work
- 17. Contribute to this Tutorial