2. Create a ParallelCluster and run CMAQv5.4
Why might I need to use ParallelCluster?
AWS ParallelCluster can be configured as the equivalent of a High Performance Computing (HPC) environment: it supports job schedulers such as Slurm, runs code compiled with Message Passing Interface (MPI) across multiple nodes, and reads and writes output on a high-performance, low-latency shared disk. The advantage of the AWS ParallelCluster command line interface is that the number of compute nodes can easily be scaled up or down to match the compute requirements of a given simulation. HPC compute nodes such as hpc6a or hpc7g are available in a limited set of regions at significantly discounted pricing (60% below On-Demand costs). Users can also attempt to reduce costs by using Spot instances rather than On-Demand instances for the compute nodes. ParallelCluster also supports submitting multiple jobs to the job submission queue.
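As a rough illustration of how the scaling and pricing choices described above appear in a cluster configuration, the fragment below sketches only the `Scheduling` section of a ParallelCluster version 3 YAML file. The queue name, compute resource name, subnet ID, and node counts are placeholders chosen for this example, not values taken from this tutorial, and a complete configuration also needs sections such as `Region`, `Image`, and `HeadNode`.

```yaml
# Sketch only: Scheduling section of a ParallelCluster v3 configuration.
# All names, IDs, and counts are hypothetical placeholders.
Scheduling:
  Scheduler: slurm                  # Slurm manages the job queue
  SlurmQueues:
    - Name: queue1                  # placeholder queue name
      CapacityType: ONDEMAND        # SPOT is the alternative, where the instance type offers it
      Networking:
        SubnetIds:
          - subnet-xxxxxxxx         # placeholder; use a subnet from your own VPC
      ComputeResources:
        - Name: compute-resource-1  # placeholder compute resource name
          InstanceType: hpc7g.16xlarge
          MinCount: 0               # no compute nodes are kept running while the queue is idle
          MaxCount: 10              # upper limit on nodes Slurm may start for queued jobs
```

With `MinCount` set to 0, compute nodes are started only when jobs are waiting in the queue and are released after they go idle, which is how the cluster scales down between runs.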
Our goal is to make this user guide to running CMAQ on a ParallelCluster as helpful and user-friendly as possible. Any feedback is both welcome and appreciated.
- 2.1. Build a Demo ParallelCluster
- 2.2. Use ParallelCluster with Software and Data pre-installed on hpc7g.16xlarge
- 2.3. Run CMAQ on hpc7g.16xlarge
- 2.3.1. Log in to the cluster
- 2.3.2. Check the volume sizes
- 2.3.3. Resize the EBS Volume
- 2.3.4. Copy file to .cshrc
- 2.3.5. Preloading the files
- 2.3.6. Run the 12US1 Domain on 128 cores
- 2.3.7. Submit a job to run on 192 pes (3 nodes x 64 cores)
- 2.3.8. Submit a job to run on 320 pes running on 5 nodes
- 2.3.9. Submit a job to run on 128 cores with 32 cores per node
- 2.4. Run DESID CMAQ on hpc7g.16xlarge
- 2.5. Modify the ParallelCluster to remove the lustre filesystem
- 2.5.1. The following section assumes that you have already created the hpc7g pcluster using this command:
- 2.5.2. Output received from command line:
- 2.5.3. Check on status of cluster
- 2.5.4. Modify the yaml file to remove the /fsx volume
- 2.5.5. Stop the compute fleet
- 2.5.6. Update cluster to remove the lustre filesystem
- 2.5.7. Check on the status until it says the update is complete
- 2.5.8. Verify that the fsx volume is being deleted in the AWS Website Console
- 2.5.9. Update the compute fleet to restart the compute nodes
- 2.5.10. To add or re-add the /fsx filesystem
- 2.6. Create Cost Allocation Tags for Analysis using AWS Cost Explorer
- 2.6.1. Activation of an AWS Defined tag: createdBy in the Console
- 2.6.2. Creation and activation of a user defined tag
- 2.6.3. Creation of the pclusterTagsAndBudget IAM Policy
- 2.6.4. An S3 bucket named: cost-alloc-tag-pcluster was created to host files that were obtained and then modified according to the tutorial for the CMAS Account:
- 2.6.5. Review example yaml file that has the lines that need to be added highlighted by !!
- 2.6.6. Use above template to modify your cluster yaml to add cost allocation tags
- 2.6.7. Or - use the Listos Cost Allocation Yaml that is provided here (with the exception of the <account_id>)
- 2.6.8. Edit the <account_id> to use the value for your account
- 2.6.9. Use the modified yaml file to create the cluster
- Check on the cluster status
- Log in to your cluster
- Resize the EBS Volume
- Obtain the Listos Benchmark Case from the S3 bucket
- Obtain the Listos Run script from the S3 bucket
- Modify the version number in the run script to match the precompiled code version
- Modify the run script to add the CMAQ_DATA environment variable, modify INPDIR to match what is available after downloading the inputs, and modify the number of processors used
- Load the modules to get the libraries and compiler
- Check modules that are now available
- Load the library modules
- Submit job to Slurm using the --comment flag to specify the project name
- Use squeue to check on status of runs
- Verify that the run completes successfully
- Check the status of the cluster via the console and the command line to verify that the compute nodes have shut down
- Terminate the cluster after the compute nodes have successfully terminated
- It takes 24 hours for the cost data to appear in the Cost Explorer. Once 24 hours have elapsed, check the AWS Website Cost Explorer and select by tags.
- Verify that the cost allocation tags are visible on the head node and compute nodes using the AWS Console
- Tried updating the cluster to use different values for tags
- Stop the compute fleet before trying to update tags
- Try to update the cluster tag aws-parallelcluster-jobid now that the compute nodes are stopped
- Found an error in the yaml file I was using