7.10. Resources from AWS for diagnosing issues with running the Parallel Cluster#

  1. GitHub repository for AWS Parallel Cluster

  2. User Guide

  3. Getting Started Guide

  4. Guide to obtaining AWS Key Pair

  5. Lustre FAQ

  6. Parallel Cluster FAQ (somewhat outdated)

  7. Tool to convert v2 config files to v3 yaml files for Parallel Cluster

  8. Instructions for creating a fault-tolerant parallel cluster using the Lustre filesystem

  9. AWS HPC discussion forum

7.10.1. Issues#

For AWS Parallel Cluster you can create a GitHub issue for feedback or bug reports: Github Issues. There is also an active community-driven Q&A site that may be helpful: AWS re:Post.

7.10.2. Tips for managing the parallel cluster#

  1. The head node can be stopped from the AWS Console after the compute nodes of the cluster have been stopped, as long as it is restarted before issuing the command to restart the compute fleet (see the first command sketch after this list).

  2. The pcluster Slurm queue system creates and deletes the compute nodes automatically, which reduces manual cleanup for the cluster.

  3. The compute nodes are terminated after they have been idle for a period of time. The YAML setting that controls this is SlurmSettings: ScaledownIdletime: 5 (see the configuration sketch after this list).

  4. The default idle time is 10 minutes, and it can be reduced by specifying a shorter idle time in the YAML file. It is important to verify that the compute nodes are deleted after a job is finished, to avoid incurring unexpected costs.

  5. Copy or back up the outputs and logs to an S3 bucket for follow-up analysis.

  6. After copying the output and log files to the S3 bucket, the cluster can be deleted (see the cleanup sketch after this list).

  7. Once the pcluster is deleted, all of the volumes, the head node, and the compute nodes are terminated, and costs will only be incurred for the S3 bucket storage.
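
A minimal sketch of the stop/restart sequence from item 1, assuming the ParallelCluster v3 CLI, a cluster named mycluster, and a head node instance ID of i-0123456789abcdef0 (both placeholders):

```bash
# Stop the Slurm compute fleet before touching the head node
pcluster update-compute-fleet --cluster-name mycluster --status STOP_REQUESTED

# Stop the head node EC2 instance (this can also be done from the AWS Console)
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# Later: start the head node again BEFORE restarting the compute fleet
aws ec2 start-instances --instance-ids i-0123456789abcdef0
pcluster update-compute-fleet --cluster-name mycluster --status START_REQUESTED
```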
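A sketch of where the idle-time setting from items 3 and 4 sits in a ParallelCluster v3 configuration file; the queue and compute resource definitions are omitted here:

```yaml
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    # Minutes a compute node may sit idle before it is terminated (default: 10)
    ScaledownIdletime: 5
  # SlurmQueues: ...  (queue and compute resource definitions go here)
```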
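A minimal sketch of the backup-and-delete workflow in items 5 through 7, assuming the ParallelCluster v3 CLI, a cluster named mycluster, job outputs under /shared/results, and a bucket named my-results-bucket (all placeholders):

```bash
# Copy job outputs and logs from the shared filesystem to S3 for follow-up analysis
aws s3 sync /shared/results s3://my-results-bucket/results/

# Optionally archive the cluster logs to the same bucket before deletion
pcluster export-cluster-logs --cluster-name mycluster --bucket my-results-bucket

# Delete the cluster; the head node, compute nodes, and volumes are removed,
# leaving only the S3 bucket storage to incur costs
pcluster delete-cluster --cluster-name mycluster
```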