June 13, 2025

Disclaimer: The slurm-cluster-health-manager project is a sample tool created specifically for the article it accompanies. It is not an official Microsoft product, and it is not supported or maintained by Microsoft.
In Part 1, we introduced how to detect Slurm job failures using Epilog and initiate the first step of an automated recovery pipeline. In this follow-up, we’ll go deeper into automating the recovery of failed GPU nodes, including health checks, reboots, node state management, and job resubmission.
The implementation described in this article will require modifications to suit your specific environment; treat it as a reference rather than a drop-in solution. We hope you find it a useful starting point.
Repository Overview: slurm-cluster-health-manager
The slurm-cluster-health-manager repository provides a modular and administrator-driven recovery pipeline for GPU clusters managed by Slurm on Azure. It supports automated node diagnostics, conditional reboots, reporting, and notifications without relying on Slurm job scheduling.
Key Capabilities:
- Parallel health check orchestration
- Node-local reboot control and retry logic
- JSON, HTML, and CSV result reporting
- Microsoft Teams alert integration
Repository Structure:
slurm-cluster-health-manager/
├── cluster_health_orchestrator.py # Main entry point for multi-node orchestration
├── node_health_check_runner.py # Performs health checks on individual nodes
├── run_all_nodes_check.py # Reboot logic + retry within the node
├── remote_node_utils.py # SSH-based file/cmd transfer support
├── report_generator.py # HTML and CSV report creation
├── notify_teams_failed_nodes.py # Teams integration via Webhook
├── health_manager_config.py # Thresholds, flags, and tuning config
├── ghr_submission_controller.py # (Optional) Azure GHR integration
├── ghr_payload_utils.py # GHR payload builder
└── README.md
1. Prerequisites and Environment Setup
This pipeline assumes that health checks are executed by infrastructure administrators, not end-users. Therefore, instead of relying on Slurm to run diagnostics as jobs, checks are executed independently and in parallel using remote orchestration scripts. This avoids interference with user workloads.
Key configuration settings—such as NCCL bandwidth thresholds, maximum reboot attempts, and Teams/GHR integration flags—are defined in health_manager_config.py. For example, you can customize the minimum NCCL bandwidth required to pass a health check by adjusting the NCCL_BW_THRESHOLD parameter.
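As a reference, here is a minimal sketch of what such a configuration module could look like. Only NCCL_BW_THRESHOLD, MAX_REBOOT_COUNT, ENABLE_GHR_SUBMISSION, and the Teams WebHook URL are taken from this article; the values and the remaining names are illustrative assumptions, so check health_manager_config.py in the repository for the actual definitions.

# health_manager_config.py (illustrative sketch; see the repository for the real values)

# Minimum acceptable NCCL bandwidth in GB/s; nodes measuring below this fail the check.
NCCL_BW_THRESHOLD = 40.0                  # assumed value for illustration

# How many automatic reboot attempts a node may make before it is classified as Reboot_Failed.
MAX_REBOOT_COUNT = 1

# Optional integrations.
ENABLE_GHR_SUBMISSION = False             # submit persistent failures to Azure GHR
TEAMS_WEBHOOK_URL = "https://example.webhook.office.com/..."  # placeholder WebHook URL

# Nodes targeted by the orchestrator (assumed name; the list could also be derived from sinfo).
TARGET_NODES = ["slurm-node-01", "slurm-node-02"]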
Note that Slurm itself can run NHC health checks as jobs (for example via srun). We deliberately avoid the scheduler here: because the checks are administrator-driven, they are executed independently and in parallel through direct remote execution, so the diagnostic and recovery processes never depend on Slurm job scheduling and never interfere with user-submitted workloads.
Also keep in mind that the NCCL bandwidth tests need the interconnect to themselves: if bandwidth is being consumed by something else, such as a running Slurm job, the test will fail.
Before diving in, ensure you have the following environment:
- Azure CycleCloud-based GPU cluster with Slurm
- Python 3.8 environment available on the scheduler node
- Shared storage (e.g., /shared) across all nodes
- The Epilog script from Part 1 working and correctly reporting failed jobs
- Cloned repository: slurm-cluster-health-manager
2. Node Health Check Implementation
The slurm-cluster-health-manager repository defines a modular, Python-first approach, orchestrating health diagnostics with reusable components:
2.1 Relevant Modules & Their Roles
- health_manager_config.py
  Centralizes configuration: thresholds (NCCL_BW_THRESHOLD), maximum reboots (MAX_REBOOT_COUNT), the GHR flag, and the Teams WebHook URL.
- cluster_health_orchestrator.py
  Entry point for the admin-run workflow.
  Reads the config (health_manager_config.py) to obtain the node list, thresholds, and retry policies.
  For each target node, it executes parallel health evaluations and orchestrates recovery.
- node_health_check_runner.py
  Called remotely per node (via SSH) by the orchestrator.
  Performs sequential checks:
  - NHC – runs via a subprocess call to NHC.
  - NCCL single-node – checks intra-node GPU bandwidth.
  - NCCL multi-node – optionally runs an inter-node NCCL bandwidth test.
  Returns a JSON summary including pass/fail results and metrics.
- remote_node_utils.py
  Handles SSH and SCP utility logic, enabling file transfer and remote execution of node_health_check_runner.py.
- report_generator.py
  Aggregates all JSON health responses.
  Outputs human-readable CSV and HTML reports that highlight failures, metrics, and recovery attempts.
- ghr_submission_controller.py & ghr_payload_utils.py
  Build and submit failure metadata to Azure Global Health Reporting (GHR), if configured.
- notify_teams_failed_nodes.py
  Reads the final health summary, then sends formatted alerts (with metrics and timestamps) to Microsoft Teams via the configured WebHook.
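As an illustration of the Teams integration, a minimal alert can be posted with nothing more than the requests library and the incoming-WebHook URL. The payload shape and helper below are a simplified sketch, not the exact code in notify_teams_failed_nodes.py; TEAMS_WEBHOOK_URL is the assumed config name from the earlier sketch.

import requests
from health_manager_config import TEAMS_WEBHOOK_URL  # assumed config name

def notify_failed_nodes(failed):
    """Post a simple text alert listing failed nodes to a Teams incoming WebHook."""
    if not failed:
        return
    text = "Slurm health check: the following nodes failed recovery:\n" + "\n".join(
        f"- {n['hostname']}: {n['final_status']}" for n in failed
    )
    resp = requests.post(TEAMS_WEBHOOK_URL, json={"text": text}, timeout=10)
    resp.raise_for_status()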
2.2 Execution Flow
The orchestrator initiates health checks in parallel by executing run_all_nodes_check.py remotely on each node via SSH. Each node performs checks (NHC, NCCL single-node/multi-node), reboots if necessary, and saves results to shared storage.
Administrator runs:
$ python3 cluster_health_orchestrator.py
-> Orchestrator:
• Reads node list and config
• For each node:
– SSH into node
– Run run_all_nodes_check.py (includes health checks, reboot logic)
– Retrieve JSON result
-> Aggregation:
• Generate HTML/CSV report from collected JSONs
• Optionally send Teams notification
• Optionally submit GHR if failure persists
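To make the parallel step concrete, the following is a minimal sketch of how an orchestrator can fan out the SSH invocations with a thread pool and then collect the per-node JSON results from shared storage. The function names and simplified SSH command are illustrative, not the exact code in cluster_health_orchestrator.py.

import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_node_check(node, timestamp, base_dir):
    """SSH into one node and run the node-local check script (illustrative)."""
    ssh_cmd = (
        f"ssh -o StrictHostKeyChecking=no {node} "
        f"'export CHECK_TIMESTAMP={timestamp} CHECK_RESULT_DIR={base_dir} "
        f"ENABLE_REBOOT_ON_FAILURE=true && python3 run_all_nodes_check.py'"
    )
    subprocess.run(ssh_cmd, shell=True, check=False)
    # Each node writes its JSON result to shared storage; read it back here.
    with open(f"{base_dir}/hpc_check_result_{node}.json") as f:
        return json.load(f)

def run_all(nodes, timestamp, base_dir, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_node_check, n, timestamp, base_dir) for n in nodes]
        return [f.result() for f in futures]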
3. Detailed Node Checks in node_health_check_runner.py
The script node_health_check_runner.py is executed per-node and is responsible for running a series of health diagnostics. It is designed to be modular and self-contained, writing a structured JSON output per node. Below are the main checks executed within this script:
3.1 NHC (Node Health Check)
The script invokes the standard NHC utility using a subprocess call:
result = subprocess.run(["/opt/azurehpc/test/azurehpc-health-checks/run-health-checks.sh"], ...)
It captures the exit code and determines whether the node passes or fails based on standard health indicators such as GPU failures, GPU memory errors, and the NCCL single-node bandwidth test.
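A hedged sketch of this pattern — run the NHC wrapper script, capture its output, and translate the exit code into a PASS/FAIL flag — might look like the following; the timeout and result keys are illustrative assumptions.

import subprocess

NHC_SCRIPT = "/opt/azurehpc/test/azurehpc-health-checks/run-health-checks.sh"

def run_nhc(timeout=600):
    """Run the Azure HPC health-check script and classify the node by exit code."""
    result = subprocess.run(
        [NHC_SCRIPT],
        capture_output=True, text=True, timeout=timeout,
    )
    return {
        "nhc_result": "PASS" if result.returncode == 0 else "FAIL",
        "nhc_returncode": result.returncode,
        "nhc_output_tail": result.stdout[-2000:],  # keep the tail of the log for reporting
    }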
3.2 NCCL Multi-Node Test
If multi-node testing is enabled in the configuration, it launches synchronized tests across nodes. This step uses mpirun or srun to validate interconnect health.
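For reference, an inter-node bandwidth test is commonly driven with the NVIDIA nccl-tests binaries under mpirun. The binary path, host list, and message size below are assumptions for illustration; the repository's actual invocation may differ.

import subprocess

# Assumed location of the nccl-tests all_reduce_perf binary and an example two-node host list.
ALL_REDUCE_PERF = "/opt/nccl-tests/build/all_reduce_perf"
HOSTS = "slurm-node-01:8,slurm-node-02:8"   # 8 GPUs per node (example)

cmd = [
    "mpirun", "-np", "16", "-H", HOSTS,
    "-x", "LD_LIBRARY_PATH",                # forward the library path to remote ranks
    ALL_REDUCE_PERF, "-b", "1G", "-e", "1G", "-g", "1",
]
proc = subprocess.run(cmd, capture_output=True, text=True)
multi_node_pass = proc.returncode == 0      # parse stdout for bus bandwidth if a threshold is needed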
3.3 Result Classification
The final output JSON includes flags such as:
{
  "hostname": "slurm-node-01",
  "initial_returncode": 0,
  "nhc_result": "PASS",
  "nccl_single_result": "PASS",
  "nccl_bandwidth": 53.4,
  "final_status": "All_Success"
}
Return codes and statuses are designed for use by the orchestrator to decide on reboot necessity or escalation.
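As an example of how the orchestrator might consume these statuses (the status strings All_Success, Reboot_Resolved, Reboot_Failed, and Skip come from the report classifications described in Section 7; the dispatch logic itself is a sketch):

def handle_node_result(result, notify, submit_ghr):
    """Decide the follow-up action from a node's final_status (illustrative)."""
    status = result.get("final_status")
    if status in ("All_Success", "Reboot_Resolved"):
        return "healthy"                     # node can take jobs again
    if status == "Reboot_Failed":
        notify(result)                       # alert operators via Teams
        submit_ghr(result)                   # optionally escalate to Azure GHR
        return "escalated"
    return "skipped"                         # e.g. final_status == "Skip"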
4. Reboot and Recovery Flow
The decision to reboot a node is handled within the node itself, specifically in run_all_nodes_check.py, not in the orchestrator.
After being triggered via SSH by cluster_health_orchestrator.py, the script performs the following steps on each node (a simplified sketch of this flow follows the list):
- Runs initial diagnostics using node_health_check_runner.py.
- If critical failures are detected (e.g., GPU not found, NHC failure, or low NCCL bandwidth), the script creates a local flag file: /tmp/reboot_required.
- If this flag exists and ENABLE_REBOOT_ON_FAILURE=true is set in the environment, the node will automatically reboot itself.
- After reboot, the script re-runs node_health_check_runner.py and updates the result JSON accordingly.
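The sketch below shows this node-local loop: flag file, environment-gated reboot, and persistence of results for the post-reboot re-check. The flag path and environment variable names come from the description above; the structure (including the assumed run_checks helper) is illustrative rather than the literal contents of run_all_nodes_check.py.

import os
import subprocess

# run_checks is an assumed wrapper around node_health_check_runner.py returning the result dict.
from node_health_check_runner import run_checks

REBOOT_FLAG = "/tmp/reboot_required"

def run_with_recovery():
    result = run_checks()                                  # initial diagnostics
    if result.get("final_status") != "All_Success":
        open(REBOOT_FLAG, "w").close()                     # mark this node as needing a reboot

    reboot_enabled = os.environ.get("ENABLE_REBOOT_ON_FAILURE") == "true"
    max_reboots = int(os.environ.get("MAX_REBOOT_COUNT", "1"))

    if os.path.exists(REBOOT_FLAG) and reboot_enabled and max_reboots > 0:
        # Persist the current result JSON to shared storage before rebooting; after the node
        # comes back, the checks are re-run and the JSON is updated accordingly.
        subprocess.run(["sudo", "reboot"])
    return result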
4.1 Execution Snippet from the Orchestrator
ssh_cmd = (
    f"ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null {node} "
    f"'export CHECK_TIMESTAMP={TIMESTAMP} "
    f"CHECK_RESULT_DIR={BASE_DIR} "
    f"PREVIOUS_RESULT_PATH={BASE_DIR}/hpc_check_result_{node}.json "
    f"ENABLE_REBOOT_ON_FAILURE=true "
    f"MAX_REBOOT_COUNT={MAX_REBOOT_COUNT} && "
    f"python3 run_all_nodes_check.py'"
)
4.2 Summary
- Reboot logic and retry cycles are entirely controlled by the node’s own run_all_nodes_check.py execution.
- The orchestrator simply launches the script remotely and collects the resulting JSON.
- This design localizes recovery decisions, simplifies orchestration, and avoids redundant cross-node state management.
5. Job Resubmission Policy
In this recovery framework, job resubmission is handled manually or by external orchestration, rather than relying on Slurm's --requeue feature.
5.1 Why We Avoid --requeue
We intentionally avoid using --requeue because at the time of job failure:
- The failed node may still be in recovery
- Slurm might prematurely reassign the job to the same faulty node
- This can lead to repeated failures and confuse users
Instead, we decouple job recovery from node recovery. The orchestrator focuses purely on node health validation.
5.2 Resubmission Workflow
Our policy is: validate node recovery first, then resubmit. The general steps are:
- Slurm job fails, triggering the Epilog and node failure detection
- Administrator runs cluster_health_orchestrator.py
- On each target node, run_all_nodes_check.py is executed:
- Performs diagnostics
- Reboots if necessary
- Produces a final JSON result
- If the node's final_status is All_Success, the administrator or a workflow system may resubmit the job, either manually or automatically via a Slurm batch script.
This ensures that resubmitted jobs are only dispatched to healthy nodes.
5.3 How to Resubmit
Resubmission can be performed via:
- Manual re-run: sbatch
- Portal or workflow system that parses result JSONs
- A custom script that watches the health result directory and auto-submits on All_Success (a sketch of this approach follows)
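As one example of the last option, a small watcher could poll the result directory and resubmit when a node reports All_Success. The paths, polling interval, and job-tracking scheme below are assumptions for illustration only.

import glob
import json
import subprocess
import time

RESULT_DIR = "/shared/health_results/latest"        # assumed location of the per-node JSONs
PENDING_JOBS = {"slurm-node-01": "/shared/jobs/train_llm.sbatch"}  # node -> failed job script (example)

def watch_and_resubmit(poll_seconds=60):
    while PENDING_JOBS:
        for path in glob.glob(f"{RESULT_DIR}/hpc_check_result_*.json"):
            with open(path) as f:
                result = json.load(f)
            node = result.get("hostname")
            if result.get("final_status") == "All_Success" and node in PENDING_JOBS:
                # Node is healthy again: resubmit the job and stop tracking it.
                subprocess.run(["sbatch", PENDING_JOBS.pop(node)], check=True)
        time.sleep(poll_seconds)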
5.4 Future Enhancements
In future iterations, job-level integration may include:
- Job tagging and structured metadata for post-failure triage
- Integration with checkpoint/restart tools (e.g., DMTCP, CRIU)
- Auto-submission triggered by node health state changes
This resubmission policy minimizes risk and supports a robust operational model in Slurm-based GPU clusters.
6. GHR Submission (Optional)
If enabled, the orchestrator can submit diagnostic failure data to Azure’s Global Health Reporting (GHR) system. This is controlled by the ENABLE_GHR_SUBMISSION flag in health_manager_config.py.
The GHR submission logic is implemented in ghr_submission_controller.py, and invoked by cluster_health_orchestrator.py when failures persist even after recovery attempts.
6.1 When GHR Submission Occurs
GHR submission is attempted only when the following conditions are met (a sketch of this gating check follows the list):
- ENABLE_GHR_SUBMISSION = True is set in health_manager_config.py
- The final health result of a node is classified as Reboot_Failed
- All health checks and maximum allowed reboot attempts have been exhausted
- A valid Azure authentication context (e.g., az login) is available
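The conditions above translate into a straightforward gate before any payload is built. The sketch below only shows that gating logic, including a lightweight az account show check for an Azure login context; it deliberately omits the actual GHR submission call, and the reboot_count key is an assumed field name.

import subprocess
from health_manager_config import ENABLE_GHR_SUBMISSION, MAX_REBOOT_COUNT

def should_submit_ghr(result):
    """Return True only if all preconditions for a GHR submission are satisfied."""
    if not ENABLE_GHR_SUBMISSION:
        return False
    if result.get("final_status") != "Reboot_Failed":
        return False
    if result.get("reboot_count", 0) < MAX_REBOOT_COUNT:   # assumed key for reboot attempts
        return False                                       # attempts not yet exhausted
    # Verify an Azure authentication context exists (e.g., from a prior az login).
    az = subprocess.run(["az", "account", "show"], capture_output=True)
    return az.returncode == 0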
6.2 Payload Contents
- Hostname and timestamp of failure
- Number of reboot attempts
- Status of each health check component (NHC, NCCL, etc.)
- Final classification of the node state (e.g., Reboot_Failed)
This data is submitted securely using Azure CLI or REST APIs as defined in the GHR payload logic. GHR integration is optional but beneficial for organizations requiring centralized diagnostics and reliability metrics across large-scale HPC infrastructures.
7. Reporting: HTML and CSV
After all node checks (and any reboots) are completed, the orchestrator collects JSON results from each node. These are passed to report_generator.py, which generates consolidated reports.
7.1 Execution Example
python3 report_generator.py \
  --result_dir /shared/health_results/20250613-0930 \
  --output_html /shared/health_results/result_summary.html \
  --output_csv /shared/health_results/result_summary.csv
7.2 Report Contents
The reports include:
- Per-node status: initial_returncode, final_status, reboot counts
- Check results: nhc_result, nccl_single_result, nccl_bandwidth, etc.
- Classification: “All_Success”, “Reboot_Resolved”, “Reboot_Failed”, “Skip”, etc.
HTML format is ideal for visualization, and CSV can be imported into Excel or dashboards for long-term tracking or aggregation.
These reports help cluster operators quickly assess node health trends and identify persistent hardware issues or configuration gaps.
8. Troubleshooting Tips
- GPU check fails: Confirm the NVIDIA driver installation and run nvidia-smi to verify that all GPUs are visible.
- JSON results missing: Check if run_all_nodes_check.py executed successfully via orchestrator subprocess logs.
- Node stuck or not rebooting:
- Verify that ENABLE_REBOOT_ON_FAILURE=true is exported in the SSH command.
- Ensure the node supports sudo reboot or equivalent reboot commands.
- Confirm that MAX_REBOOT_COUNT is set to a non-zero value.
- HTML/CSV reports not generated: Validate that all node JSON files exist in the target result_dir, and check report_generator.py arguments.
9. Conclusion
This article has shown how to implement a resilient and modular recovery pipeline for GPU clusters running Slurm on Azure, building on the foundation introduced in Part 1. By leveraging parallel health checks, node-local reboot logic, detailed result reporting, and optional GHR integration, infrastructure administrators can proactively and automatically manage node failures without user intervention.
While each environment will have its own unique requirements, the structure provided by the slurm-cluster-health-manager repository is highly adaptable. You can customize thresholds, enable or disable reboot policies, and extend the logic to integrate with your existing workflows.
We recommend starting with the default configuration and gradually enabling components such as Teams notifications or GHR submission as you validate reliability. Collaboration between platform teams and workload owners is key to ensuring that job resiliency, node recovery, and reporting align with organizational SLAs.
High-performance AI/GPU infrastructure is operationally demanding, but by adopting structured recovery pipelines like this one, teams can improve uptime, reduce manual intervention, and scale confidently.