June 13, 2025

Disclaimer: The slurm-cluster-health-manager project is a sample tool created specifically for the article it accompanies. It is not an official Microsoft product, and it is not supported or maintained by Microsoft.
In Part 1, we introduced how to detect Slurm job failures using Epilog and initiate the first step of an automated recovery pipeline. In this follow-up, we’ll go deeper into automating the recovery of failed GPU nodes, including health checks, reboots, node state management, and job resubmission.
The implementation described in this article will require modifications to suit your specific environment; treat it as a reference rather than a drop-in solution. We hope you find it a useful starting point.
Repository Overview: slurm-cluster-health-manager
The slurm-cluster-health-manager repository provides a modular and administrator-driven recovery pipeline for GPU clusters managed by Slurm on Azure. It supports automated node diagnostics, conditional reboots, reporting, and notifications without relying on Slurm job scheduling.
Key Capabilities:
- Parallel health check orchestration
- Node-local reboot control and retry logic
- JSON, HTML, and CSV result reporting
- Microsoft Teams alert integration
Repository Structure:
slurm-cluster-health-manager/
├── cluster_health_orchestrator.py # Main entry point for multi-node orchestration
├── node_health_check_runner.py # Performs health checks on individual nodes
├── run_all_nodes_check.py # Reboot logic + retry within the node
├── remote_node_utils.py # SSH-based file/cmd transfer support
├── report_generator.py # HTML and CSV report creation
├── notify_teams_failed_nodes.py # Teams integration via Webhook
├── health_manager_config.py # Thresholds, flags, and tuning config
├── ghr_submission_controller.py # (Optional) Azure GHR integration
├── ghr_payload_utils.py # GHR payload builder
└── README.md
1. Prerequisites and Environment Setup
This pipeline assumes that health checks are executed by infrastructure administrators, not end-users. Therefore, instead of relying on Slurm to run diagnostics as jobs, checks are executed independently and in parallel using remote orchestration scripts. This avoids interference with user workloads.
Key configuration settings—such as NCCL bandwidth thresholds, maximum reboot attempts, and Teams/GHR integration flags—are defined in health_manager_config.py. For example, you can customize the minimum NCCL bandwidth required to pass a health check by adjusting the NCCL_BW_THRESHOLD parameter.
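As a reference, here is a minimal sketch of what such a configuration module could look like. Only NCCL_BW_THRESHOLD, MAX_REBOOT_COUNT, ENABLE_GHR_SUBMISSION, and the Teams WebHook URL are taken from this article; the values and the remaining names are illustrative assumptions, so check health_manager_config.py in the repository for the actual definitions.

# health_manager_config.py (illustrative sketch; see the repository for the real values)

# Minimum acceptable NCCL bandwidth in GB/s; nodes measuring below this fail the check.
NCCL_BW_THRESHOLD = 40.0                  # assumed value for illustration

# How many automatic reboot attempts a node may make before it is classified as Reboot_Failed.
MAX_REBOOT_COUNT = 1

# Optional integrations.
ENABLE_GHR_SUBMISSION = False             # submit persistent failures to Azure GHR
TEAMS_WEBHOOK_URL = "https://example.webhook.office.com/..."  # placeholder WebHook URL

# Nodes targeted by the orchestrator (assumed name; the list could also be derived from sinfo).
TARGET_NODES = ["slurm-node-01", "slurm-node-02"]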
Note that Slurm itself can run NHC health checks as jobs (for example via srun). We deliberately avoid the scheduler here: because the checks are administrator-driven, they are executed independently and in parallel through direct remote execution, so the diagnostic and recovery processes never depend on Slurm job scheduling and never interfere with user-submitted workloads.
Also keep in mind that the NCCL bandwidth tests need the interconnect to themselves: if bandwidth is being consumed by something else, such as a running Slurm job, the test will fail.
Before diving in, ensure you have the following environment:
- Azure CycleCloud-based GPU cluster with Slurm
- Python 3.8 environment available on the scheduler node
- Shared storage (e.g., /shared) across all nodes
- The Epilog script from Part 1 working and correctly reporting failed jobs
- Cloned repository: slurm-cluster-health-manager
2. Node Health Check Implementation
The slurm-cluster-health-manager repository defines a modular, Python-first approach, orchestrating health diagnostics with reusable components:
2.1 Relevant Modules & Their Roles
- health_manager_config.py
  Centralizes configuration: thresholds (NCCL_BW_THRESHOLD), maximum reboots (MAX_REBOOT_COUNT), the GHR flag, and the Teams WebHook URL.
- cluster_health_orchestrator.py
  Entry point for the admin-run workflow.
  Reads the config (health_manager_config.py) to obtain the node list, thresholds, and retry policies.
  For each target node, it executes parallel health evaluations and orchestrates recovery.
- node_health_check_runner.py
  Called remotely per node (via SSH) by the orchestrator.
  Performs sequential checks:
  - NHC – runs via a subprocess call to NHC.
  - NCCL single-node – checks intra-node GPU bandwidth.
  - NCCL multi-node – optionally runs an inter-node NCCL bandwidth test.
  Returns a JSON summary including pass/fail results and metrics.
- remote_node_utils.py
  Handles SSH and SCP utility logic, enabling file transfer and remote execution of node_health_check_runner.py.
- report_generator.py
  Aggregates all JSON health responses.
  Outputs human-readable CSV and HTML reports that highlight failures, metrics, and recovery attempts.
- ghr_submission_controller.py & ghr_payload_utils.py
  Build and submit failure metadata to Azure Global Health Reporting (GHR), if configured.
- notify_teams_failed_nodes.py
  Reads the final health summary, then sends formatted alerts (with metrics and timestamps) to Microsoft Teams via the configured WebHook.
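As an illustration of the Teams integration, a minimal alert can be posted with nothing more than the requests library and the incoming-WebHook URL. The payload shape and helper below are a simplified sketch, not the exact code in notify_teams_failed_nodes.py; TEAMS_WEBHOOK_URL is the assumed config name from the earlier sketch.

import requests
from health_manager_config import TEAMS_WEBHOOK_URL  # assumed config name

def notify_failed_nodes(failed):
    """Post a simple text alert listing failed nodes to a Teams incoming WebHook."""
    if not failed:
        return
    text = "Slurm health check: the following nodes failed recovery:\n" + "\n".join(
        f"- {n['hostname']}: {n['final_status']}" for n in failed
    )
    resp = requests.post(TEAMS_WEBHOOK_URL, json={"text": text}, timeout=10)
    resp.raise_for_status()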
2.2 Execution Flow
The orchestrator initiates health checks in parallel by executing run_all_nodes_check.py remotely on each node via SSH. Each node performs checks (NHC, NCCL single-node/multi-node), reboots if necessary, and saves results to shared storage.
Administrator runs:
$ python3 cluster_health_orchestrator.py
-> Orchestrator:
• Reads node list and config
• For each node:
– SSH into node
– Run run_all_nodes_check.py (includes health checks, reboot logic)
– Retrieve JSON result
-> Aggregation:
• Generate HTML/CSV report from collected JSONs
• Optionally send Teams notification
• Optionally submit GHR if failure persists
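To make the parallel step concrete, the following is a minimal sketch of how an orchestrator can fan out the SSH invocations with a thread pool and then collect the per-node JSON results from shared storage. The function names and simplified SSH command are illustrative, not the exact code in cluster_health_orchestrator.py.

import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_node_check(node, timestamp, base_dir):
    """SSH into one node and run the node-local check script (illustrative)."""
    ssh_cmd = (
        f"ssh -o StrictHostKeyChecking=no {node} "
        f"'export CHECK_TIMESTAMP={timestamp} CHECK_RESULT_DIR={base_dir} "
        f"ENABLE_REBOOT_ON_FAILURE=true && python3 run_all_nodes_check.py'"
    )
    subprocess.run(ssh_cmd, shell=True, check=False)
    # Each node writes its JSON result to shared storage; read it back here.
    with open(f"{base_dir}/hpc_check_result_{node}.json") as f:
        return json.load(f)

def run_all(nodes, timestamp, base_dir, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_node_check, n, timestamp, base_dir) for n in nodes]
        return [f.result() for f in futures]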
3. Detailed Node Checks in node_health_check_runner.py
The script node_health_check_runner.py is executed per-node and is responsible for running a series of health diagnostics. It is designed to be modular and self-contained, writing a structured JSON output per node. Below are the main checks executed within this script:
3.1 NHC (Node Health Check)
The script invokes the standard NHC utility using a subprocess call:
result = subprocess.run(["/opt/azurehpc/test/azurehpc-health-checks/run-health-checks.sh"], ...)
It captures the exit code and determines whether the node passes or fails based on standard health indicators such as GPU failures, GPU memory errors, and the NCCL single-node bandwidth test.
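A hedged sketch of this pattern — run the NHC wrapper script, capture its output, and translate the exit code into a PASS/FAIL flag — might look like the following; the timeout and result keys are illustrative assumptions.

import subprocess

NHC_SCRIPT = "/opt/azurehpc/test/azurehpc-health-checks/run-health-checks.sh"

def run_nhc(timeout=600):
    """Run the Azure HPC health-check script and classify the node by exit code."""
    result = subprocess.run(
        [NHC_SCRIPT],
        capture_output=True, text=True, timeout=timeout,
    )
    return {
        "nhc_result": "PASS" if result.returncode == 0 else "FAIL",
        "nhc_returncode": result.returncode,
        "nhc_output_tail": result.stdout[-2000:],  # keep the tail of the log for reporting
    }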
3.2 NCCL Multi-Node Test
If multi-node testing is enabled in the configuration, it launches synchronized tests across nodes. This step uses mpirun or srun to validate interconnect health.
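For reference, an inter-node bandwidth test is commonly driven with the NVIDIA nccl-tests binaries under mpirun. The binary path, host list, and message size below are assumptions for illustration; the repository's actual invocation may differ.

import subprocess

# Assumed location of the nccl-tests all_reduce_perf binary and an example two-node host list.
ALL_REDUCE_PERF = "/opt/nccl-tests/build/all_reduce_perf"
HOSTS = "slurm-node-01:8,slurm-node-02:8"   # 8 GPUs per node (example)

cmd = [
    "mpirun", "-np", "16", "-H", HOSTS,
    "-x", "LD_LIBRARY_PATH",                # forward the library path to remote ranks
    ALL_REDUCE_PERF, "-b", "1G", "-e", "1G", "-g", "1",
]
proc = subprocess.run(cmd, capture_output=True, text=True)
multi_node_pass = proc.returncode == 0      # parse stdout for bus bandwidth if a threshold is needed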
3.3 Result Classification
The final output JSON includes flags such as:
{
  "hostname": "slurm-node-01",
  "initial_returncode": 0,
  "nhc_result": "PASS",
  "nccl_single_result": "PASS",
  "nccl_bandwidth": 53.4,
  "final_status": "All_Success"
}
Return codes and statuses are designed for use by the orchestrator to decide on reboot necessity or escalation.
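As an example of how the orchestrator might consume these statuses (the status strings All_Success, Reboot_Resolved, Reboot_Failed, and Skip come from the report classifications described in Section 7; the dispatch logic itself is a sketch):

def handle_node_result(result, notify, submit_ghr):
    """Decide the follow-up action from a node's final_status (illustrative)."""
    status = result.get("final_status")
    if status in ("All_Success", "Reboot_Resolved"):
        return "healthy"                     # node can take jobs again
    if status == "Reboot_Failed":
        notify(result)                       # alert operators via Teams
        submit_ghr(result)                   # optionally escalate to Azure GHR
        return "escalated"
    return "skipped"                         # e.g. final_status == "Skip"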
4. Reboot and Recovery Flow
The decision to reboot a node is handled within the node itself, specifically in run_all_nodes_check.py, not in the orchestrator.
After being triggered via SSH by cluster_health_orchestrator.py, the script performs the following steps on each node (a simplified sketch of this flow follows the list):
- Runs initial diagnostics using node_health_check_runner.py.
- If critical failures are detected (e.g., GPU not found, NHC failure, or low NCCL bandwidth), the script creates a local flag file: /tmp/reboot_required.
- If this flag exists and ENABLE_REBOOT_ON_FAILURE=true is set in the environment, the node will automatically reboot itself.
- After reboot, the script re-runs node_health_check_runner.py and updates the result JSON accordingly.
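The sketch below shows this node-local loop: flag file, environment-gated reboot, and persistence of results for the post-reboot re-check. The flag path and environment variable names come from the description above; the structure (including the assumed run_checks helper) is illustrative rather than the literal contents of run_all_nodes_check.py.

import os
import subprocess

# run_checks is an assumed wrapper around node_health_check_runner.py returning the result dict.
from node_health_check_runner import run_checks

REBOOT_FLAG = "/tmp/reboot_required"

def run_with_recovery():
    result = run_checks()                                  # initial diagnostics
    if result.get("final_status") != "All_Success":
        open(REBOOT_FLAG, "w").close()                     # mark this node as needing a reboot

    reboot_enabled = os.environ.get("ENABLE_REBOOT_ON_FAILURE") == "true"
    max_reboots = int(os.environ.get("MAX_REBOOT_COUNT", "1"))

    if os.path.exists(REBOOT_FLAG) and reboot_enabled and max_reboots > 0:
        # Persist the current result JSON to shared storage before rebooting; after the node
        # comes back, the checks are re-run and the JSON is updated accordingly.
        subprocess.run(["sudo", "reboot"])
    return result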
4.1 Execution Snippet from the Orchestrator
ssh_cmd = (
    f"ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null {node} "
    f"'export CHECK_TIMESTAMP={TIMESTAMP} "
    f"CHECK_RESULT_DIR={BASE_DIR} "
    f"PREVIOUS_RESULT_PATH={BASE_DIR}/hpc_check_result_{node}.json "
    f"ENABLE_REBOOT_ON_FAILURE=true "
    f"MAX_REBOOT_COUNT={MAX_REBOOT_COUNT} && "
    f"python3 run_all_nodes_check.py'"
)
4.2 Summary
- Reboot logic and retry cycles are entirely controlled by the node’s own run_all_nodes_check.py execution.
- The orchestrator simply launches the script remotely and collects the resulting JSON.
- This design localizes recovery decisions, simplifies orchestration, and avoids redundant cross-node state management.
5. Job Resubmission Policy
In this recovery framework, job resubmission is handled manually or by external orchestration, rather than relying on Slurm's --requeue feature.
5.1 Why We Avoid --requeue
We intentionally avoid using --requeue because at the time of job failure:
- The failed node may still be in recovery
- Slurm might prematurely reassign the job to the same faulty node
- This can lead to repeated failures and confuse users
Instead, we decouple job recovery from node recovery. The orchestrator focuses purely on node health validation.
5.2 Resubmission Workflow
Our policy is: validate node recovery first, then resubmit. The general steps are:
- Slurm job fails, triggering the Epilog and node failure detection
- Administrator runs cluster_health_orchestrator.py
- On each target node, run_all_nodes_check.py is executed:
- Performs diagnostics
- Reboots if necessary
- Produces a final JSON result
- If the node's final_status is All_Success, the administrator or a workflow system may resubmit the job, either manually or automatically via a Slurm batch script.
This ensures that resubmitted jobs are only dispatched to healthy nodes.
5.3 How to Resubmit
Resubmission can be performed via:
- Manual re-run: sbatch
- Portal or workflow system that parses result JSONs
- A custom script that watches the health result directory and auto-submits on All_Success (a sketch of this approach follows)
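As one example of the last option, a small watcher could poll the result directory and resubmit when a node reports All_Success. The paths, polling interval, and job-tracking scheme below are assumptions for illustration only.

import glob
import json
import subprocess
import time

RESULT_DIR = "/shared/health_results/latest"        # assumed location of the per-node JSONs
PENDING_JOBS = {"slurm-node-01": "/shared/jobs/train_llm.sbatch"}  # node -> failed job script (example)

def watch_and_resubmit(poll_seconds=60):
    while PENDING_JOBS:
        for path in glob.glob(f"{RESULT_DIR}/hpc_check_result_*.json"):
            with open(path) as f:
                result = json.load(f)
            node = result.get("hostname")
            if result.get("final_status") == "All_Success" and node in PENDING_JOBS:
                # Node is healthy again: resubmit the job and stop tracking it.
                subprocess.run(["sbatch", PENDING_JOBS.pop(node)], check=True)
        time.sleep(poll_seconds)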
5.4 Future Enhancements
In future iterations, job-level integration may include:
- Job tagging and structured metadata for post-failure triage
- Integration with checkpoint/restart tools (e.g., DMTCP, CRIU)
- Auto-submission triggered by node health state changes
This resubmission policy minimizes risk and supports a robust operational model in Slurm-based GPU clusters.
6. GHR Submission (Optional)
If enabled, the orchestrator can submit diagnostic failure data to Azure’s Global Health Reporting (GHR) system. This is controlled by the ENABLE_GHR_SUBMISSION flag in health_manager_config.py.
The GHR submission logic is implemented in ghr_submission_controller.py, and invoked by cluster_health_orchestrator.py when failures persist even after recovery attempts.
6.1 When GHR Submission Occurs
GHR submission is attempted only when the following conditions are met (a sketch of this gating check follows the list):
- ENABLE_GHR_SUBMISSION = True is set in health_manager_config.py
- The final health result of a node is classified as Reboot_Failed
- All health checks and maximum allowed reboot attempts have been exhausted
- A valid Azure authentication context (e.g., az login) is available
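The conditions above translate into a straightforward gate before any payload is built. The sketch below only shows that gating logic, including a lightweight az account show check for an Azure login context; it deliberately omits the actual GHR submission call, and the reboot_count key is an assumed field name.

import subprocess
from health_manager_config import ENABLE_GHR_SUBMISSION, MAX_REBOOT_COUNT

def should_submit_ghr(result):
    """Return True only if all preconditions for a GHR submission are satisfied."""
    if not ENABLE_GHR_SUBMISSION:
        return False
    if result.get("final_status") != "Reboot_Failed":
        return False
    if result.get("reboot_count", 0) < MAX_REBOOT_COUNT:   # assumed key for reboot attempts
        return False                                       # attempts not yet exhausted
    # Verify an Azure authentication context exists (e.g., from a prior az login).
    az = subprocess.run(["az", "account", "show"], capture_output=True)
    return az.returncode == 0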
6.2 Payload Contents
- Hostname and timestamp of failure
- Number of reboot attempts
- Status of each health check component (NHC, NCCL, etc.)
- Final classification of the node state (e.g., Reboot_Failed)
This data is submitted securely using Azure CLI or REST APIs as defined in the GHR payload logic. GHR integration is optional but beneficial for organizations requiring centralized diagnostics and reliability metrics across large-scale HPC infrastructures.
7. Reporting: HTML and CSV
After all node checks (and any reboots) are completed, the orchestrator collects JSON results from each node. These are passed to report_generator.py, which generates consolidated reports.
7.1 Execution Example
python3 report_generator.py \
  --result_dir /shared/health_results/20250613-0930 \
  --output_html /shared/health_results/result_summary.html \
  --output_csv /shared/health_results/result_summary.csv
7.2 Report Contents
The reports include:
- Per-node status: initial_returncode, final_status, reboot counts
- Check results: nhc_result, nccl_single_result, nccl_bandwidth, etc.
- Classification: “All_Success”, “Reboot_Resolved”, “Reboot_Failed”, “Skip”, etc.
HTML format is ideal for visualization, and CSV can be imported into Excel or dashboards for long-term tracking or aggregation.
These reports help cluster operators quickly assess node health trends and identify persistent hardware issues or configuration gaps.
8. Troubleshooting Tips
- GPU check fails: Confirm the NVIDIA driver installation and run nvidia-smi to verify that all GPUs are visible.
- JSON results missing: Check if run_all_nodes_check.py executed successfully via orchestrator subprocess logs.
- Node stuck or not rebooting:
- Verify that ENABLE_REBOOT_ON_FAILURE=true is exported in the SSH command.
- Ensure the node supports sudo reboot or equivalent reboot commands.
- Confirm that MAX_REBOOT_COUNT is set to a non-zero value.
- HTML/CSV reports not generated: Validate that all node JSON files exist in the target result_dir, and check report_generator.py arguments.
9. Conclusion
This article has shown how to implement a resilient and modular recovery pipeline for GPU clusters running Slurm on Azure, building on the foundation introduced in Part 1. By leveraging parallel health checks, node-local reboot logic, detailed result reporting, and optional GHR integration, infrastructure administrators can proactively and automatically manage node failures without user intervention.
While each environment will have its own unique requirements, the structure provided by the slurm-cluster-health-manager repository is highly adaptable. You can customize thresholds, enable or disable reboot policies, and extend the logic to integrate with your existing workflows.
We recommend starting with the default configuration and gradually enabling components such as Teams notifications or GHR submission as you validate reliability. Collaboration between platform teams and workload owners is key to ensuring that job resiliency, node recovery, and reporting align with organizational SLAs.
High-performance AI/GPU infrastructure is operationally demanding, but by adopting structured recovery pipelines like this one, teams can improve uptime, reduce manual intervention, and scale confidently.