Overview
As GPU clusters grow in scale, failure recovery becomes a critical part of maintaining workload resiliency and maximizing compute resource utilization. In this article series, I’ll walk through how to build an automated recovery system for a Slurm-managed GPU cluster running on Microsoft Azure. This system detects job failures, identifies unhealthy nodes, performs reboot-based remediation, and reintegrates healthy nodes back into the cluster—all while logging and notifying operators of persistent issues.
The overall recovery pipeline consists of:
- Slurm Job Failure Detection
- Health Diagnostics (NHC, NCCL Bandwidth Tests)
- Automated Reboot and Retry Logic
- Node State Recording & HTML/CSV Reporting
- GHR (Guest Health Reporting) Integration
- Final Node Status & Requeueing Failed Jobs
Supplemental terminology
- GHR (Guest Health Reporting): an API-based mechanism for securely exchanging the health (validity) status of nodes.
- NHC: Node Health Check script: https://github.com/Azure/azurehpc-health-checks (a quick manual-run sketch follows this list)
- Enable NHC in CycleCloud (Slurm -> Advanced Settings -> NHC)
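As a quick sanity check outside of any automation, the Azure NHC suite can also be run by hand on a suspect node. A minimal sketch, assuming the azurehpc-health-checks repository has been cloned to a shared path (the clone location is an assumption, and the entry-point script name may differ between releases):
# Run the Azure NHC suite manually on the current node and keep a copy of the output
cd /shared/azurehpc-health-checks   # assumed clone location
sudo ./run-health-checks.sh | tee /shared/slurm/logs/nhc_$(hostname).log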
See also the slurm-cluster-health-manager repository for more information on this article.
Part 1: Detecting Slurm Job Failures and Triggering the Recovery Workflow
The foundation of our automated recovery system is the ability to detect failed jobs at runtime and trigger a health-check workflow for involved nodes.
Prerequisites
- Use an Azure CycleCloud (CC) Slurm cluster.
- Azure CycleCloud's Slurm template uses /shared for user home directories and /sched as the Slurm scheduler's persistent disk, by convention.
- For example, if the admin user is azureuser, then /shared/home/azureuser/ is that user's home directory.
- The scripts invoked on Slurm job failure must be accessible from the compute nodes, so they must be placed in a shared directory.
- The current Slurm template in Azure CycleCloud (CC) places the scheduler configuration, including slurm.conf, under /sched/<cluster name>/. Note that the exact directory therefore depends on the cluster name.
- Epilog requires Slurm Accounting; please read this article to set it up (changes since that article require the creation of a new DB table). A quick verification sketch follows at the end of these prerequisites.
For more information on how to use Azure CycleCloud and Slurm in general, please refer to this guide.
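Since the Epilog-based detection relies on job exit codes recorded through Slurm accounting, a quick way to verify accounting is wired up before proceeding (a hedged check using standard Slurm tooling; replace <job_id> with a recent job ID on your cluster):
# Confirm the cluster is registered with slurmdbd (accounting is active)
sacctmgr show cluster
# Confirm exit codes are being recorded for completed jobs
sacct -X -o JobID,JobName,State,ExitCode -j <job_id>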
Step 1: Set Up the Epilog Function
Slurm provides several hooks that run at job start and completion. Here I will set up the Epilog hook at a minimum. See this URL for a general reference.
Add the following line to slurm.conf at /sched/<cluster name>/slurm.conf (e.g., /sched/slurm01/slurm.conf):
Epilog=/shared/slurm/job_epilog.sh
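Once slurmctld has been restarted (see the end of Step 2), you can confirm that the setting was picked up:
# Verify the live Epilog configuration in slurmctld
scontrol show config | grep -i epilog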
Step 2: Use Epilog Scripts to Detect Failures
Slurm’s Job Epilog script allows us to hook into job termination and check the exit status. The script is executed after the job completes (whether successfully or not).
Sample Epilog Snippet (/shared/slurm/job_epilog.sh – download):
#!/bin/bash
set -euo pipefail
exec 2>> /shared/slurm/logs/epilog_job_errors.log
# Only process failed jobs
if [[ "${SLURM_JOB_EXIT_CODE:-1}" -ne 0 ]]; then
    echo "[FAIL] Job $SLURM_JOB_ID failed with code $SLURM_JOB_EXIT_CODE"
    # Extract NodeList
    scontrol show job "${SLURM_JOB_ID}" \
        | grep -oP 'NodeList=\K\S+' \
        | grep -v '^(null)$' \
        > "/shared/slurm/failed_jobs/failed_job_${SLURM_JOB_ID}.nodes"
    NODE_LIST=$(cat "/shared/slurm/failed_jobs/failed_job_${SLURM_JOB_ID}.nodes")
    # Trigger recovery orchestration (e.g., via Python script)
    # Example 1: For scripts that do not require a job ID and check the entire cluster
    /usr/bin/python3 /shared/slurm/scripts/cluster_health_orchestrator.py
    # Example 2: To check only the nodes associated with the failed job (for scripts that require the job ID as an argument)
    /usr/bin/python3 /shared/slurm/scripts/slurm_node_recovery.py --job-id "$SLURM_JOB_ID"
fi
Restart the slurmctld service as the root user:
systemctl restart slurmctld
To check that the Epilog configuration is valid and the service is running:
systemctl status slurmctld
If the slurmctld service fails to restart, check slurm.conf and the Epilog path for syntax errors.
Step 3: Save the Failed Node List
For debugging and traceability, I persist the list of nodes involved in the failed job to a flat file (or a database).
Once a job failure is confirmed, a Python orchestrator script (cluster_health_orchestrator.py), covered in Part 2, is launched. This script performs the following (a simplified sketch of this flow appears after the example commands below):
- Distributes health check scripts to nodes
- Runs diagnostics (e.g., GPU presence, NCCL bandwidth)
- Determines whether reboot is necessary
- Logs all steps in JSON format
Example command:
/usr/bin/python3 /shared/slurm/scripts/cluster_health_orchestrator.py
Example command 2 (when a node ID or job ID is required):
python3 /shared/slurm/scripts/slurm_node_recovery.py --job-id 123456 --node-file /shared/slurm/failed_jobs/failed_job_123456.nodes
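As a point of reference, here is a deliberately simplified bash sketch of the per-job flow the orchestrator implements: expand the saved node list, run a health check on each node, and append one JSON record per node for later reporting. The ssh-based check and the NHC path are illustrative assumptions only; the actual cluster_health_orchestrator.py covered in Part 2 is more complete.
#!/bin/bash
# Hypothetical, simplified illustration of the per-job health-check loop
JOB_ID="$1"
NODE_FILE="/shared/slurm/failed_jobs/failed_job_${JOB_ID}.nodes"
LOG="/shared/slurm/logs/health_check_${JOB_ID}.jsonl"

# scontrol show hostnames expands a Slurm hostlist (e.g., slurm01-hpc-[1-2]) into one hostname per line
while read -r node; do
    # Run a health check on the node; the path is an assumption - substitute your NHC entry point
    if ssh "$node" "sudo /shared/azurehpc-health-checks/run-health-checks.sh" > /dev/null 2>&1; then
        status="healthy"
    else
        status="unhealthy"
    fi
    # Append one JSON record per node for later HTML/CSV reporting
    printf '{"job_id": "%s", "node": "%s", "status": "%s", "checked_at": "%s"}\n' \
        "$JOB_ID" "$node" "$status" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$LOG"
done < <(scontrol show hostnames "$(cat "$NODE_FILE")")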
The Epilog hook above can be used to run arbitrary scripts, so before moving on to Part 2 I will cover one more standalone use of it (the Teams notification supplement below).
Troubleshooting
Check whether the Epilog script from Step 2 can actually be executed. If an error occurs during execution, log entries like the following will appear; in that case, review the Epilog script.
slurmctld.log at /var/log/slurmctld/slurmctld.log
[2025-05-18T10:54:28.190] error: job_epilog_complete: JobId=31 epilog error on slurm01-hpc-2, draining the node
[2025-05-18T10:54:28.190] drain_nodes: node slurm01-hpc-2 state set to DRAIN
[2025-05-18T10:54:28.190] error: _slurm_rpc_epilog_complete: epilog error JobId=31 Node=slurm01-hpc-2 Err=Job epilog failed
Please also review the following items:
- The path specified for Epilog in slurm.conf
- Whether the slurmctld service restarted and started successfully
- The Epilog script itself (can it run by itself? See the sketch after this list)
- The exit code of the test job
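To test the Epilog script by itself, run it with the environment variables Slurm would normally set. The values below are hypothetical examples; use a job ID that scontrol can still resolve so the NodeList lookup succeeds.
# Manually exercise the Epilog script with Slurm-style environment variables (example values)
sudo env SLURM_JOB_ID=31 SLURM_JOB_EXIT_CODE=1 SLURM_JOB_USER=azureuser \
    bash -x /shared/slurm/job_epilog.sh
echo "Epilog exit code: $?"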
Part 1 Supplement: Notifying Microsoft Teams on Slurm Job Failure via Webhook
While this series builds toward a fully automated workflow, this supplement also works as a standalone feature: it demonstrates how to send a Microsoft Teams notification when a Slurm job failure is detected, increasing real-time observability for operators.
This lightweight integration provides early visibility into problems, especially in large GPU clusters where silent failures can cascade and delay dependent jobs.
Use Case
- Alert operators when a job exits with a non-zero code
- Include metadata like job ID, user, exit code, and involved nodes
- Provide direct links to logs or dashboards for follow-up
Step-by-Step Implementation
Step#1: Create Incoming Webhook in Teams
- Open your Microsoft Teams channel.
- Click … next to the channel name -> Connectors.
- Search for and configure Incoming Webhook.
- Name it (e.g., Slurm Alerts) and upload an icon if desired.
- Copy the Webhook URL – it will look like:
https://outlook.office.com/webhook/…
For more on Incoming Webhooks in Teams, see this article.
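Before wiring the webhook into any script, a quick smoke test from the scheduler node confirms that the URL accepts messages (the URL below is a placeholder; use the one you copied):
# Post a simple test message to the Teams Incoming Webhook
curl -sS -H "Content-Type: application/json" \
    -d '{"text": "Slurm Alerts webhook test"}' \
    "https://outlook.office.com/webhook/..."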
Step#2: Python Script to Post Notifications
Here’s a sample script (/shared/slurm/scripts/notify_teams_failed_job.py – download) that sends a formatted alert to Teams:
#!/usr/bin/env python3
import sys
import json
import requests
from datetime import datetime
WEBHOOK_URL = "https://outlook.office.com/webhook/…"  # Replace with your actual URL

def notify_teams(job_id, user, exit_code, nodes):
    message = {
        "@type": "MessageCard",
        "@context": "https://schema.org/extensions",
        "themeColor": "FF0000",
        "summary": f"Slurm Job {job_id} Failed",
        "sections": [{
            "activityTitle": "Slurm Job Failure Detected",
            "facts": [
                {"name": "Job ID", "value": job_id},
                {"name": "User", "value": user},
                {"name": "Exit Code", "value": str(exit_code)},
                {"name": "Nodes", "value": ", ".join(nodes)},
                {"name": "Time", "value": datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
            ],
            "markdown": True
        }]
    }
    response = requests.post(WEBHOOK_URL, json=message)
    if response.status_code == 200:
        print("[INFO] Teams notification sent successfully.")
    else:
        print(f"[ERROR] Failed to send Teams notification: {response.status_code}, {response.text}", file=sys.stderr)

# Example usage
if __name__ == "__main__":
    job_id = sys.argv[1]
    user = sys.argv[2]
    exit_code = sys.argv[3]
    nodes = sys.argv[4:]  # remaining arguments are the node list
    notify_teams(job_id, user, exit_code, nodes)
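The script can be invoked by hand for testing, using the same values as the sample Teams output shown later in this section:
/usr/bin/python3 /shared/slurm/scripts/notify_teams_failed_job.py 48 azureuser 15 slurm00-htc-01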
Step#3: Call It from the Epilog Script
Update your Slurm Epilog script (/shared/slurm/job_epilog.sh – download) like this:
# Only process failed jobs
if [[ "${SLURM_JOB_EXIT_CODE:-1}" -ne 0 ]]; then
    echo "[FAIL] Job $SLURM_JOB_ID failed with code $SLURM_JOB_EXIT_CODE"
    # Extract NodeList
    scontrol show job "${SLURM_JOB_ID}" \
        | grep -oP 'NodeList=\K\S+' \
        | grep -v '^(null)$' \
        > "/shared/slurm/failed_jobs/failed_job_${SLURM_JOB_ID}.nodes"
    NODE_LIST=$(cat "/shared/slurm/failed_jobs/failed_job_${SLURM_JOB_ID}.nodes")
    # Send a Teams notification; $NODE_LIST is left unquoted so multiple nodes split into separate arguments
    /usr/bin/python3 /shared/slurm/scripts/notify_teams_failed_job.py \
        "$SLURM_JOB_ID" "$SLURM_JOB_USER" "$SLURM_JOB_EXIT_CODE" $NODE_LIST
fi
Sample Output in Teams
When a failure occurs, you’ll receive a message in Teams:
Slurm Job Failure Detected
- Job ID: 48
- User: azureuser
- Exit Code: 15
- Nodes: slurm00-htc-01
- Time: 2025-05-19 02:54:36
Future Enhancements or Considerations
- Include direct links to logs (e.g., HTML report or Grafana dashboard) within the Teams notification
- Severity-based formatting (e.g., green/yellow/red for different exit codes)
- Notify different Teams channels depending on partition or user group
- Retry or escalation policies based on repeated failures or node health history
Next in Series
In Part 2, I will explore:
- How to run distributed GPU diagnostics with NHC and NCCL tests
- How to capture and summarize health results in structured JSON
- Techniques for isolating root causes (e.g., PCIe flaps, InfiniBand port down)
Stay tuned for Part 2: Running GPU Health Checks and NCCL Diagnostics at Scale, where I'll dive deeper into distributed GPU health diagnostics with NHC and NCCL and show how to capture, summarize, and act on structured health data to drive automated recovery decisions.
Feedback & Contribution
This series is based on production-scale GPU clusters on Azure using Slurm, and contributions or feedback from the community are welcome. If you’ve built similar automation or are exploring recovery strategies at scale, let’s collaborate.