Overview
As GPU clusters grow in scale, failure recovery becomes a critical part of maintaining workload resiliency and maximizing compute resource utilization. In this article series, I’ll walk through how to build an automated recovery system for a Slurm-managed GPU cluster running on Microsoft Azure. This system detects job failures, identifies unhealthy nodes, performs reboot-based remediation, and reintegrates healthy nodes back into the cluster—all while logging and notifying operators of persistent issues.
The overall recovery pipeline consists of:
- Slurm Job Failure Detection
- Health Diagnostics (NHC, NCCL Bandwidth Tests)
- Automated Reboot and Retry Logic
- Node State Recording & HTML/CSV Reporting
- GHR (Guest Health Reporting) Integration
- Final Node Status & Requeueing Failed Jobs
Supplemental terminology
- GHR (Guest Health Reporting): an API-based mechanism for securely exchanging the health (validity) status of nodes.
- NHC: Node Health Check script: https://github.com/Azure/azurehpc-health-checks (a quick manual-run sketch follows this list)
- Enable NHC in CycleCloud (Slurm -> Advanced Settings -> NHC)
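As a quick sanity check outside of any automation, the Azure NHC suite can also be run by hand on a suspect node. A minimal sketch, assuming the azurehpc-health-checks repository has been cloned to a shared path (the clone location is an assumption, and the entry-point script name may differ between releases):
# Run the Azure NHC suite manually on the current node and keep a copy of the output
cd /shared/azurehpc-health-checks   # assumed clone location
sudo ./run-health-checks.sh | tee /shared/slurm/logs/nhc_$(hostname).log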
See also the slurm-cluster-health-manager repository for more information on this article.
Part 1: Detecting Slurm Job Failures and Triggering the Recovery Workflow
The foundation of our automated recovery system is the ability to detect failed jobs at runtime and trigger a health-check workflow for involved nodes.
Prerequisites
- Use an Azure CycleCloud (CC) Slurm cluster.
- Azure CycleCloud's Slurm template uses /shared for user home directories and /sched as the Slurm scheduler's persistent disk, by convention.
- For example, if the admin user is azureuser, then /shared/home/azureuser/ is that user's home directory.
- The scripts invoked on Slurm job failure must be accessible from the compute nodes, so they must be placed in a shared directory.
- The current Slurm template in Azure CycleCloud (CC) places the scheduler configuration, including slurm.conf, under /sched/<cluster name>/. Note that the exact directory therefore depends on the cluster name.
- Epilog requires Slurm Accounting; please read this article to set it up (changes since that article require the creation of a new DB table). A quick verification sketch follows at the end of these prerequisites.
For more information on how to use Azure CycleCloud and Slurm in general, please refer to this guide.
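Since the Epilog-based detection relies on job exit codes recorded through Slurm accounting, a quick way to verify accounting is wired up before proceeding (a hedged check using standard Slurm tooling; replace <job_id> with a recent job ID on your cluster):
# Confirm the cluster is registered with slurmdbd (accounting is active)
sacctmgr show cluster
# Confirm exit codes are being recorded for completed jobs
sacct -X -o JobID,JobName,State,ExitCode -j <job_id>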
Step 1: Set Up the Epilog Function
Slurm provides several hooks that run at job start and completion. Here I will set up the Epilog hook at a minimum. See this URL for a general reference.
Add the following line to slurm.conf at /sched/<cluster name>/slurm.conf (e.g., /sched/slurm01/slurm.conf):
Epilog=/shared/slurm/job_epilog.sh
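Once slurmctld has been restarted (see the end of Step 2), you can confirm that the setting was picked up:
# Verify the live Epilog configuration in slurmctld
scontrol show config | grep -i epilog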
Step 2: Use Epilog Scripts to Detect Failures
Slurm’s Job Epilog script allows us to hook into job termination and check the exit status. The script is executed after the job completes (whether successfully or not).
Sample Epilog Snippet (/shared/slurm/job_epilog.sh – download):
#!/bin/bash
set -euo pipefail
exec 2>> /shared/slurm/logs/epilog_job_errors.log
# Only process failed jobs
if [[ "${SLURM_JOB_EXIT_CODE:-1}" -ne 0 ]]; then
    echo "[FAIL] Job $SLURM_JOB_ID failed with code $SLURM_JOB_EXIT_CODE"
    # Extract NodeList
    scontrol show job "${SLURM_JOB_ID}" \
        | grep -oP 'NodeList=\K\S+' \
        | grep -v '^(null)$' \
        > "/shared/slurm/failed_jobs/failed_job_${SLURM_JOB_ID}.nodes"
    NODE_LIST=$(cat "/shared/slurm/failed_jobs/failed_job_${SLURM_JOB_ID}.nodes")
    # Trigger recovery orchestration (e.g., via Python script)
    # Example 1: For scripts that do not require a job ID and check the entire cluster
    /usr/bin/python3 /shared/slurm/scripts/cluster_health_orchestrator.py
    # Example 2: To check only the nodes associated with the failed job (for scripts that require the job ID as an argument)
    /usr/bin/python3 /shared/slurm/scripts/slurm_node_recovery.py --job-id "$SLURM_JOB_ID"
fi
Restart the slurmctld service as the root user:
systemctl restart slurmctld
To check that the Epilog configuration is valid and the service is running:
systemctl status slurmctld
If the slurmctld service fails to restart, check slurm.conf and the Epilog path for syntax errors.
Step 3: Save the Failed Node List
For debugging and traceability, I persist the list of nodes involved in the failed job to a flat file (or a database).
Once a job failure is confirmed, a Python orchestrator script (cluster_health_orchestrator.py), covered in Part 2, is launched. This script performs the following (a simplified sketch of this flow appears after the example commands below):
- Distributes health check scripts to nodes
- Runs diagnostics (e.g., GPU presence, NCCL bandwidth)
- Determines whether reboot is necessary
- Logs all steps in JSON format
Example command:
/usr/bin/python3 /shared/slurm/scripts/cluster_health_orchestrator.py
Example command 2 (when a node ID or job ID is required):
python3 /shared/slurm/scripts/slurm_node_recovery.py --job-id 123456 --node-file /shared/slurm/failed_jobs/failed_job_123456.nodes
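As a point of reference, here is a deliberately simplified bash sketch of the per-job flow the orchestrator implements: expand the saved node list, run a health check on each node, and append one JSON record per node for later reporting. The ssh-based check and the NHC path are illustrative assumptions only; the actual cluster_health_orchestrator.py covered in Part 2 is more complete.
#!/bin/bash
# Hypothetical, simplified illustration of the per-job health-check loop
JOB_ID="$1"
NODE_FILE="/shared/slurm/failed_jobs/failed_job_${JOB_ID}.nodes"
LOG="/shared/slurm/logs/health_check_${JOB_ID}.jsonl"

# scontrol show hostnames expands a Slurm hostlist (e.g., slurm01-hpc-[1-2]) into one hostname per line
while read -r node; do
    # Run a health check on the node; the path is an assumption - substitute your NHC entry point
    if ssh "$node" "sudo /shared/azurehpc-health-checks/run-health-checks.sh" > /dev/null 2>&1; then
        status="healthy"
    else
        status="unhealthy"
    fi
    # Append one JSON record per node for later HTML/CSV reporting
    printf '{"job_id": "%s", "node": "%s", "status": "%s", "checked_at": "%s"}\n' \
        "$JOB_ID" "$node" "$status" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$LOG"
done < <(scontrol show hostnames "$(cat "$NODE_FILE")")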
The Epilog hook above can be used to run arbitrary scripts, so before moving on to Part 2 I will cover one more standalone use of it (the Teams notification supplement below).
Troubleshooting
Check whether the Epilog script from Step 2 can actually be executed. If an error occurs during execution, log entries like the following will appear; in that case, review the Epilog script.
slurmctld.log at /var/log/slurmctld/slurmctld.log
[2025-05-18T10:54:28.190] error: job_epilog_complete: JobId=31 epilog error on slurm01-hpc-2, draining the node
[2025-05-18T10:54:28.190] drain_nodes: node slurm01-hpc-2 state set to DRAIN
[2025-05-18T10:54:28.190] error: _slurm_rpc_epilog_complete: epilog error JobId=31 Node=slurm01-hpc-2 Err=Job epilog failed
Please also review the following items:
- The path specified for Epilog in slurm.conf
- Whether the slurmctld service restarted and started successfully
- The Epilog script itself (can it run by itself? See the sketch after this list)
- The exit code of the test job
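To test the Epilog script by itself, run it with the environment variables Slurm would normally set. The values below are hypothetical examples; use a job ID that scontrol can still resolve so the NodeList lookup succeeds.
# Manually exercise the Epilog script with Slurm-style environment variables (example values)
sudo env SLURM_JOB_ID=31 SLURM_JOB_EXIT_CODE=1 SLURM_JOB_USER=azureuser \
    bash -x /shared/slurm/job_epilog.sh
echo "Epilog exit code: $?"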
Part 1 Supplement: Notifying Microsoft Teams on Slurm Job Failure via Webhook
While this series builds toward a fully automated workflow, this supplement also works as a standalone feature: it demonstrates how to send a Microsoft Teams notification when a Slurm job failure is detected, increasing real-time observability for operators.
This lightweight integration provides early visibility into problems, especially in large GPU clusters where silent failures can cascade and delay dependent jobs.
Use Case
- Alert operators when a job exits with a non-zero code
- Include metadata like job ID, user, exit code, and involved nodes
- Provide direct links to logs or dashboards for follow-up
Step-by-Step Implementation
Step#1: Create Incoming Webhook in Teams
- Open your Microsoft Teams channel.
- Click … next to the channel name -> Connectors.
- Search for and configure Incoming Webhook.
- Name it (e.g., Slurm Alerts) and upload an icon if desired.
- Copy the Webhook URL – it will look like:
https://outlook.office.com/webhook/…
For more on Incoming Webhooks in Teams, see this article.
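Before wiring the webhook into any script, a quick smoke test from the scheduler node confirms that the URL accepts messages (the URL below is a placeholder; use the one you copied):
# Post a simple test message to the Teams Incoming Webhook
curl -sS -H "Content-Type: application/json" \
    -d '{"text": "Slurm Alerts webhook test"}' \
    "https://outlook.office.com/webhook/..."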
Step#2: Python Script to Post Notifications
Here’s a sample script (/shared/slurm/scripts/notify_teams_failed_job.py – download) that sends a formatted alert to Teams:
#!/usr/bin/env python3
import sys
import json
import requests
from datetime import datetime
WEBHOOK_URL = "https://outlook.office.com/webhook/…"  # Replace with your actual URL

def notify_teams(job_id, user, exit_code, nodes):
    message = {
        "@type": "MessageCard",
        "@context": "https://schema.org/extensions",
        "themeColor": "FF0000",
        "summary": f"Slurm Job {job_id} Failed",
        "sections": [{
            "activityTitle": "Slurm Job Failure Detected",
            "facts": [
                {"name": "Job ID", "value": job_id},
                {"name": "User", "value": user},
                {"name": "Exit Code", "value": str(exit_code)},
                {"name": "Nodes", "value": ", ".join(nodes)},
                {"name": "Time", "value": datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
            ],
            "markdown": True
        }]
    }
    response = requests.post(WEBHOOK_URL, json=message)
    if response.status_code == 200:
        print("[INFO] Teams notification sent successfully.")
    else:
        print(f"[ERROR] Failed to send Teams notification: {response.status_code}, {response.text}", file=sys.stderr)

# Example usage
if __name__ == "__main__":
    job_id = sys.argv[1]
    user = sys.argv[2]
    exit_code = sys.argv[3]
    nodes = sys.argv[4:]  # remaining arguments are the node list
    notify_teams(job_id, user, exit_code, nodes)
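The script can be invoked by hand for testing, using the same values as the sample Teams output shown later in this section:
/usr/bin/python3 /shared/slurm/scripts/notify_teams_failed_job.py 48 azureuser 15 slurm00-htc-01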
Step#3: Call It from the Epilog Script
Update your Slurm Epilog script (/shared/slurm/job_epilog.sh – download) like this:
# Only process failed jobs
if [[ "${SLURM_JOB_EXIT_CODE:-1}" -ne 0 ]]; then
    echo "[FAIL] Job $SLURM_JOB_ID failed with code $SLURM_JOB_EXIT_CODE"
    # Extract NodeList
    scontrol show job "${SLURM_JOB_ID}" \
        | grep -oP 'NodeList=\K\S+' \
        | grep -v '^(null)$' \
        > "/shared/slurm/failed_jobs/failed_job_${SLURM_JOB_ID}.nodes"
    NODE_LIST=$(cat "/shared/slurm/failed_jobs/failed_job_${SLURM_JOB_ID}.nodes")
    # Send a Teams notification; $NODE_LIST is left unquoted so multiple nodes split into separate arguments
    /usr/bin/python3 /shared/slurm/scripts/notify_teams_failed_job.py \
        "$SLURM_JOB_ID" "$SLURM_JOB_USER" "$SLURM_JOB_EXIT_CODE" $NODE_LIST
fi
Sample Output in Teams
When a failure occurs, you’ll receive a message in Teams:
Slurm Job Failure Detected
- Job ID: 48
- User: azureuser
- Exit Code: 15
- Nodes: slurm00-htc-01
- Time: 2025-05-19 02:54:36
Future Enhancements or Considerations
- Include direct links to logs (e.g., HTML report or Grafana dashboard) within the Teams notification
- Severity-based formatting (e.g., green/yellow/red for different exit codes)
- Notify different Teams channels depending on partition or user group
- Retry or escalation policies based on repeated failures or node health history
Next in Series
In Part 2, I will explore:
- How to run distributed GPU diagnostics with NHC and NCCL tests
- How to capture and summarize health results in structured JSON
- Techniques for isolating root causes (e.g., PCIe flaps, InfiniBand port down)
Stay tuned for Part 2: Running GPU Health Checks and NCCL Diagnostics at Scale, where I'll dive deeper into distributed GPU health diagnostics with NHC and NCCL and show how to capture, summarize, and act on structured health data to drive automated recovery decisions.
Feedback & Contribution
This series is based on production-scale GPU clusters on Azure using Slurm, and contributions or feedback from the community are welcome. If you’ve built similar automation or are exploring recovery strategies at scale, let’s collaborate.