
May 15, 2025
High Performance Computing Cluster:
A high-performance computing (HPC) cluster is a collection of interconnected computers (nodes) that work together to perform complex computational tasks at high speeds, far beyond the capabilities of a single machine. These clusters are designed to handle large-scale, data-intensive, or computationally demanding workloads, such as scientific simulations, big data analysis, machine learning, or weather forecasting.
Azure CycleCloud:
Azure CycleCloud is an enterprise-friendly tool for orchestrating and managing High Performance Computing (HPC) environments on Azure. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically scale the infrastructure to run jobs efficiently at any scale. Through CycleCloud, users can create different types of file systems and mount them to the compute cluster nodes to support HPC workloads.
Azure CycleCloud is targeted at HPC administrators and users who want to deploy an HPC environment with a specific scheduler in mind; commonly used schedulers such as Slurm, PBSPro, LSF, Grid Engine, and HTCondor are supported out of the box. CycleCloud is the sister product to Azure Batch, which provides a Scheduler as a Service on Azure.
Azure CycleCloud Workspace for Slurm:
Azure CycleCloud Workspace for Slurm is an Azure Marketplace solution template that allows users to easily create, configure, and deploy pre-defined Slurm clusters with CycleCloud on Azure, without requiring any prior knowledge of Azure or Slurm. Slurm clusters will be pre-configured with PMix v4, Pyxis, and enroot to support containerized AI/HPC Slurm jobs. Users can access the provisioned login node using SSH or Visual Studio Code to perform common tasks like submitting and managing Slurm jobs.
CycleCloud Architecture:
Components of CycleCloud Cluster:
- Login Node: The entry point for users to interact with the cluster, where jobs are submitted and managed.
- Scheduler Node: Runs the Slurm controller (slurmctld) to manage job scheduling and resource allocation.
- Compute Nodes: Dynamically provisioned virtual machines (VMs) that execute the jobs.
- Networking: Includes a virtual network (VNet) with subnets for CycleCloud, compute, storage, and Azure Bastion for secure access.
- Storage: Supports options like Azure NetApp Files or Azure Managed Lustre Filesystem.
- Azure CycleCloud: Orchestrates the cluster, handling provisioning, scaling, and monitoring.
- Slurm partitions: Slurm (Simple Linux Utility for Resource Management) partitions are logical groupings of compute nodes within a Slurm-managed cluster. They organize resources to handle specific types of workloads, enabling efficient job scheduling, resource allocation, and access control. Each partition can have unique configurations like node types, resource limits, and scheduling policies.
HPC (High-Performance Computing) Partition:
Optimized for tightly coupled, parallel workloads requiring multiple nodes to communicate efficiently, often using Message Passing Interface (MPI) or similar frameworks.
Leverages Azure’s HPC-optimized VMs (e.g., HBv3 or HC-series) with InfiniBand (200 Gb/s HDR) for low-latency, high-bandwidth interconnects.
GPU Partition:
Dedicated to accelerated computing workloads requiring GPU hardware for parallel processing.
Use case: Deep learning training and inference.
Uses GPU-enabled VMs (e.g., NCv3 with V100 GPUs, NDv2 with A100 GPUs).
HTC Partition:
Tailored for independent, single-node, high-volume workloads that process numerous tasks in parallel, often embarrassingly parallel jobs.
Runs on general-purpose or memory-optimized VMs (e.g., D-series, E-series) with moderate CPU/memory needs.
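For illustration, here is roughly how a job is directed to each of these partitions at submission time. The partition names (hpc, htc, gpu) follow the workspace defaults used later in this post; the script names and task counts are placeholders:

```bash
# Hypothetical job scripts; adjust partition names and resource counts to your deployment.
sbatch --partition=hpc --nodes=2 --ntasks-per-node=120 mpi_job.sh   # tightly coupled MPI across HPC nodes
sbatch --partition=gpu --gres=gpu:1 train.sh                        # single-GPU training job
sbatch --partition=htc --array=1-100 task.sh                        # embarrassingly parallel array of tasks
```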
Azure CycleCloud Workspace for Slurm Deployment:
Before deploying the Azure CycleCloud Workspace for Slurm, ensure the following are in place:
- Azure Subscription:
- An active Azure subscription with sufficient quotas for VMs (e.g., Standard_D4s_v3 for CycleCloud, HBv3 for HPC, NCv3 for GPU).
- Check quotas in the Azure Portal under Subscriptions > Usage + Quotas.
- Permissions:
- Contributor or Owner role on the subscription to create resources (e.g., VNets, VMs, storage).
- Network Contributor role if modifying existing VNets or peering.
- SSH Keys:
- On your local machine, generate an SSH key pair using the ssh-keygen command:
- Generate an SSH Key Pair:
ssh-keygen -t rsa -b 4096 -f ~/.ssh/azure_hpc_key
- This creates a private key (azure_hpc_key) and a public key (azure_hpc_key.pub).
- For automation, leave the passphrase empty; if you do set one, store it securely.
- Store the Public Key:
- Copy the contents of the public key file (~/.ssh/azure_hpc_key.pub) for use during deployment.
- Example content: ssh-rsa AAAAB3NzaC1yc2E… user@machine
- Purpose:
- The public key will be added to the login node (and other nodes) to allow secure SSH access via Azure Bastion.
- The private key will be used by your local machine to authenticate when connecting.
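A minimal sketch of preparing the key pair generated above (paths follow the earlier ssh-keygen example):

```bash
# Restrict access to the private key so SSH will accept it.
chmod 600 ~/.ssh/azure_hpc_key
# Print the public key; paste this value into the deployment's SSH public key field.
cat ~/.ssh/azure_hpc_key.pub
```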
- Azure CLI or Portal Access:
- Use the Azure Portal for a UI-based deployment or the Azure CLI for scripting.
- Ensure you are logged in (az login) if using the CLI.
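A short sketch, assuming the Azure CLI is installed; the subscription ID, region, and VM family names are placeholders to adjust for your environment:

```bash
# Sign in and select the target subscription.
az login
az account set --subscription "<subscription-id>"

# Check regional vCPU quota for the VM families you plan to use.
az vm list-usage --location eastus --output table | grep -iE "HBv3|NCSv3|DSv3"
```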
Deployment Steps:
Step 1: Access the Azure CycleCloud Workspace for Slurm in Azure Marketplace
- Navigate to Azure Marketplace:
- In the Azure Portal, go to Create a Resource > Search for “Azure CycleCloud Workspace for Slurm”.
- Select the offering by Microsoft.
- Click Create to begin the deployment process.
Step 2: Configure Basic Settings
Step 3: Configure Filesystem settings
Step 4: Configure Network settings
Step 5: Configure Slurm settings
- Select the scheduler node size, image, and Slurm version.
- If Slurm job accounting is required, check the box and provide the existing MySQL database details.
- Select the login node size and the minimum and maximum number of login nodes.
- Note: The login nodes are provisioned as a Virtual Machine Scale Set.
Step 6: Configure the Partition settings
Select the size and node count for the partitions.
- HTC: For non-MPI jobs (e.g., Standard F-series VMs; supports Spot instances).
- HPC: For MPI jobs (e.g., HBv3-series VMs with InfiniBand).
- GPU: For GPU workloads (e.g., NCv3-series VMs with NVIDIA GPUs).
Step 7: Add tags as required.
Step 8: Select Review + Create.
After this, the CycleCloud cluster will be created.
To check the status of the cluster, follow the steps below.
Connect to CycleCloud VM:
Use SSH to connect to the CycleCloud VM from your local machine.
Check the cluster status with `cyclecloud show_cluster`.
If the command returns an error, it is because the CycleCloud CLI has not been initialized.
Initialize CycleCloud using `cyclecloud initialize`.
Enter the username and password set when creating the cluster.
Run `cyclecloud show_cluster` again to verify the cluster status.
The “ccw” cluster has been initiated and consists of two nodes: one scheduler node and one login node.
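A quick sketch of that session, assuming the cluster was deployed with the default name ccw:

```bash
# On the CycleCloud VM: one-time CLI setup, then query the cluster.
cyclecloud initialize        # prompts for the CycleCloud URL and the credentials set at deployment
cyclecloud show_cluster ccw  # summarizes the ccw cluster with its scheduler and login nodes
```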
- CycleCloud Cluster Web UI:
Open your browser and navigate to https://<CycleCloud VM IP>.
You will be prompted for the username and password. Once you have provided the necessary credentials, you should be able to view the cluster.
In the screenshot below, we can observe that the cluster ccw is currently started and includes both a login node and a scheduler node.
We can connect to both nodes, as shown in the screenshots below.
Connecting to the Scheduler Node:
If an SSH key pair is configured, you can connect using `ssh username@<scheduler-node-ip>`
You can also connect from the CycleCloud VM using the CycleCloud CLI: `cyclecloud connect scheduler -c ccw`
Similarly, we can also connect to the login nodes.
Connecting to Login Nodes:
- Login nodes are provisioned as Virtual Machine Scale Sets.
If an SSH key pair is configured, you can connect using `ssh username@<login-node-ip>`
You can also connect from the CycleCloud VM using the CycleCloud CLI: `cyclecloud connect login-1 -c ccw`
Note: All requests are submitted to the cluster in the form of jobs from the login node.
Job Submission Process on the Login Node
When a user submits a job on the login node, the following steps occur:
- Accessing the Login Node:
Users connect to the login node via SSH from the CycleCloud VM or their local machine, or through Visual Studio Code, typically via Azure Bastion, as direct SSH from external networks is disabled by default for security.
The login node is pre-configured with Slurm client tools, allowing users to interact with the Slurm scheduler.
- Submitting a Job:
Users submit jobs using Slurm commands like `sbatch` (for batch scripts), `srun` (for interactive jobs), or `salloc` (for resource allocation). For example:
```bash
sbatch test-job.sh
```
The job script specifies resource requirements (e.g., number of nodes, CPUs, GPUs, memory, partition) and the tasks to execute.
The login node forwards the job request to the Slurm scheduler (slurmctld) running on the scheduler node.
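For reference, the three submission styles mentioned above look roughly like this; the script and command names are placeholders:

```bash
# Batch: queue a script and return immediately.
sbatch test-job.sh

# Interactive: run a command on allocated resources and stream its output.
srun --partition=htc --ntasks=1 hostname

# Allocation: reserve resources first, then run commands (e.g., srun) inside the allocation.
salloc --partition=hpc --nodes=2
```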
- Job Scheduling and Resource Allocation:
The Slurm scheduler processes the job based on:
- Resource Requirements: Matches the job's needs (e.g., CPU, GPU, memory) to available partitions (HTC for non-MPI, HPC for MPI, GPU for GPU-based jobs).
- Queue and Priority: Determines the job's position in the queue based on priority and fairness policies.
- Cluster State: Checks the availability of compute nodes.
If resources are unavailable, the scheduler communicates with Azure CycleCloud to provision new compute nodes dynamically.
- Dynamic Provisioning by Azure CycleCloud:
Azure CycleCloud monitors the Slurm job queue and uses autoscaling to provision or deprovision VMs based on demand.
When a job requires additional resources:
- CycleCloud creates new compute nodes in the appropriate partition (e.g., HPC, HTC, or GPU) using predefined VM types and images (e.g., Azure HPC images).
- Nodes are configured with the necessary software (e.g., Slurm, PMix, Pyxis, Enroot) via cluster-init scripts.
- The `azslurm` CLI on the scheduler node updates the Slurm configuration (`slurm.conf`, `gres.conf`) to include the new nodes.
- Once provisioned, the scheduler assigns the job to the new nodes.
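As a rough sketch of what this looks like on the scheduler node (the azslurm subcommands below come from CycleCloud's Slurm integration; verify them against the version in your deployment):

```bash
# Run on the scheduler node with elevated privileges.
sudo azslurm buckets   # show the VM sizes ("buckets") CycleCloud can provision per partition
sudo azslurm scale     # regenerate slurm.conf/gres.conf so Slurm sees the current node definitions
```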
- Job Execution:
The Slurm controller allocates the job to the assigned compute nodes.
Compute nodes mount shared filesystems (e.g., Azure NetApp Files or Lustre) to access input data and store output.
For containerized jobs, Pyxis and Enroot manage container execution, leveraging PMix for MPI support.
The job runs, and Slurm tracks its progress, logging details like resource usage and job status.
- Monitoring and Management:
Users can monitor job status on the login node using Slurm commands like `squeue`, `sinfo`, or `sacct`.
The Azure CycleCloud GUI provides cluster-wide metrics, such as node status, performance, and utilization.
If job accounting is enabled (e.g., with Azure Database for MySQL), detailed usage data is stored for analysis.
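For example, assuming the default partition names and that job accounting is enabled:

```bash
squeue -u $USER                                            # your queued and running jobs
sinfo -p hpc,htc,gpu                                       # node availability per partition
sacct -j <job-id> --format=JobID,State,Elapsed,AllocCPUS   # accounting details for a finished job
```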
- Resource Deallocation:
Once the job completes, Slurm marks the compute nodes as idle.
Azure CycleCloud scales down the cluster by deprovisioning idle nodes (unless configured otherwise, e.g., with `SuspendTime=-1`).
This ensures cost efficiency by only maintaining active resources.
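A hedged sketch of that configuration knob; the slurm.conf path is an assumption, and because the file is generated by CycleCloud, persistent changes should be made through the cluster configuration rather than by hand:

```bash
# On the scheduler node: check the current suspend setting (path may differ in your deployment).
grep -i "^SuspendTime" /etc/slurm/slurm.conf
# A value of -1 disables node suspension, so idle nodes are not deprovisioned:
#   SuspendTime=-1
```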
Example of Job submission:
Connect to the Login Node as follows:
To connect to the login node from a CycleCloud virtual machine, ensure that the private key is available with the necessary permissions.
Use the SSH command: `ssh -i <private-key> hpcadmin@<login-node-private-ip>`
Upon successfully connecting to the login node, you should be able to execute Slurm commands.
`sinfo` – Provides information about the Slurm cluster.
This output displays the partitions (HPC & HTC), and both are in an idle state.
`sinfo -l` – Detailed information about the state of partitions and nodes in the cluster.
Once the job is submitted, the partition state changes.
Create a sample script for the HPC partition (hello-world.slurm); the full script is shown at the end of this post.
Submit the job using `sbatch hello-world.slurm`.
Use `squeue` to monitor job status, as shown in the sketch below.
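Putting the walkthrough together, the end-to-end sequence on the login node looks like this (using the hello-world.slurm script shown at the end of this post):

```bash
sbatch hello-world.slurm   # submit the job; Slurm prints the assigned job ID
squeue                     # watch the job go from pending (PD/CF) to running (R)
sinfo -p hpc               # hpc nodes go from powered-down idle (idle~) to allocated while the job runs
cat hello-world.out        # after completion, the output file appears in the shared path
```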
Status from the cyclecloud Portal:
To connect to the HPC Node:
Activity Logs from Portal:
Output of the Job :
- A new file, hello-world.out, has been created.
- Since the job is completed, the dynamic nodes (HPC) get terminated.
The output is stored in a shared path that is accessible from all nodes in the cluster.
Sample Script:
```bash
#!/bin/bash
# The shebang line above tells the kernel to interpret this script with bash.
#SBATCH --job-name=hello_world    # Job name to identify
#SBATCH --output=hello-world.out  # Output file
#SBATCH --partition=hpc           # Partition name to use
#SBATCH --nodes=1                 # Number of nodes
#SBATCH --ntasks=1                # Number of tasks
#SBATCH --time=00:05:00           # Max runtime (HH:MM:SS)

echo "Hello World"
```
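For a tightly coupled job on the HPC partition, a multi-node variant of the same script might look like the sketch below; the application binary ./mpi_app is a placeholder, and the launch details depend on your MPI stack:

```bash
#!/bin/bash
#SBATCH --job-name=mpi_hello
#SBATCH --output=mpi_hello.out
#SBATCH --partition=hpc          # InfiniBand-connected HPC nodes
#SBATCH --nodes=2                # span two nodes
#SBATCH --ntasks-per-node=4      # MPI ranks per node; tune to the VM's core count
#SBATCH --time=00:10:00

srun ./mpi_app                   # srun launches the MPI ranks, typically via PMIx on this cluster
```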
Conclusion:
This blog has outlined the steps for deploying Azure CycleCloud Workspace for Slurm, the role of each component, the backend process of job submission, logging into the cluster, connecting to the nodes, and submitting jobs.