June 17, 2025
TLDR
Create an Azure Kubernetes Service cluster with GPU nodes and connect it to Azure Machine Learning to run distributed ML training workloads. This integration provides a managed data science platform while maintaining Kubernetes flexibility under the hood, enables multi-node training that spans multiple GPUs, and bridges the gap between infrastructure and ML teams. The solution works for both new and existing clusters, supporting specialized GPU hardware and hybrid scenarios.
Why Should You Care?
Integrating Azure Kubernetes Service (AKS) clusters with GPUs into Azure Machine Learning (AML) offers several key benefits:
- Utilize existing infrastructure: Leverage your existing AKS clusters with GPUs via a managed data science platform like AML
- Flexible resource sharing: Allow both AKS workloads and AML jobs to access the same GPU resources
- Organizational alignment: Bridge the gap between infrastructure teams (who prefer AKS) and ML teams (who prefer AML)
- Hybrid scenarios: Connect on-premises GPUs to AML using Azure Arc in a similar way to this tutorial
We are looking at multi-node training because most larger training jobs require it. If you only need a single GPU or a single VM, the same setup covers that case as well.
Prerequisites
Before you begin, ensure you have:
- Azure subscription with privileges to create and manage AKS clusters and to add compute targets in AML. We recommend placing the AKS and AML resources in the same region.
- Sufficient quota for GPU compute resources. See How to Increase Quota for Specific Types of Azure Virtual Machines for how to request quota. We are using two Standard_NC8as_T4_v3 VMs, so four T4s in total; you can also opt for other GPU-enabled compute.
- Azure CLI version 2.24.0 or higher (az upgrade)
- Azure CLI k8s-extension version 1.2.3 or higher (az extension update --name k8s-extension)
- kubectl installed and updated (a quick way to verify all of these is shown right after this list)
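If you want to confirm those versions up front, a quick check along these lines will do (a small sketch; the JMESPath queries are just one way to read the version fields):
az version --query '"azure-cli"' -o tsv # should be 2.24.0 or higher
az extension show --name k8s-extension --query version -o tsv # should be 1.2.3 or higher
kubectl version --client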
Step 1: Create an AKS Cluster with GPU Nodes
For Windows users, it’s recommended to use WSL (Ubuntu 22.04 or similar).
# Login to Azure
az login
# Create resource group
az group create -n ResourceGroup -l francecentral
# Create AKS cluster with a system node
az aks create -g ResourceGroup -n MyCluster \
  --node-vm-size Standard_D16s_v5 \
  --node-count 2 \
  --enable-addons monitoring
# Get cluster credentials
az aks get-credentials -g ResourceGroup -n MyCluster
# Add GPU node pool (Spot Instances are not recommended)
az aks nodepool add \
  --resource-group ResourceGroup \
  --cluster-name MyCluster \
  --name gpupool \
  --node-count 2 \
  --vm-size Standard_NC8as_T4_v3
# Verify cluster configuration
kubectl get namespaces
kubectl get nodes
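If you want to single out the GPU nodes, you can additionally filter on the agentpool label that AKS assigns to every node (an optional check; the label value matches the node pool name chosen above):
# List only the nodes from the GPU node pool
kubectl get nodes -l agentpool=gpupool -o wide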
Step 2: Install NVIDIA Device Plugin
Next, we need to make sure that our GPUs work exactly as expected. The NVIDIA Device Plugin is a Kubernetes plugin that enables the use of NVIDIA GPUs in containers running on Kubernetes clusters. It acts as a bridge between Kubernetes and the physical GPU hardware. Create and apply the NVIDIA device plugin to enable GPU access within AKS:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
To confirm that the GPUs work as expected, follow the steps and run a test workload as described in Use GPUs on Azure Kubernetes Service (AKS) – Azure Kubernetes Service | Microsoft Learn.
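If you prefer a quicker sanity check than the full test workload, something like the following should also work (a minimal sketch; the CUDA image tag is an assumption, any CUDA base image will do):
# The GPUs should now show up under Capacity/Allocatable on the GPU nodes
kubectl describe nodes -l agentpool=gpupool | grep -i "nvidia.com/gpu"
# Run nvidia-smi in a pod that requests one GPU
cat > gpu-smoke-test.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    agentpool: gpupool
  containers:
  - name: gpu-smoke-test
    image: nvidia/cuda:12.2.0-base-ubuntu22.04 # assumed tag; any CUDA base image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl apply -f gpu-smoke-test.yaml
# Once the pod has started, the nvidia-smi output confirms GPU access
kubectl logs -f gpu-smoke-test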
Step 3: Register the KubernetesConfiguration Provider
The KubernetesConfiguration Provider enables Azure to deploy and manage extensions on Kubernetes clusters, including the Azure Machine Learning extension. Before installing extensions, ensure the required resource provider is registered:
# Install the k8s-extension Azure CLI extension
az extension add --name k8s-extension
# Check if the provider is already registered
az provider list --query "[?contains(namespace,'Microsoft.KubernetesConfiguration')]" -o table
# If not registered, register it
az provider register --namespace Microsoft.KubernetesConfiguration
az account set --subscription <your-subscription-id>
az feature registration create --namespace Microsoft.KubernetesConfiguration --name ExtensionTypes
# Check the status after a few minutes and wait until it shows Registered
az feature show --namespace Microsoft.KubernetesConfiguration --name ExtensionTypes
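# Optional sketch: poll the feature state and continue once it shows "Registered"
while [ "$(az feature show --namespace Microsoft.KubernetesConfiguration --name ExtensionTypes --query properties.state -o tsv)" != "Registered" ]; do
  echo "Waiting for feature registration..."
  sleep 30
done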
# Install the Dapr extension
az k8s-extension create --cluster-type managedClusters \
  --cluster-name MyCluster \
  --resource-group ResourceGroup \
  --name dapr \
  --extension-type Microsoft.Dapr \
  --auto-upgrade-minor-version false
You can also check out the “Before you begin” section in Install the Dapr extension for Azure Kubernetes Service (AKS) and Arc-enabled Kubernetes – Azure Kubernetes Service | Microsoft Learn.
Step 4: Deploy the Azure Machine Learning Extension
Install the AML extension on your AKS cluster for training:
az k8s-extension create \
  --name azureml-extension \
  --extension-type Microsoft.AzureML.Kubernetes \
  --config enableTraining=True enableInference=False \
  --cluster-type managedClusters \
  --cluster-name MyCluster \
  --resource-group ResourceGroup \
  --scope cluster
Several options are available for the extension installation; they are listed in Deploy Azure Machine Learning extension on Kubernetes cluster – Azure Machine Learning | Microsoft Learn.
Verify Extension Deployment
# Check the provisioning state of the extension
az k8s-extension show \
  --name azureml-extension \
  --cluster-type managedClusters \
  --cluster-name MyCluster \
  --resource-group ResourceGroup
# Check that the pods in the azureml namespace are running
kubectl get pods -n azureml
The extension is successfully deployed when provisioning state shows “Succeeded” and all pods in the “azureml” namespace are in the “Running” state.
Step 5: Create a GPU-Enabled Instance Type
By default, AML only has access to an instance type that doesn’t include GPU resources. Create a custom instance type to utilize your GPUs:
# Create a custom instance type definition
cat > t4-full-node.yaml << EOF
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: t4-full-node
spec:
  nodeSelector:
    agentpool: gpupool
    kubernetes.azure.com/accelerator: nvidia
  resources:
    limits:
      cpu: "6"
      nvidia.com/gpu: 2 # Integer value equal to the number of GPUs
      memory: "55Gi"
    requests:
      cpu: "6"
      memory: "55Gi"
EOF
# Apply the instance type
kubectl apply -f t4-full-node.yaml
This configuration creates an instance type that claims a full GPU node, including both of its T4 GPUs, making it ideal for ML training jobs.
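To double-check that the cluster accepted the custom resource, you can query it directly (an optional check; the InstanceType kind comes from the CRDs installed by the AML extension):
kubectl get instancetype t4-full-node -o yaml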
Step 6: Attach the Cluster to Azure Machine Learning
Once your instance type is created, you can attach the AKS cluster to your AML workspace:
- In the Azure Machine Learning Studio, navigate to Compute > Kubernetes clusters
- Click New and select your AKS cluster
- Specify your custom instance type (“t4-full-node”) when configuring the compute target
- Complete the attachment process following the UI workflow
Alternatively, you can use the Azure CLI or Python SDK to attach the cluster programmatically; see Attach a Kubernetes cluster to Azure Machine Learning workspace – Azure Machine Learning | Microsoft Learn.
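For reference, a CLI-based attach could look roughly like this (a sketch, assuming the ml CLI extension is installed; the workspace name, compute name, and subscription ID below are placeholders to replace with your own values):
az extension add --name ml
az ml compute attach \
  --resource-group ResourceGroup \
  --workspace-name <your-aml-workspace> \
  --type Kubernetes \
  --name k8s-gpu-compute \
  --resource-id "/subscriptions/<your-subscription-id>/resourceGroups/ResourceGroup/providers/Microsoft.ContainerService/managedClusters/MyCluster" \
  --identity-type SystemAssigned
The resource ID is simply the ARM ID of the AKS cluster created in Step 1.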
Step 7: Test Distributed Training
With your GPU-enabled AKS cluster now attached to AML, you can:
- Create an AML experiment that uses distributed training
- Specify your custom instance type in the training configuration
- Submit the job to take advantage of multi-node GPU capabilities
You can now run advanced ML workloads like distributed deep learning, which requires multiple GPUs across nodes, all managed through the AML platform.
To submit such a job, you simply need to specify the compute name, the registered instance_type, and the number of instances.
As an example, clone yuvmaz/aml_labs: Labs to showcase the capabilities of Azure ML and switch to Lab 4 – Foundations of Distributed Deep Learning. Lab 4 introduces how distributed training works in general and in AML. In the Jupyter Notebook that guides you through that tutorial, you will find that the first job definition is in simple_environment.yaml. Open this file and make the following adjustments to use the AKS compute target:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: env | sort | grep -e 'WORLD' -e 'RANK' -e 'MASTER' -e 'NODE'
environment:
  image: library/python:latest
distribution:
  type: pytorch
  process_count_per_instance: 2 # We use 2 GPUs per node, across GPUs within a node
compute: azureml:<your-compute-name> # Name of the attached Kubernetes compute target
resources:
  instance_count: 2 # We want two VMs/instances in total, across nodes
  instance_type: t4-full-node # The custom instance type registered in Step 5
display_name: simple-env-vars-display
experiment_name: distributed-training-foundations
You can proceed in the same way for all other distributed training jobs.
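Submitting the adjusted job definition from the CLI could then look like this (a sketch, assuming the ml CLI extension is installed and the workspace name is replaced with your own):
az ml job create --file simple_environment.yaml \
  --resource-group ResourceGroup \
  --workspace-name <your-aml-workspace>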
Conclusion
By integrating AKS clusters with GPUs into Azure Machine Learning, you get the best of both worlds – the container orchestration and infrastructure capabilities of Kubernetes with the ML workflow management features of AML. This setup is particularly valuable for organizations that want to:
- Maximize GPU utilization across both operational and ML workloads
- Provide data scientists with self-service access to GPU resources
- Establish a consistent ML platform that spans both cloud and on-premises resources
For production deployments, consider implementing additional security measures, networking configurations, and monitoring solutions appropriate for your organization’s requirements.
Thanks a lot to Yuval Mazor and Alan Weaver for their collaboration on this blog post.