June 17, 2025
TLDR
Create an Azure Kubernetes Service cluster with GPU nodes and connect it to Azure Machine Learning to run distributed ML training workloads. This integration provides a managed data science platform while maintaining Kubernetes flexibility under the hood, enables multi-node training that spans multiple GPUs, and bridges the gap between infrastructure and ML teams. The solution works for both new and existing clusters, supporting specialized GPU hardware and hybrid scenarios.
Why Should You Care?
Integrating Azure Kubernetes Service (AKS) clusters with GPUs into Azure Machine Learning (AML) offers several key benefits:
- Utilize existing infrastructure: Leverage your existing AKS clusters with GPUs via a managed data science platform like AML
- Flexible resource sharing: Allow both AKS workloads and AML jobs to access the same GPU resources
- Organizational alignment: Bridge the gap between infrastructure teams (who prefer AKS) and ML teams (who prefer AML)
- Hybrid scenarios: Connect on-premises GPUs to AML using Azure Arc in a similar way to this tutorial
We are looking at multi-node training because most larger training jobs require it. If you only need a single GPU or a single VM, the same setup covers that case as well.
Prerequisites
Before you begin, ensure you have:
- Azure subscription with privileges to create and manage AKS clusters and to add compute targets in AML. We recommend placing the AKS and AML resources in the same region.
- Sufficient quota for GPU compute resources. See How to Increase Quota for Specific Types of Azure Virtual Machines for how to request quota. We are using two Standard_NC8as_T4_v3 VMs, so four T4s in total; you can also opt for other GPU-enabled compute.
- Azure CLI version 2.24.0 or higher (az upgrade)
- Azure CLI k8s-extension version 1.2.3 or higher (az extension update --name k8s-extension)
- kubectl installed and updated (a quick way to verify all of these is shown right after this list)
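If you want to confirm those versions up front, a quick check along these lines will do (a small sketch; the JMESPath queries are just one way to read the version fields):
az version --query '"azure-cli"' -o tsv # should be 2.24.0 or higher
az extension show --name k8s-extension --query version -o tsv # should be 1.2.3 or higher
kubectl version --client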
Step 1: Create an AKS Cluster with GPU Nodes
For Windows users, it’s recommended to use WSL (Ubuntu 22.04 or similar).
# Login to Azure
az login
# Create resource group
az group create -n ResourceGroup -l francecentral
# Create AKS cluster with a system node
az aks create -g ResourceGroup -n MyCluster \
  --node-vm-size Standard_D16s_v5 \
  --node-count 2 \
  --enable-addons monitoring
# Get cluster credentials
az aks get-credentials -g ResourceGroup -n MyCluster
# Add GPU node pool (Spot Instances are not recommended)
az aks nodepool add \
  --resource-group ResourceGroup \
  --cluster-name MyCluster \
  --name gpupool \
  --node-count 2 \
  --vm-size Standard_NC8as_T4_v3
# Verify cluster configuration
kubectl get namespaces
kubectl get nodes
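If you want to single out the GPU nodes, you can additionally filter on the agentpool label that AKS assigns to every node (an optional check; the label value matches the node pool name chosen above):
# List only the nodes from the GPU node pool
kubectl get nodes -l agentpool=gpupool -o wide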
Step 2: Install NVIDIA Device Plugin
Next, we need to make sure that our GPUs work exactly as expected. The NVIDIA Device Plugin is a Kubernetes plugin that enables the use of NVIDIA GPUs in containers running on Kubernetes clusters. It acts as a bridge between Kubernetes and the physical GPU hardware. Create and apply the NVIDIA device plugin to enable GPU access within AKS:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
To confirm that the GPUs work as expected, follow the steps and run a test workload as described in Use GPUs on Azure Kubernetes Service (AKS) – Azure Kubernetes Service | Microsoft Learn.
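If you prefer a quicker sanity check than the full test workload, something like the following should also work (a minimal sketch; the CUDA image tag is an assumption, any CUDA base image will do):
# The GPUs should now show up under Capacity/Allocatable on the GPU nodes
kubectl describe nodes -l agentpool=gpupool | grep -i "nvidia.com/gpu"
# Run nvidia-smi in a pod that requests one GPU
cat > gpu-smoke-test.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    agentpool: gpupool
  containers:
  - name: gpu-smoke-test
    image: nvidia/cuda:12.2.0-base-ubuntu22.04 # assumed tag; any CUDA base image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl apply -f gpu-smoke-test.yaml
# Once the pod has started, the nvidia-smi output confirms GPU access
kubectl logs -f gpu-smoke-test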
Step 3: Register the KubernetesConfiguration Provider
The KubernetesConfiguration Provider enables Azure to deploy and manage extensions on Kubernetes clusters, including the Azure Machine Learning extension. Before installing extensions, ensure the required resource provider is registered:
# Install the k8s-extension Azure CLI extension
az extension add --name k8s-extension
# Check if the provider is already registered
az provider list --query "[?contains(namespace,'Microsoft.KubernetesConfiguration')]" -o table
# If not registered, register it
az provider register --namespace Microsoft.KubernetesConfiguration
az account set --subscription <your-subscription-id>
az feature registration create --namespace Microsoft.KubernetesConfiguration --name ExtensionTypes
# Check the status after a few minutes and wait until it shows Registered
az feature show --namespace Microsoft.KubernetesConfiguration --name ExtensionTypes
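# Optional sketch: poll the feature state and continue once it shows "Registered"
while [ "$(az feature show --namespace Microsoft.KubernetesConfiguration --name ExtensionTypes --query properties.state -o tsv)" != "Registered" ]; do
  echo "Waiting for feature registration..."
  sleep 30
done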
# Install the Dapr extension
az k8s-extension create --cluster-type managedClusters \
  --cluster-name MyCluster \
  --resource-group ResourceGroup \
  --name dapr \
  --extension-type Microsoft.Dapr \
  --auto-upgrade-minor-version false
You can also check out the “Before you begin” section in Install the Dapr extension for Azure Kubernetes Service (AKS) and Arc-enabled Kubernetes – Azure Kubernetes Service | Microsoft Learn.
Step 4: Deploy the Azure Machine Learning Extension
Install the AML extension on your AKS cluster for training:
az k8s-extension create \
  --name azureml-extension \
  --extension-type Microsoft.AzureML.Kubernetes \
  --config enableTraining=True enableInference=False \
  --cluster-type managedClusters \
  --cluster-name MyCluster \
  --resource-group ResourceGroup \
  --scope cluster
Several options are available for the extension installation; they are listed in Deploy Azure Machine Learning extension on Kubernetes cluster – Azure Machine Learning | Microsoft Learn.
Verify Extension Deployment
# Check the provisioning state of the extension
az k8s-extension show \
  --name azureml-extension \
  --cluster-type managedClusters \
  --cluster-name MyCluster \
  --resource-group ResourceGroup
# Check that the pods in the azureml namespace are running
kubectl get pods -n azureml
The extension is successfully deployed when provisioning state shows “Succeeded” and all pods in the “azureml” namespace are in the “Running” state.
Step 5: Create a GPU-Enabled Instance Type
By default, AML only has access to an instance type that doesn’t include GPU resources. Create a custom instance type to utilize your GPUs:
# Create a custom instance type definition
cat > t4-full-node.yaml << EOF
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: t4-full-node
spec:
  nodeSelector:
    agentpool: gpupool
    kubernetes.azure.com/accelerator: nvidia
  resources:
    limits:
      cpu: "6"
      nvidia.com/gpu: 2 # Integer value equal to the number of GPUs
      memory: "55Gi"
    requests:
      cpu: "6"
      memory: "55Gi"
EOF
# Apply the instance type
kubectl apply -f t4-full-node.yaml
This configuration creates an instance type that claims a full GPU node, including both of its T4 GPUs, making it ideal for ML training jobs.
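To double-check that the cluster accepted the custom resource, you can query it directly (an optional check; the InstanceType kind comes from the CRDs installed by the AML extension):
kubectl get instancetype t4-full-node -o yaml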
Step 6: Attach the Cluster to Azure Machine Learning
Once your instance type is created, you can attach the AKS cluster to your AML workspace:
- In the Azure Machine Learning Studio, navigate to Compute > Kubernetes clusters
- Click New and select your AKS cluster
- Specify your custom instance type (“t4-full-node”) when configuring the compute target
- Complete the attachment process following the UI workflow
Alternatively, you can use the Azure CLI or Python SDK to attach the cluster programmatically; see Attach a Kubernetes cluster to Azure Machine Learning workspace – Azure Machine Learning | Microsoft Learn.
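For reference, a CLI-based attach could look roughly like this (a sketch, assuming the ml CLI extension is installed; the workspace name, compute name, and subscription ID below are placeholders to replace with your own values):
az extension add --name ml
az ml compute attach \
  --resource-group ResourceGroup \
  --workspace-name <your-aml-workspace> \
  --type Kubernetes \
  --name k8s-gpu-compute \
  --resource-id "/subscriptions/<your-subscription-id>/resourceGroups/ResourceGroup/providers/Microsoft.ContainerService/managedClusters/MyCluster" \
  --identity-type SystemAssigned
The resource ID is simply the ARM ID of the AKS cluster created in Step 1.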
Step 7: Test Distributed Training
With your GPU-enabled AKS cluster now attached to AML, you can:
- Create an AML experiment that uses distributed training
- Specify your custom instance type in the training configuration
- Submit the job to take advantage of multi-node GPU capabilities
You can now run advanced ML workloads like distributed deep learning, which requires multiple GPUs across nodes, all managed through the AML platform.
To submit such a job, you simply need to specify the compute name, the registered instance_type, and the number of instances.
As an example, clone yuvmaz/aml_labs: Labs to showcase the capabilities of Azure ML and switch to Lab 4 – Foundations of Distributed Deep Learning. Lab 4 introduces how distributed training works in general and in AML. In the Jupyter Notebook that guides you through that tutorial, you will find that the first job definition is in simple_environment.yaml. Open this file and make the following adjustments to use the AKS compute target:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: env | sort | grep -e 'WORLD' -e 'RANK' -e 'MASTER' -e 'NODE'
environment:
  image: library/python:latest
distribution:
  type: pytorch
  process_count_per_instance: 2 # We use 2 GPUs per node, across GPUs within a node
compute: azureml:<your-compute-name> # Name of the attached Kubernetes compute target
resources:
  instance_count: 2 # We want two VMs/instances in total, across nodes
  instance_type: t4-full-node # The custom instance type registered in Step 5
display_name: simple-env-vars-display
experiment_name: distributed-training-foundations
You can proceed in the same way for all other distributed training jobs.
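Submitting the adjusted job definition from the CLI could then look like this (a sketch, assuming the ml CLI extension is installed and the workspace name is replaced with your own):
az ml job create --file simple_environment.yaml \
  --resource-group ResourceGroup \
  --workspace-name <your-aml-workspace>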
Conclusion
By integrating AKS clusters with GPUs into Azure Machine Learning, you get the best of both worlds – the container orchestration and infrastructure capabilities of Kubernetes with the ML workflow management features of AML. This setup is particularly valuable for organizations that want to:
- Maximize GPU utilization across both operational and ML workloads
- Provide data scientists with self-service access to GPU resources
- Establish a consistent ML platform that spans both cloud and on-premises resources
For production deployments, consider implementing additional security measures, networking configurations, and monitoring solutions appropriate for your organization’s requirements.
Thanks a lot to Yuval Mazor and Alan Weaver for their collaboration on this blog post.