DeepEP
DeepEP is a high-performance communication library developed by DeepSeek AI to optimize Mixture-of-Experts (MoE) and expert parallelism (EP) in large-scale AI models. It provides high-throughput, low-latency all-to-all GPU kernels for MoE dispatch and combine operations, which are critical for efficiently routing data between expert modules during training and inference. DeepEP includes specialized kernels for asymmetric-domain bandwidth forwarding—such as transfers between NVLink and RDMA/InfiniBand domains—and requires only 20 Streaming Multiprocessors (SMs) to saturate both. Tokens are first transmitted via IB to GPUs with matching in-node indices, then forwarded via NVLink to target experts, fully overlapping both communication paths. It leverages NVSHMEM for efficient one-sided communication, enabling low-latency data movement without host involvement. With its network-aware design and deep integration with MoE algorithms, DeepEP is a foundational component for scalable, high-performance expert model training and inference.
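For intuition only, the sketch below uses torch.distributed.all_to_all_single to show the generic token exchange an MoE layer would otherwise perform for dispatch; DeepEP's dispatch and combine kernels implement this same routing pattern with fused, NVLink/RDMA-aware transfers instead of a plain collective. The helper name and the assumption that tokens are pre-grouped by destination rank are illustrative, not part of DeepEP's API.

```python
import torch
import torch.distributed as dist

def naive_moe_dispatch(tokens: torch.Tensor, send_counts: list):
    """Plain all-to-all dispatch: `tokens` holds rows already grouped by
    destination rank, and send_counts[r] is how many rows go to rank r."""
    # 1) Exchange per-rank token counts so each rank can size its receive buffer.
    send_counts_t = torch.tensor(send_counts, device=tokens.device)
    recv_counts_t = torch.empty_like(send_counts_t)
    dist.all_to_all_single(recv_counts_t, send_counts_t)
    recv_counts = recv_counts_t.tolist()

    # 2) Exchange the tokens themselves; this is the traffic DeepEP's dispatch
    #    kernels send IB-first and then forward over NVLink to the target experts.
    recv_tokens = tokens.new_empty((sum(recv_counts), tokens.shape[1]))
    dist.all_to_all_single(recv_tokens, tokens,
                           output_split_sizes=recv_counts,
                           input_split_sizes=send_counts)
    return recv_tokens, recv_counts
```

The combine step is simply the inverse exchange, returning expert outputs to the ranks that own the original tokens.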
The Importance of NUMA Affinity
NUMA affinity refers to how well a process or thread is aligned with the memory and hardware resources—such as CPUs, GPUs, or NICs—within a Non-Uniform Memory Access (NUMA) system. In a NUMA architecture, the system’s memory is divided among multiple nodes (often corresponding to CPU sockets), and each node can access its local memory faster than the memory attached to other nodes. NUMA affinity is about ensuring that a process runs on a CPU (or accesses a device) that is physically close to the memory or network resources it needs, minimizing latency and maximizing bandwidth.
NUMA affinity is particularly critical in multi-GPU and multi-node systems where GPUs communicate with each other or with the network through NICs. If a GPU is not NUMA-affined to the NIC it uses, data may be routed across additional interconnects like PCIe switches or CPU sockets, increasing communication latency and reducing throughput. By maintaining proper NUMA affinity—ensuring, for example, that a GPU communicates through a NIC on the same NUMA node—systems can achieve significantly better performance, especially in communication-heavy workloads like distributed deep learning, MoE expert dispatch, or all-to-all collective operations.
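As a minimal illustration of enforcing such affinity at the process level, the sketch below pins the current process to the cores of a single NUMA node using only the standard library (Linux only). The core ranges assume the two-NUMA-node, 96-core layout shown in the next section; they are not universal.

```python
import os

# Assumed layout (matches the lscpu output shown below): two NUMA nodes,
# cores 0-47 on node 0 and cores 48-95 on node 1.
NUMA_CORES = {0: range(0, 48), 1: range(48, 96)}

def pin_to_numa_node(node: int) -> None:
    # PID 0 means "the calling process"; afterwards the scheduler will only
    # place this process on cores local to the chosen NUMA node.
    os.sched_setaffinity(0, set(NUMA_CORES[node]))

pin_to_numa_node(0)
print("Allowed cores:", sorted(os.sched_getaffinity(0)))
```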
NVIDIA DGX H100 system topology (Courtesy: https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html)
Affinity Considerations on Azure NDv5 VMs (H100)
The lscpu command can be used to get information about the mapping between NUMA nodes and CPU cores. The excerpt below is from the output of lscpu on an NVIDIA DGX H100 system, showing that the system has two NUMA nodes: cores 0–47 belong to NUMA node 0, and cores 48–95 belong to NUMA node 1.
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-47
NUMA node1 CPU(s): 48-95
In addition, the lstopo command, together with the PCI bus IDs of the GPUs and HCA (Host Channel Adapter) cards, can be used to find the mapping between NUMA nodes, CPU cores, GPUs, and HCAs.
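If doing this by hand is tedious, a script along the lines of the sketch below can recover the same mapping from sysfs. It assumes a Linux host with nvidia-smi on the PATH and InfiniBand HCAs exposed under /sys/class/infiniband; the helper names are illustrative.

```python
import glob
import os
import subprocess

def gpu_numa_nodes():
    # nvidia-smi reports bus IDs like "00000000:0B:00.0"; sysfs uses a
    # lowercase 4-hex-digit domain, e.g. "0000:0b:00.0".
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"],
        text=True)
    mapping = {}
    for line in out.strip().splitlines():
        idx, bus_id = [part.strip() for part in line.split(",")]
        domain, rest = bus_id.split(":", 1)
        sysfs_id = f"{int(domain, 16):04x}:{rest.lower()}"
        with open(f"/sys/bus/pci/devices/{sysfs_id}/numa_node") as f:
            mapping[f"GPU{idx}"] = int(f.read())
    return mapping

def hca_numa_nodes():
    mapping = {}
    for dev in sorted(glob.glob("/sys/class/infiniband/*")):
        with open(os.path.join(dev, "device", "numa_node")) as f:
            mapping[os.path.basename(dev)] = int(f.read())
    return mapping

print(gpu_numa_nodes())   # e.g. {'GPU0': 0, ..., 'GPU7': 1}
print(hca_numa_nodes())   # e.g. {'mlx5_ib0': 0, ..., 'mlx5_ib7': 1}
```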
Affinity-Aware Code Adjustments to Boost DeepEP Performance
To improve DeepEP performance on Azure, we introduced code changes that explicitly bind each process to the right set of cores, GPU, and HCA, ensuring alignment with the system's NUMA topology. These modifications reduce cross-NUMA communication overhead and improve data locality, which is crucial for communication-heavy workloads like expert parallelism.
For this, we integrated the libnuma library using ctypes to enable memory binding to specific NUMA nodes, ensuring that memory allocations are local to the process's assigned CPU cores. We also used the psutil library to explicitly set CPU affinity, binding each process to a distinct range of cores based on its rank. This reduces cross-node traffic and improves cache locality. As mentioned earlier, the NVIDIA DGX H100 system has two NUMA nodes with 48 cores each. With 8 processes per node, we can therefore assign 12 cores to each process. These settings are applied early in the init_dist() function, ensuring that compute and communication operations benefit from optimal CPU and memory placement.
diff --git a/tests/utils.py b/tests/utils.py
index a574366..fffa905 100644
--- a/tests/utils.py
+++ b/tests/utils.py
@@ -1,10 +1,34 @@
 import os
 import sys
+import psutil
 import numpy as np
 import torch
 import torch.distributed as dist
 from typing import Optional
-
+import ctypes
+
+# Load libnuma
+libnuma = ctypes.CDLL("libnuma.so")
+libnuma.numa_available.restype = ctypes.c_int
+libnuma.numa_run_on_node.argtypes = [ctypes.c_int]
+libnuma.numa_set_preferred.argtypes = [ctypes.c_int]
+
+def set_numa_affinity(rank):
+    cores_per_rank = 12
+    numa_node = rank // 4
+    core_start = rank * cores_per_rank
+    core_end = core_start + cores_per_rank
+    p = psutil.Process(os.getpid())
+    p.cpu_affinity(list(range(core_start, core_end)))
+    print(f"Rank {rank} numa node {numa_node} bound to cores {core_start}-{core_end - 1}")
+
+    # Bind memory to NUMA node
+    if libnuma.numa_available() != -1:
+        libnuma.numa_set_preferred(numa_node)
+        print(f"Rank {rank}: CPU affinity → cores {core_start}-{core_end - 1}, memory NUMA → node {numa_node}")
+    else:
+        print(f"Rank {rank}: libnuma not available")
+

 def init_dist(local_rank: int, num_local_ranks: int):
     # NOTES: you may rewrite this function with your own cluster settings
@@ -20,8 +44,10 @@ def init_dist(local_rank: int, num_local_ranks: int):
         world_size=num_nodes * num_local_ranks,
         rank=node_rank * num_local_ranks + local_rank
     )
+    set_numa_affinity(local_rank)
     torch.set_default_dtype(torch.bfloat16)
     torch.set_default_device('cuda')
+
     torch.cuda.set_device(local_rank)
     return dist.get_rank(), dist.get_world_size(), dist.new_group(list(range(num_local_ranks * num_nodes)))
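For illustration only, the arithmetic inside set_numa_affinity can be checked in isolation: with 12 cores per rank and 4 ranks per NUMA node, local ranks 0–3 land on NUMA node 0 (cores 0–47) and ranks 4–7 on NUMA node 1 (cores 48–95), matching the lscpu output shown earlier.

```python
# Illustration: reproduce the rank -> (NUMA node, core range) mapping used by
# set_numa_affinity for 8 local ranks on a 2-NUMA-node, 96-core host.
cores_per_rank = 12
ranks_per_numa = 4  # 8 local ranks spread across 2 NUMA nodes

for rank in range(8):
    numa_node = rank // ranks_per_numa
    core_start = rank * cores_per_rank
    core_end = core_start + cores_per_rank - 1
    print(f"rank {rank}: NUMA node {numa_node}, cores {core_start}-{core_end}")
# rank 0: NUMA node 0, cores 0-11
# ...
# rank 7: NUMA node 1, cores 84-95
```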
Additionally, as noted earlier, DeepEP leverages NVSHMEM for inter-GPU communication. To ensure each process uses the correct set of Host Channel Adapters (HCAs), we set the NVSHMEM_HCA_LIST environment variable with a comma-separated list of HCAs. For this setting to take effect, the NVSHMEM_ENABLE_NIC_PE_MAPPING variable must also be set to 1.
diff --git a/deep_ep/buffer.py b/deep_ep/buffer.py
index feeb386..d81130e 100644
--- a/deep_ep/buffer.py
+++ b/deep_ep/buffer.py
@@ -72,6 +72,8 @@ class Buffer:
             os.environ['NVSHMEM_IB_ENABLE_IBGDA'] = '1'
             os.environ['NVSHMEM_IBGDA_NIC_HANDLER'] = 'gpu'
             os.environ['NVSHMEM_IBGDA_NUM_RC_PER_PE'] = f'{num_qps_per_rank}'
+            os.environ['NVSHMEM_ENABLE_NIC_PE_MAPPING'] = '1'
+            os.environ['NVSHMEM_HCA_LIST'] = 'mlx5_ib0:1,mlx5_ib1:1,mlx5_ib2:1,mlx5_ib3:1,mlx5_ib4:1,mlx5_ib5:1,mlx5_ib6:1,mlx5_ib7:1'
             # Make sure QP depth is always larger than the number of on-flight WRs, so that we can skip WQ slot check
             os.environ['NVSHMEM_QP_DEPTH'] = '1024'
             # NOTES: NVSHMEM initialization requires at least 256 MiB
Performance Experiments
After applying the above changes, we measured the following performance numbers for test_internode.py on two Standard_ND96isr_H100_v5 VMs with 8 processes per node (16 processes in total). This benchmark evaluates the performance of dispatch and combine operations in a multi-node setting, where intranode (NVLink) communication is overlapped with internode (RDMA) communication. Note that the benchmark reports algorithm bandwidth, so the total time of both communication and computation is included in the results. The results on Standard_ND96isr_H100_v5 VMs show that we reach and even exceed the performance claimed in the DeepEP repository.
| Item | Best Reported RDMA BW | Best Reported NVL BW |
| --- | --- | --- |
| Dispatch (FP8) | 45.9 GB/s | 149.82 GB/s |
| Dispatch (BF16) | 60.32 GB/s | 196.89 GB/s |
| Combine | 61.34 GB/s | 200.22 GB/s |
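As a reminder of how to read these numbers, algorithm bandwidth divides the total payload moved by the end-to-end time of the operation, so any computation overlapped with the communication is included in the cost. The snippet below is an illustration of that calculation, not code from the benchmark.

```python
def algo_bandwidth_gb_s(num_bytes: int, elapsed_seconds: float) -> float:
    # Total payload divided by end-to-end wall-clock time of the operation,
    # so overlapped computation is counted as part of the cost.
    return num_bytes / elapsed_seconds / 1e9

# Example: moving 256 MiB of tokens in 5 ms corresponds to ~53.7 GB/s.
print(algo_bandwidth_gb_s(256 * 2**20, 5e-3))
```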