Accelerating DeepSeek Inference with AMD MI300: A Collaborative Breakthrough
April 25, 2025
Over the past few months, we’ve been collaborating closely with AMD to deliver a new level of performance for large-scale inference—starting with the DeepSeek-R1 and DeepSeek-V3 models on Azure AI Foundry.
Through day-by-day improvements to the inference framework and its major kernels, together with shared engineering investment, we've significantly accelerated inference on AMD MI300 hardware, reaching performance competitive with traditional NVIDIA alternatives. The result? Faster output and more flexibility for Models-as-a-Service (MaaS) customers using DeepSeek models.
Why AMD MI300?
While many enterprise workloads are optimized for NVIDIA GPUs, AMD's MI300 architecture has proven to be a strong contender, especially for larger models like DeepSeek. With high VRAM capacity, high memory bandwidth, and a growing ecosystem of tooling (like SGLang), the MI300 offered us the opportunity to scale faster while keeping infrastructure costs optimized.
We initially began testing DeepSeek on MI300s with a single VM and were pleasantly surprised: early results were already comparable to NVIDIA H200s. With further tuning, including AMD's custom kernel library (AITER) and optimizations from Microsoft's Bing teams, we've exceeded the performance of H200s even without Multi-Token Prediction (MTP), making the MI300 highly viable for production-grade inference.
What We Optimized
Our work with AMD focused on:
- SGLang kernel tuning for DeepSeek, with day-by-day progress
- Design and implementation of more advanced optimizations like MTP and disaggregated prefill/decode
- Internal Bing contributions to optimize shared inference kernels
This wasn't just a one-off tuning exercise; it's an ongoing partnership. We are aiming for even greater improvements in current and future DeepSeek models, as well as many other models.
Benchmarks
AMD's recent improvements are reproducible within Microsoft.
Microsoft has worked on separate optimizations, resulting in very similar performance gains. This chart includes some early results from enabling Multi-Token Prediction; the `sglang0.4.4.post1` build is based on AMD's `rocm/sgl-dev:upstream_20250312_v1` image.
Not all of the kernel optimizations Microsoft has made for DeepSeek-R1 have been contributed back to SGLang yet. However, there is no intention to withhold them, and we are committed to collaborating with SGLang and AMD to get them upstreamed.
We’re very excited to continue working with AMD to combine our optimizations to achieve maximum throughput while prioritizing low latency.
Scaling Globally with Cost-Efficient Inference
One of the biggest wins? Hardware availability.
Because MI300s are more readily available in regions like East US and Germany Central, we were able to rapidly scale DeepSeek inference capacity—faster than if we’d waited for scarce high-end NVIDIA hardware. This flexibility allowed us to meet customer demand without compromising on performance or budget.
How to Reproduce the Benchmark
Now let's reproduce the same performance boost on your own system and apply these techniques to your application on MI300X GPUs.
The following instructions assume that you have already downloaded a model.
Note: The image provided for replicating the MI300X benchmark is a pre-upstream staging version which isn’t the same as the one shown above with Microsoft’s internal changes. The optimizations and performance enhancements in this release are expected to be included in the upcoming lmsysorg upstream production release.
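If you still need to fetch the weights, one option is the Hugging Face CLI. The snippet below is an illustrative sketch, not part of the original setup: the local path is a placeholder, and DeepSeek-R1 requires several hundred GB of free disk space.
# Optional: download the DeepSeek-R1 weights (the path below is an example; adjust to your storage)
pip install -U "huggingface_hub[cli]"
export MODEL_DIR=/data/deepseek-r1
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir "$MODEL_DIR"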
NVIDIA H200 GPU with SGLang
1. Set relevant environment variables and launch the NVIDIA SGLang container.
docker pull lmsysorg/sglang:v0.4.4.post1-cu125
export MODEL_DIR=
docker run -it \
    --ipc=host \
    --network=host \
    --privileged \
    --shm-size 32G \
    --gpus all \
    -v $MODEL_DIR:/model \
    lmsysorg/sglang:v0.4.4.post1-cu125
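Once inside the container, a quick sanity check that all GPUs are visible can catch driver or --gpus issues before you launch the server (optional):
# Inside the container: confirm the H200s are visible
nvidia-smi --query-gpu=index,name,memory.total --format=csv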
2. Start the SGLang server.
export SGL_ENABLE_JIT_DEEPGEMM=1
python3 -m sglang.launch_server \
    --model /model \
    --trust-remote-code \
    --tp 8 \
    --mem-fraction-static 0.9 \
    --enable-torch-compile \
    --torch-compile-max-bs 256 \
    --chunked-prefill-size 131072 \
    --enable-flashinfer-mla &
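The server prints "The server is fired up and ready to roll!" when it is ready. If you prefer to wait programmatically, here is a minimal sketch that assumes the default SGLang port of 30000 and that this build exposes the /health endpoint:
# Poll the server until it responds (default SGLang port is 30000)
until curl -sf http://localhost:30000/health > /dev/null; do
    sleep 10
done
echo "Server is ready"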
3. Run the SGLang benchmark serving script for the user-defined concurrency values and desired parameters.
# Run after "The server is fired up and ready to roll!"
concurrency_values=(128 64 32 16 8 4 2 1)
for concurrency in "${concurrency_values[@]}"; do
    python3 -m sglang.bench_serving \
        --dataset-name random \
        --random-range-ratio 1 \
        --num-prompt 500 \
        --random-input 3200 \
        --random-output 800 \
        --max-concurrency "${concurrency}"
done
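If you want to keep the numbers from each run for later comparison, one simple variation (not part of the original script; the log file names are illustrative) is to pipe each invocation through tee:
# Illustrative variant: save each run's output to a per-concurrency log file
for concurrency in 128 64 32 16 8 4 2 1; do
    python3 -m sglang.bench_serving \
        --dataset-name random \
        --random-range-ratio 1 \
        --num-prompt 500 \
        --random-input 3200 \
        --random-output 800 \
        --max-concurrency "${concurrency}" | tee "bench_h200_c${concurrency}.log"
done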
AMD Instinct MI300X GPU with SGLang
NOTE: These instructions apply only to rocm/sgl-dev:upstream_20250312_v1, as Microsoft's optimizations are not yet public.
1. Set relevant environment variables and launch the AMD SGLang container.
docker pull rocm/sgl-dev:upstream_20250312_v1
export MODEL_DIR=
docker run -it \
    --ipc=host \
    --network=host \
    --privileged \
    --shm-size 32G \
    --cap-add=CAP_SYS_ADMIN \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --group-add render \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    -v $MODEL_DIR:/model \
    rocm/sgl-dev:upstream_20250312_v1
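As with the NVIDIA setup, a quick check that the accelerators are visible inside the container can save debugging time later (rocm-smi should be available in the ROCm-based image):
# Inside the container: list the MI300X devices and their VRAM
rocm-smi --showproductname --showmeminfo vram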
2. Start the SGLang server.
python3 -m sglang.launch_server \
    --model /model \
    --tp 8 \
    --trust-remote-code \
    --chunked-prefill-size 131072 \
    --enable-torch-compile \
    --torch-compile-max-bs 256 &
3. Run the SGLang benchmark serving script for the user-defined concurrency values and desired parameters.
# Run after "The server is fired up and ready to roll!"
concurrency_values=(128 64 32 16 8 4 2 1)
for concurrency in "${concurrency_values[@]}"; do
    python3 -m sglang.bench_serving \
        --dataset-name random \
        --random-range-ratio 1 \
        --num-prompt 500 \
        --random-input 3200 \
        --random-output 800 \
        --max-concurrency "${concurrency}"
done
Note: Enabling torch compile results in longer graph compile time and therefore a longer server launch time.
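If you only need a quick functional check and can accept lower steady-state throughput, you can skip compilation by dropping the two torch-compile flags. For example, on the MI300X (the same change applies to the H200 command):
# Faster startup at the cost of peak throughput: launch without torch.compile
python3 -m sglang.launch_server \
    --model /model \
    --tp 8 \
    --trust-remote-code \
    --chunked-prefill-size 131072 &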
What’s Next?
We see this as the beginning of a longer-term investment in heterogeneous, cost-efficient hardware for model serving. While we’re committed to supporting a wide range of models and GPUs, the MI300 work with DeepSeek has proven that smart optimization can unlock new infrastructure choices. Further enhancements such as disaggregated decode + prefill are in the pipeline!
With continued collaboration, we plan to bring this level of performance to future models, including the newly released MAI model, which will also run on the MI300 pool. Explore the Azure AI Foundry model catalog today.