Run OpenAI’s gpt-oss models on Azure Container Apps serverless GPUs
August 7, 2025
Just yesterday, OpenAI announced the release of gpt-oss-120b and gpt-oss-20b, two new state-of-the-art open-weight language models. These models are designed to run on lighter-weight GPU resources, making them highly accessible for developers who want to self-host powerful language models within their own environments.
If you’re looking to deploy these models in the cloud, Azure Container Apps serverless GPUs are a great option. With support for both A100 and T4 GPUs, serverless GPUs support both the gpt-oss-120b and gpt-oss-20b models, providing a cost-efficient and scalable platform with minimal infrastructure overhead.
In this blog post, we’ll walk through:
- Understanding the benefits of using serverless GPUs for open-source model hosting
- Choosing the right gpt-oss model for you
- Deploying the Ollama container on Azure Container Apps serverless GPUs
- Running OpenAI’s gpt-oss models in a scalable, cost-effective environment
Why use Azure Container Apps serverless GPUs?
Azure Container Apps is a fully managed, serverless container platform that simplifies the deployment and operation of containerized applications. With serverless GPU support, you can bring your own model containers, such as Ollama, and deploy them to GPU-backed environments that automatically scale based on demand.
Key benefits:
- Autoscaling – scale to zero when idle, scale out with usage
- Pay-per-second billing – pay only for the compute you use
- Ease of use – bring any container and run it on GPUs in the cloud with minimal setup, accelerating developer velocity
- No infrastructure management – focus on your model and app
- Enterprise-grade features – out-of-the-box support for bringing your own virtual network, managed identity, private endpoints, and more, with full data governance
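If you prefer to script your setup, serverless GPUs are attached to a Container Apps environment as consumption workload profiles. Here is a minimal Azure CLI sketch, assuming an existing environment; the resource names are placeholders, and the profile type names follow the serverless GPU workload profile types documented for Azure Container Apps:
# Attach a serverless A100 GPU workload profile to an existing environment.
# Use Consumption-GPU-NC8as-T4 instead for T4 GPUs.
az containerapp env workload-profile add \
  --resource-group my-rg \
  --name my-aca-env \
  --workload-profile-name gpu-a100 \
  --workload-profile-type Consumption-GPU-NC24-A100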
Choosing the right gpt-oss model
The gpt-oss models deliver strong performance across common language benchmarks and are optimized for different use cases:
- gpt-oss-120b is comparable to OpenAI’s o4-mini and is a powerful reasoning model suited to high-performance workloads. It fits on a single 80 GB GPU, so it can run on A100 GPUs on Azure Container Apps serverless GPUs.
- gpt-oss-20b is comparable to o3-mini and is ideal for lighter-weight small language model (SLM) apps, offering excellent performance for the cost. Because it runs within 16 GB of memory, this model can run cost-effectively on T4 GPUs or faster on A100 GPUs.
Deploy Azure Container Apps resources
- Go to the Azure Portal.
- Click Create a resource.
- Search for Azure Container Apps.
- Select Container App and Create.
- On the Basics tab, you can leave most of the defaults. The region to select depends on the gpt-oss model you want to use: to run the 120B-parameter model, select one of the A100 regions in the table below; to run the 20B model, select either a T4 or A100 region.
| Region | A100 | T4 |
| --- | --- | --- |
| West US | | Yes |
| West US 3 | Yes | Yes |
| Sweden Central | Yes | Yes |
| Australia East | Yes | Yes |
| West Europe | | Yes |
- In the Container tab, fill in the following details. The container that will be deployed runs Ollama; for more details, see the ollama/ollama image on Docker Hub.
| Field | Value |
| --- | --- |
| Image source | Docker Hub or other registries |
| Image type | Public |
| Registry login server | docker.io |
| Image and tag | ollama/ollama:latest |
| Workload profile | Consumption |
| GPU | Check the box |
| GPU type | A100 for gpt-oss:120b; T4 or A100 for gpt-oss:20b |
*By default, pay-as-you-go and EA customers have quota. If you don’t have quota for serverless GPUs in Azure Container Apps, request quota here.
- In the Ingress tab, fill in the following details:
| Field | Value |
| --- | --- |
| Ingress | Enabled |
| Ingress traffic | Accepting traffic from anywhere |
| Target port | 11434 |
- Select Review + Create at the bottom of the page, then select Create.
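If you script deployments instead of using the portal, the steps above map roughly to the following Azure CLI sketch. The resource and profile names are placeholders and assume a GPU workload profile was added to the environment as sketched earlier:
# Create the container app on the serverless GPU workload profile.
# Ingress is external on Ollama's API port, and the app scales to zero when idle.
az containerapp create \
  --resource-group my-rg \
  --environment my-aca-env \
  --name gpt-oss-ollama \
  --image docker.io/ollama/ollama:latest \
  --workload-profile-name gpu-a100 \
  --ingress external \
  --target-port 11434 \
  --min-replicas 0 \
  --max-replicas 1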
Use your gpt-oss model
- Once your deployment is complete, select Go to resource.
- Select the Application Url for your container app. This opens the running application in your browser.
- (Optional) The following steps show how to interact with the model through the Azure Container Apps console. Console commands aren’t counted as traffic for the container app, so your application may scale back in after a period of inactivity. To keep the container app active while you work through the steps, go to the Scaling blade under Application and either set the min replica count to 1 or increase the cooldown period. If you set the min replica count to 1, remember to reset it to 0 when not in use; otherwise, your app will not scale back in, and you will be billed for the duration it is active.
- In the Azure portal, select the Monitoring dropdown. Then, select Console.
- Under Choose start up command, select Connect.
- Run the below command to start Ollama (in the ollama/ollama image the server may already be running as the container’s entrypoint; if this command reports that the address is already in use, the server is up and you can skip this step):
ollama serve
- Run the below command to pull the gpt-oss model. Use 120b or 20b depending on which model you want to run (you can verify the pull with the ollama list check shown after these steps):
ollama pull gpt-oss:120b
- Run the below command to run the gpt-oss model. It may take a couple of minutes to load:
ollama run gpt-oss:120b
- Input your prompt to see the model in action:
Can you explain LLMs and the recent developments in AI the last few years like I’m five?
- Congratulations! You’ve successfully run an OpenAI gpt-oss model on Azure Container Apps serverless GPUs!
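To verify the pull mentioned above at any point, you can list the models known to the Ollama server from the same console session:
ollama list
The gpt-oss model you pulled should appear in the output along with its size.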
(Optional) Call the Ollama gpt-oss API endpoint from your local machine
The following curl commands can be used from your local machine to call the container app endpoint and interact with the Ollama gpt-oss endpoint.
- Open your local shell
- Copy your container app URL
- Run the following command to set the OLLAMA_URL environment variable
export OLLAMA_URL="{Your application URL}"
- Run the following command to prompt the gpt-oss model. This curl request has streaming set to false, so it returns the fully generated response.
curl -X POST "$OLLAMA_URL/api/generate" -H "Content-Type: application/json" -d '{
  "model": "gpt-oss:120b",
  "prompt": "Can you explain LLMs and the recent developments in AI the last few years like I am five?",
  "stream": false
}'
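Ollama also exposes a chat-style endpoint. As a sketch, the same kind of request can be sent to /api/chat with streaming enabled, which returns the response as newline-delimited JSON chunks while it is generated:
curl -X POST "$OLLAMA_URL/api/chat" -H "Content-Type: application/json" -d '{
  "model": "gpt-oss:120b",
  "messages": [{"role": "user", "content": "Can you explain LLMs like I am five?"}],
  "stream": true
}'
Set "stream" back to false if you prefer a single JSON object containing the full reply.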
Congratulations!
You have successfully run a gpt-oss model on Azure Container Apps! You can follow these same steps to run any model from Ollama’s library. In addition, Azure Container Apps is a workload-agnostic compute platform: you can bring any Linux-based container for your AI workloads and run it on serverless GPUs.
Please comment below to let us know what you think of the experience and any AI workloads you’re deploying to Azure Container Apps.
Next steps
Azure Container Apps is ephemeral by default and doesn’t include mounted storage, so pulled models and conversations are lost when a replica is recycled. To persist your data and conversations, you can add a volume mount to your Azure Container App. For steps on how to add a volume mount, follow the steps here.
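As a rough sketch of that flow with the Azure CLI (the storage account, key, and share names are placeholders), you first register an Azure Files share with the Container Apps environment, then reference it as a volume in your app’s configuration:
# Register an Azure Files share with the environment so apps can mount it.
az containerapp env storage set \
  --resource-group my-rg \
  --name my-aca-env \
  --storage-name ollama-models \
  --azure-file-account-name mystorageaccount \
  --azure-file-account-key "$STORAGE_ACCOUNT_KEY" \
  --azure-file-share-name ollama \
  --access-mode ReadWrite
You would then declare a volume backed by ollama-models in the app’s template and mount it at /root/.ollama, where Ollama stores pulled models, so your models survive restarts and scale-in.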