Run OpenAI’s gpt-oss models on Azure Container Apps serverless GPUs
August 7, 2025
Just yesterday, OpenAI announced the release of gpt-oss-120b and gpt-oss-20b, two new state-of-the-art open-weight language models. These models are designed to run on lighter-weight GPU resources, making them highly accessible for developers who want to self-host powerful language models within their own environments.
If you’re looking to deploy these models in the cloud, Azure Container Apps serverless GPUs are a great option. With support for both A100 and T4 GPUs, serverless GPUs support both the gpt-oss-120b and gpt-oss-20b models, providing a cost-efficient and scalable platform with minimal infrastructure overhead.
In this blog post, we’ll walk through:
- Understanding the benefits of using serverless GPUs for open-source model hosting
- Choosing the right gpt-oss model for you
- Deploying the Ollama container on Azure Container Apps serverless GPUs
- Running OpenAI’s gpt-oss models in a scalable, cost-effective environment
Why use Azure Container Apps serverless GPUs?
Azure Container Apps is a fully managed, serverless container platform that simplifies the deployment and operation of containerized applications. With serverless GPU support, you can bring your own model containers, such as Ollama, and deploy them to GPU-backed environments that automatically scale based on demand.
Key benefits:
- Autoscaling – scale to zero when idle, scale out with usage
- Pay-per-second billing – pay only for the compute you use
- Ease of use – bring any container and run it on GPUs in the cloud with minimal setup, accelerating developer velocity
- No infrastructure management – focus on your model and app
- Enterprise-grade features – out-of-the-box support for bringing your own virtual network, managed identity, private endpoints, and more, with full data governance
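If you prefer to script your setup, serverless GPUs are attached to a Container Apps environment as consumption workload profiles. Here is a minimal Azure CLI sketch, assuming an existing environment; the resource names are placeholders, and the profile type names follow the serverless GPU workload profile types documented for Azure Container Apps:
# Attach a serverless A100 GPU workload profile to an existing environment.
# Use Consumption-GPU-NC8as-T4 instead for T4 GPUs.
az containerapp env workload-profile add \
  --resource-group my-rg \
  --name my-aca-env \
  --workload-profile-name gpu-a100 \
  --workload-profile-type Consumption-GPU-NC24-A100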
Choosing the right gpt-oss model
The gpt-oss models deliver strong performance across common language benchmarks and are optimized for different use cases:
- gpt-oss-120b is comparable to OpenAI’s o4-mini and is a powerful reasoning model suited to high-performance workloads. It fits on a single 80 GB GPU, so it can run on A100 GPUs on Azure Container Apps serverless GPUs.
- gpt-oss-20b is comparable to o3-mini and is ideal for lighter-weight small language model (SLM) apps, offering excellent performance for the cost. Because it runs within 16 GB of memory, this model can run cost-effectively on T4 GPUs or faster on A100 GPUs.
Deploy Azure Container Apps resources
- Go to the Azure Portal.
- Click Create a resource.
- Search for Azure Container Apps.
- Select Container App and Create.
- On the Basics tab, you can leave most of the defaults. The region to select depends on the gpt-oss model you want to use: to run the 120B-parameter model, select one of the A100 regions in the table below; to run the 20B model, select either a T4 or A100 region.
| Region | A100 | T4 |
| --- | --- | --- |
| West US | | Yes |
| West US 3 | Yes | Yes |
| Sweden Central | Yes | Yes |
| Australia East | Yes | Yes |
| West Europe | | Yes |
- In the Container tab, fill in the following details. The container that will be deployed runs Ollama; for more details, see the ollama/ollama image on Docker Hub.
| Field | Value |
| --- | --- |
| Image source | Docker Hub or other registries |
| Image type | Public |
| Registry login server | docker.io |
| Image and tag | ollama/ollama:latest |
| Workload profile | Consumption |
| GPU | Check the box |
| GPU type | A100 for gpt-oss:120b; T4 or A100 for gpt-oss:20b |
*By default, pay-as-you-go and EA customers have quota. If you don’t have quota for serverless GPUs in Azure Container Apps, request quota here.
- In the Ingress tab, fill in the following details:
| Field | Value |
| --- | --- |
| Ingress | Enabled |
| Ingress traffic | Accepting traffic from anywhere |
| Target port | 11434 |
- Select Review + Create at the bottom of the page, then select Create.
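If you script deployments instead of using the portal, the steps above map roughly to the following Azure CLI sketch. The resource and profile names are placeholders and assume a GPU workload profile was added to the environment as sketched earlier:
# Create the container app on the serverless GPU workload profile.
# Ingress is external on Ollama's API port, and the app scales to zero when idle.
az containerapp create \
  --resource-group my-rg \
  --environment my-aca-env \
  --name gpt-oss-ollama \
  --image docker.io/ollama/ollama:latest \
  --workload-profile-name gpu-a100 \
  --ingress external \
  --target-port 11434 \
  --min-replicas 0 \
  --max-replicas 1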
Use your gpt-oss model
- Once your deployment is complete, select Go to resource.
- Select the Application Url for your container app. This opens the running application in your browser.
- (Optional) The following steps show how to interact with the model through the Azure Container Apps console. Console commands aren’t counted as traffic for the container app, so your application may scale back in after a period of inactivity. To keep the container app active while you work through the steps, go to the Scaling blade under Application and either set the min replica count to 1 or increase the cooldown period. If you set the min replica count to 1, remember to reset it to 0 when not in use; otherwise, your app will not scale back in, and you will be billed for the duration it is active.
- In the Azure portal, select the Monitoring dropdown. Then, select Console.
- Under Choose start up command, select Connect.
- Run the below command to start Ollama (in the ollama/ollama image the server may already be running as the container’s entrypoint; if this command reports that the address is already in use, the server is up and you can skip this step):
ollama serve
- Run the below command to pull the gpt-oss model. Use 120b or 20b depending on which model you want to run (you can verify the pull with the ollama list check shown after these steps):
ollama pull gpt-oss:120b
- Run the below command to run the gpt-oss model. It may take a couple of minutes to load:
ollama run gpt-oss:120b
- Input your prompt to see the model in action:
Can you explain LLMs and the recent developments in AI the last few years like I’m five?
- Congratulations! You’ve successfully run an OpenAI gpt-oss model on Azure Container Apps serverless GPUs!
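To verify the pull mentioned above at any point, you can list the models known to the Ollama server from the same console session:
ollama list
The gpt-oss model you pulled should appear in the output along with its size.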
(Optional) Call the Ollama gpt-oss API endpoint from your local machine
The following curl commands can be used from your local machine to call the container app endpoint and interact with the Ollama gpt-oss endpoint.
- Open your local shell
- Copy your container app URL
- Run the following command to set the OLLAMA_URL environment variable
export OLLAMA_URL="{Your application URL}"
- Run the following command to prompt the gpt-oss model. This curl request has streaming set to false, so it returns the fully generated response.
curl -X POST "$OLLAMA_URL/api/generate" -H "Content-Type: application/json" -d '{
  "model": "gpt-oss:120b",
  "prompt": "Can you explain LLMs and the recent developments in AI the last few years like I am five?",
  "stream": false
}'
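Ollama also exposes a chat-style endpoint. As a sketch, the same kind of request can be sent to /api/chat with streaming enabled, which returns the response as newline-delimited JSON chunks while it is generated:
curl -X POST "$OLLAMA_URL/api/chat" -H "Content-Type: application/json" -d '{
  "model": "gpt-oss:120b",
  "messages": [{"role": "user", "content": "Can you explain LLMs like I am five?"}],
  "stream": true
}'
Set "stream" back to false if you prefer a single JSON object containing the full reply.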
Congratulations!
You have successfully run a gpt-oss model on Azure Container Apps! You can follow these same steps to run any model from Ollama’s library. In addition, Azure Container Apps is a workload-agnostic compute platform: you can bring any Linux-based container for your AI workloads and run it on serverless GPUs.
Please comment below to let us know what you think of the experience and any AI workloads you’re deploying to Azure Container Apps.
Next steps
Azure Container Apps is ephemeral by default and doesn’t include mounted storage, so pulled models and conversations are lost when a replica is recycled. To persist your data and conversations, you can add a volume mount to your Azure Container App. For steps on how to add a volume mount, follow the steps here.
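As a rough sketch of that flow with the Azure CLI (the storage account, key, and share names are placeholders), you first register an Azure Files share with the Container Apps environment, then reference it as a volume in your app’s configuration:
# Register an Azure Files share with the environment so apps can mount it.
az containerapp env storage set \
  --resource-group my-rg \
  --name my-aca-env \
  --storage-name ollama-models \
  --azure-file-account-name mystorageaccount \
  --azure-file-account-key "$STORAGE_ACCOUNT_KEY" \
  --azure-file-share-name ollama \
  --access-mode ReadWrite
You would then declare a volume backed by ollama-models in the app’s template and mount it at /root/.ollama, where Ollama stores pulled models, so your models survive restarts and scale-in.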