May 19, 2025

🔐 Azure API Management (APIM): The Gateway to Modern APIs
Azure API Management provides a robust platform to expose, manage, and secure APIs. It acts as a facade between your backend services and consumers, offering:
- Rate limiting and throttling
- Authentication and authorization
- Analytics and monitoring
- Policy-based transformations
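For example, several of these capabilities can be combined in a single policy definition. The snippet below is a minimal sketch with placeholder limits; the correlation header is a hypothetical transformation, not part of any standard configuration:

<policies>
    <inbound>
        <base />
        <!-- Throttle each subscription to 100 calls per 60 seconds (illustrative values) -->
        <rate-limit-by-key calls="100" renewal-period="60" counter-key="@(context.Subscription.Id)" />
        <!-- Example transformation: stamp a correlation header before forwarding -->
        <set-header name="x-correlation-id" exists-action="override">
            <value>@(context.RequestId.ToString())</value>
        </set-header>
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>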
🤖 Integrating Azure OpenAI with APIM
Azure OpenAI brings the power of GPT models to your enterprise applications. By exposing Azure OpenAI endpoints through APIM, you can:
- Apply rate limits and quotas to control usage
- Add authentication layers (e.g., OAuth2, subscription keys)
- Monitor usage and performance via Azure Monitor
⚡ Understanding Tokens Per Minute (TPM) Limits
Azure OpenAI enforces TPM limits to manage model usage. Each model (e.g., GPT-4, GPT-3.5) has a quota for how many tokens can be processed per minute.
📌 Best Practices
- Distribute load across multiple deployments
- Use APIM policies to throttle requests before hitting TPM limits (see the policy sketch after this list)
- Monitor usage with Azure Monitor and alerts
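For instance, the built-in azure-openai-token-limit policy can enforce throttling at the gateway before a request ever reaches the model deployment. A minimal sketch, assuming a flat per-subscription limit (the 5000 TPM figure is only illustrative):

<inbound>
    <base />
    <!-- Cap each subscription at an illustrative 5000 tokens per minute -->
    <azure-openai-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="5000"
        estimate-prompt-tokens="true"
        remaining-tokens-header-name="x-remaining-tokens" />
</inbound>

When the limit is exceeded, APIM rejects the call with a 429 response, so clients back off instead of exhausting the model deployment's quota.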
🧠 Azure OpenAI Semantic Caching: Optimize LLM Performance
Enable semantic caching of responses to Azure OpenAI API requests to reduce bandwidth and processing requirements imposed on the backend APIs and lower latency perceived by API consumers. With semantic caching, you can return cached responses for identical prompts and for prompts that are similar in meaning, even if the text isn’t the same.
🛠️ How It Works
- Generate embeddings for incoming prompts
- Compare them with cached embeddings, scoring the prompt's vector proximity to previous requests against a configurable similarity threshold
- Return cached response if similarity exceeds threshold
📈 Benefits
- Reduces costs and latency
- Improves scalability
- Enhances user experience
📷 Semantic Cache Flow
🌐 APIM Self-Hosted Gateway: Hybrid API Management
The self-hosted gateway allows you to run APIM in your own infrastructure—ideal for on-prem or hybrid cloud scenarios.
🔍 Key Features
- Same policies and configuration as cloud APIM
- Works in Kubernetes, Docker, or VMs
- Enables local traffic routing and compliance
📷 Self-Hosted Gateway Architecture
🧩 Bringing It All Together
By combining these technologies, you can build a secure, scalable, and intelligent API platform:
- Use APIM to expose and secure OpenAI endpoints
- Enforce TPM limits and throttle requests
- Implement semantic caching to reduce LLM costs
- Deploy self-hosted gateways for hybrid environments
📚 Resources
- Azure API Management Documentation
- Azure OpenAI Service
- Azure OpenAI Tokens Per Minute (TPM) Limits
- Azure OpenAI Semantic Caching
- Self-Hosted Gateway Setup
🧪 Proof of Concept (POC): Bringing the Architecture to Life
To validate the integration of Azure APIM, Azure OpenAI, semantic caching, and self-hosted gateways, here are some POC steps you can implement:
Azure resources to be deployed for this POC:
- 1 API Management service
- 1 Azure Managed Redis
- Azure OpenAI model deployments:
  - gpt-4o
  - gpt-4o-mini
  - text-embedding-3-small
- 2 Azure Virtual Machines (APIM self-hosted gateway running in a Docker container)
- 1 Log Analytics Workspace
- 1 Application Insights
✅ 1. Expose Azure OpenAI via APIM
- Create an Azure OpenAI resource and deploy a model (e.g., gpt-4o/gpt-4o-mini).
- Create an API in Azure API Management that proxies requests to the Azure OpenAI endpoint, as shown in the policy sketch below.
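A typical inbound policy for this proxy routes calls to an APIM backend that points at the Azure OpenAI endpoint and authenticates with the APIM instance's managed identity. This is a sketch; the backend name azure-openai-backend is a placeholder you would register beforehand:

<inbound>
    <base />
    <!-- Forward calls to the backend registered for the Azure OpenAI endpoint -->
    <set-backend-service backend-id="azure-openai-backend" />
    <!-- Acquire a token for Azure OpenAI using the APIM managed identity -->
    <authentication-managed-identity
        resource="https://cognitiveservices.azure.com"
        output-token-variable-name="msi-token"
        ignore-error="false" />
    <set-header name="Authorization" exists-action="override">
        <value>@("Bearer " + (string)context.Variables["msi-token"])</value>
    </set-header>
</inbound>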
✅ 2. Implement Tokens Per Minute (TPM) Throttling
- Use APIM policies to enforce TPM limits.
- Monitor usage via Azure Monitor and set alerts for threshold breaches.
Refer to the APIM policy below for the TPM setup, where the tokens-per-minute attribute is set dynamically during the API call:
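As a sketch of such a configuration, the policy below assumes the tokens-per-minute attribute accepts a policy expression and uses a hypothetical x-tpm-limit request header to carry the per-call value:

<inbound>
    <base />
    <!-- tokens-per-minute is resolved per call from a hypothetical header, defaulting to 1000 -->
    <azure-openai-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="@(int.Parse(context.Request.Headers.GetValueOrDefault("x-tpm-limit", "1000")))"
        estimate-prompt-tokens="true"
        retry-after-header-name="retry-after" />
</inbound>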
✅ 3. Integrate Azure OpenAI Semantic Caching
Pre-requisites
- An Azure Cache for Redis Enterprise or Azure Managed Redis instance. The RediSearch module must be enabled on the Redis cache.
- An Azure OpenAI embeddings model deployment, for example text-embedding-3-small
- A "backend" resource in the APIM instance that points to the embeddings deployment URL
To enable semantic caching for Azure OpenAI APIs in Azure API Management, apply the following policies: one to check the cache before sending requests (lookup) and another to store responses for future reuse (store):
In the Inbound processing section for the API, add the azure-openai-semantic-cache-lookup policy. In the embeddings-backend-id attribute, specify the Embeddings API backend you created.
<azure-openai-semantic-cache-lookup
    score-threshold="0.8"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true"
    max-message-count="10">
    <vary-by>@(context.Subscription.Id)</vary-by>
</azure-openai-semantic-cache-lookup>
In the Outbound processing section for the API, add the azure-openai-semantic-cache-store policy.
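A minimal form of the store policy, where the 60-second cache duration is just an illustrative value, looks like this:

<outbound>
    <base />
    <!-- Cache the completion for 60 seconds (illustrative duration) -->
    <azure-openai-semantic-cache-store duration="60" />
</outbound>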
✅ 4. Deploy Self-Hosted Gateway
- Provision a gateway resource in the API Management instance.
- Deploy the APIM self-hosted gateway in a Docker container or Kubernetes cluster.
- Connect it to your APIM instance using a gateway token.
- Route internal traffic through the self-hosted gateway for compliance and reduced latency.
Reference: Deploy self-hosted gateway to Docker | Microsoft Learn
✅ 5. End-to-End Testing
- Simulate user queries via Postman or a frontend app.
- Validate:
  - Response times with and without caching
  - TPM enforcement
  - Gateway routing and failover
  - Logging and analytics in Azure Monitor
🧾 Conclusion
Integrating Azure API Management with Azure OpenAI endpoints and leveraging Azure OpenAI semantic caching unlocks a powerful architecture for building intelligent, scalable, and secure APIs. By thoughtfully managing Tokens Per Minute (TPM) limits and deploying self-hosted gateways, organizations can ensure high performance, cost efficiency, and compliance across hybrid environments.
This architecture not only supports modern AI-driven applications but also provides the flexibility and control needed for enterprise-grade deployments. Whether you’re building a chatbot, a knowledge assistant, or an internal AI tool, this approach offers a robust foundation to scale responsibly and intelligently.