Azure AI Voice Live API: what’s new and the pricing announcement

Published by azurefeeds on July 1, 2025

Recap: What is the Voice Live API and why does it matter?

Voice is the next generation interface between humans and computers.

In the era of voice-driven technologies, creating smooth and intuitive speech-based systems has become a priority for developers. The Voice Live API simplifies the process by combining essential voice processing components into a unified interface. Whether you’re building conversational agents for customer support, automotive assistants, or educational tools, this API is designed to streamline workflows, reduce latency, and deliver high-quality, real-time voice interactions.

The Voice Live API integrates speech-to-text (STT), GenAI models, text-to-speech (TTS), avatar, and conversational enhancement features into a single interface. By eliminating the need to stitch together disparate components, the API offers an end-to-end solution for scalable voice-driven experiences.

The Voice Live API shines in scenarios where voice-driven interactions enhance user experiences. Here are some key applications:

Contact Centers: Develop dynamic voice bots for tasks such as customer support, product catalog navigation, and self-service solutions. These bots can improve operational efficiency and provide 24/7 support, reducing wait times for customers.

Automotive Assistants: Enable hands-free, in-car voice assistants for command execution, navigation assistance, and general inquiries. This supports safer driving while keeping users engaged.

Education: Create voice-enabled learning companions and virtual tutors for interactive training sessions, personalized education experiences, language learning, and skill development. Voice-based systems can make learning more engaging and accessible for students of all ages.

Public Services: Develop voice agents to assist citizens with administrative queries, public service information, appointment scheduling, and more. These agents can improve accessibility for individuals with limited digital literacy.

Human Resources: Enhance HR processes using voice-enabled tools for employee support (e.g., FAQs about benefits or policies), career development (e.g., performance feedback or skill-building recommendations), training (e.g., interactive onboarding experiences), and more. Voice-driven HR tools can streamline operations, reduce workload for HR teams, and provide employees with faster resolutions to their queries.

The Voice Live API is packed with features designed to support diverse use cases and deliver superior voice interactions. Here’s a breakdown of its key capabilities:

Broad locale coverage: Speech-to-Text (STT) supports more than 50 locales, with an option to use Azure’s multilingual model for 15 locales. Text-to-Speech (TTS) offers more than 600 out-of-the-box voices across 150+ locales, with access to 30+ highly natural conversational voices optimized with the neural HD models.

Flexible GenAI model options: The API allows you to choose from multiple AI models tailored to conversational needs, including GPT-4o, GPT-4o Mini, and Phi.

Advanced conversational enhancement features: Four features keep interactions smooth and natural. Noise Suppression reduces environmental noise, making conversations clearer even in busy settings. Echo Cancellation prevents the agent from picking up its own audio responses, avoiding feedback loops. Robust Interruption Detection accurately identifies interruptions during conversations. Advanced End-of-Turn Detection allows natural pauses without prematurely concluding interactions.

Avatar integration: Provides avatars synchronized with audio output, offering a visual identity for voice agents.

Customization: Design unique, brand-aligned voices for audio output and customized avatars to reinforce brand identity.

Integration with Foundry Agents: Give your agents built in Azure AI Foundry a voice interface.

To get started, try Voice Live in Azure AI Foundry Playground, or learn more about how to use Voice Live API.

What’s new in June

During the past few weeks, we have released a few new features for Voice Live API to address customer requests.

Support for more GenAI models

o GPT-4.1 model series: GPT-4.1, GPT-4.1 Mini, and GPT-4.1 Nano are now natively supported.

o Phi series: Phi-4 mini and Phi-4 Multimodal models are now supported.

Support for more customization capabilities

Developers need customization to manage input and output for different use cases. In June, we added more features to support speech input and output customization.

o Phrase list: Use a phrase list for lightweight, just-in-time customization of audio input. For example, define “Neo QLED TV” or “Surface Pro 12” as one phrase.

o Speaking rate control: The speaking rate parameter lets developers adjust the speaking speed of any standard Azure text-to-speech voice or custom voice.

o Custom lexicon: A custom lexicon lets developers customize pronunciation for both standard Azure text-to-speech voices and custom voices.

Learn more about how to use these features in this document.
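As a rough illustration of how the three customizations above might travel in a single session configuration message, here is a minimal Python sketch. The post does not show the request format, so the message shape and every field name (`session.update`, `input_audio_transcription`, `phrase_list`, `rate`, `custom_lexicon_url`), the voice name, and the lexicon URL are assumptions for illustration only; confirm them against the Voice Live API documentation before use.

```python
import json


def build_customization_update(phrases, speaking_rate, lexicon_url):
    """Build a session-update style message with the three June customizations.

    NOTE: every key below is an assumed field name, not confirmed by this post.
    """
    return {
        "type": "session.update",  # assumed message type
        "session": {
            # Input side: phrase list for just-in-time recognition hints.
            "input_audio_transcription": {
                "model": "azure-speech",  # assumed model identifier
                "phrase_list": list(phrases),
            },
            # Output side: speaking rate and custom lexicon on the voice.
            "voice": {
                "name": "en-US-Ava:DragonHDLatestNeural",  # example voice only
                "type": "azure-standard",  # assumed voice type label
                "rate": str(speaking_rate),
                "custom_lexicon_url": lexicon_url,
            },
        },
    }


message = build_customization_update(
    phrases=["Neo QLED TV", "Surface Pro 12"],
    speaking_rate=1.2,
    lexicon_url="https://example.com/lexicon.xml",  # placeholder URL
)
print(json.dumps(message, indent=2))
```

Keeping the payload in one builder function makes it easy to swap in the documented field names once verified, without touching the calling code.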

Azure Semantic VAD is extended to support GPT-4o-Realtime and GPT-4o-Mini-Realtime.

Azure Semantic VAD (voice activity detection) detects start and end of speech based on semantic meaning. It improves turn detection by removing filler words to reduce the false alarm rate. This feature is now extended to support Azure OpenAI GPT-4o realtime models.
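The turn detection behavior described above is typically selected in the session configuration. The sketch below shows one plausible shape for that setting. The field names `turn_detection`, `azure_semantic_vad`, and `remove_filler_words` are assumptions not stated in this post, so treat them as placeholders to check against the Voice Live API documentation.

```python
import json


def semantic_vad_update(remove_filler_words: bool = True) -> dict:
    """Return a session-update style message that selects semantic VAD.

    NOTE: the key names are assumed for illustration, not taken from this post.
    """
    return {
        "type": "session.update",  # assumed message type
        "session": {
            "turn_detection": {
                "type": "azure_semantic_vad",  # assumed detector identifier
                "remove_filler_words": remove_filler_words,  # assumed flag
            }
        },
    }


vad_message = semantic_vad_update()
print(json.dumps(vad_message, indent=2))
```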

Create Call Center Voice Agents by combining the Voice Live API and Azure Communication Services

The blog post by the Azure Communication Services team and the corresponding sample in GitHub show how you can use Azure Communication Services to access audio from live calls and connect it to the Voice Live API, building call center voice agents on Azure AI Speech’s advanced audio and voice capabilities.

Availability in more regions

More regions are now supported: West US 2, Central India, and Southeast Asia. To check the features supported in each region, go to this document.

Pricing note

Charges for the Voice Live API begin on July 1, 2025. The following pricing table shows the charges for each configuration you can choose for a voice agent application.

Tier	Category	Price (per 1M tokens)
Pro	Text	Input: $5.5 Cached Input: $2.75 Output: $22
Pro	Audio with Azure AI Speech – Standard	Input: $17 Cached Input: $2.75 Output: $38
Pro	Audio with Azure AI Speech – Custom	Output: $55
Pro	Native audio with GPT-4o-Realtime	Input: $44 Cached Input: $2.75 Output: $88
Basic	Text	Input: $0.66 Cached Input: $0.33 Output: $2.64
Basic	Audio with Azure AI Speech – Standard	Input: $15 Cached Input: $0.33 Output: $33
Basic	Audio with Azure AI Speech – Custom	Output: $50
Basic	Native audio with GPT-4o Mini-Realtime	Input: $11 Cached Input: $0.33 Output: $22
Lite	Text	Input: $0.08 Cached Input: $0.04 Output: $0.32
Lite	Audio with Azure AI Speech – Standard	Input: $13 Cached Input: $0.04 Output: $33
Lite	Audio with Azure AI Speech – Custom	Output: $50
Lite	Native audio with Phi-MM	Input: $4 Cached Input: $0.04

With Voice Live Pro, developers can choose from LLMs such as GPT-4o-Realtime, GPT-4o and GPT-4.1 models. With Voice Live Basic, developers can choose from smaller LLMs such as GPT-4o-Mini-Realtime, GPT-4o Mini and GPT-4.1 Mini models. With Voice Live Lite, developers can choose from SLMs and equivalent models such as GPT-4.1 Nano and Phi models.
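The model-to-tier mapping in the paragraph above can be written as a small lookup. The following is a minimal sketch: the tier membership is copied from that paragraph, while the function name, the name normalization, and the rule that any name starting with "phi" counts as Lite are assumptions made for illustration. The paragraph says "such as", so the lists are not exhaustive and unknown models raise an error rather than guessing.

```python
# Tier membership as listed in the post; model names are normalized to lowercase
# with hyphens so that "GPT-4o Mini" and "gpt-4o-mini" compare equal.
TIER_MODELS = {
    "pro": {"gpt-4o-realtime", "gpt-4o", "gpt-4.1"},
    "basic": {"gpt-4o-mini-realtime", "gpt-4o-mini", "gpt-4.1-mini"},
    "lite": {"gpt-4.1-nano"},
}


def tier_for_model(model: str) -> str:
    """Return the Voice Live pricing tier ("pro", "basic", "lite") for a model."""
    key = model.strip().lower().replace(" ", "-")
    # Assumption: every Phi variant is billed as Lite, per "Phi models".
    if key.startswith("phi"):
        return "lite"
    for tier, models in TIER_MODELS.items():
        if key in models:
            return tier
    raise ValueError(f"No tier listed for model: {model!r}")


for name in ["GPT-4.1", "GPT-4o Mini", "GPT-4.1 Nano", "Phi-4 mini"]:
    print(name, "->", tier_for_model(name))
```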

If you choose to use custom voice for your speech output, you will be charged separately for custom voice model training and hosting. Refer to the ‘Text to Speech – Custom Voice – Professional’ pricing for details. Custom voice is a limited access feature. Learn more about how to create custom voices.

Avatars are charged separately with the interactive avatar pricing published here.

For more details on how custom voice and avatar training are charged, refer to this pricing note.

Here are a few examples of different setups and their charges.

Scenario 1: a customer service agent built with standard Azure speech-to-text input, GPT-4.1, and custom Azure text-to-speech output, plus a custom avatar. This scenario aligns with the ‘Voice Live Pro’ category, and the charges will include:

Feature	Price (1M Tokens)
Text	Input: $5.5 Cached Input: $2.75 Output: $22
Audio with Azure AI Speech – Standard	Input: $17 Cached Input: $2.75
Audio with Azure AI Speech – Custom	Output: $55

Separate charges for custom voice and custom avatar:

Feature	Price
Custom voice – professional	Voice model training: $52 per compute hour, up to $4,992 per training; Endpoint hosting: $4.04 per model per hour
Custom avatar	Avatar model training: $15 per compute hour; Interactive avatar (real-time): $0.60 per minute; Endpoint hosting: $0.60 per model per hour
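To make the separate custom voice and custom avatar charges for Scenario 1 concrete, here is a small worked estimate in Python. The rates come from the table above; the usage figures (720 hosting hours, 1,000 interactive avatar minutes, 120 training hours) and the function names are hypothetical, and the sketch assumes the listed charges simply add up.

```python
from decimal import Decimal

# Rates from the Scenario 1 table (USD).
VOICE_TRAINING_PER_HOUR = Decimal("52")
VOICE_TRAINING_CAP = Decimal("4992")  # "up to $4,992 per training"
VOICE_HOSTING_PER_HOUR = Decimal("4.04")
AVATAR_INTERACTIVE_PER_MINUTE = Decimal("0.60")
AVATAR_HOSTING_PER_HOUR = Decimal("0.60")


def voice_training_cost(compute_hours):
    """One-time custom voice training charge, capped per training run."""
    uncapped = VOICE_TRAINING_PER_HOUR * Decimal(str(compute_hours))
    return min(uncapped, VOICE_TRAINING_CAP)


def recurring_cost(hosting_hours, avatar_minutes):
    """Hosting for one voice model and one avatar model, plus avatar usage."""
    hours = Decimal(str(hosting_hours))
    voice_hosting = VOICE_HOSTING_PER_HOUR * hours
    avatar_hosting = AVATAR_HOSTING_PER_HOUR * hours
    avatar_usage = AVATAR_INTERACTIVE_PER_MINUTE * Decimal(str(avatar_minutes))
    return voice_hosting + avatar_hosting + avatar_usage


# Hypothetical month: endpoints hosted around the clock for 30 days,
# with 1,000 minutes of interactive avatar sessions.
print("Training (120 compute hours):", voice_training_cost(120))
print("Recurring (720 h, 1,000 min):", recurring_cost(720, 1000))
```

Token charges from the Voice Live Pro rows are billed on top of these amounts.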

Scenario 2: a learning agent built with GPT-4o-Realtime native audio input and standard Azure Speech output. This scenario aligns with the ‘Voice Live Pro’ category, and the charges will include:

Feature	Price (1M Tokens)
Text	Input: $5.5 Cached Input: $2.75 Output: $22
Native audio with GPT-4o-Realtime	Input: $44 Cached Input: $2.75
Audio with Azure AI Speech – Standard	Output: $38

Scenario 3: a talent interview agent built with GPT-4o-Mini-Realtime native audio input, standard Azure Speech output, and a standard avatar. This scenario aligns with the ‘Voice Live Basic’ category, and the charges will include:

Feature	Price (1M Tokens)
Text	Input: $0.66 Cached Input: $0.33 Output: $2.64
Native audio with GPT-4o Mini-Realtime	Input: $11 Cached Input: $0.33
Audio with Azure AI Speech – Standard	Output: $33

Additional charge for the standard avatar:

Feature	Price
Text to speech avatar (standard)	Interactive avatar (real-time): $0.50 per minute

Scenario 4: an in-car assistant built with a Phi multimodal model and Azure custom voice. This scenario aligns with the ‘Voice Live Lite’ category, and the charges will include:

Feature	Price (1M Tokens)
Text	Input: $0.08 Cached Input: $0.04 Output: $0.32
Native audio with Phi-MM	Input: $4 Cached Input: $0.04
Audio with Azure AI Speech – Custom	Output: $50

Separate charges for custom voice:

Category	Price
Custom voice – professional	Voice model training: $52 per compute hour, up to $4,992 per training; Endpoint hosting: $4.04 per model per hour
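To compare the token charges of the four scenarios above side by side, the sketch below applies each scenario's per-1M-token rates to the same usage. The rates are taken from the scenario tables; the token counts are hypothetical, cached-input rates are left out for brevity, and the calculation assumes each token type is billed independently and summed.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)

# Per-1M-token rates (USD) from the four scenario tables; cached input omitted.
SCENARIO_RATES = {
    "Scenario 1 (Pro)": {"text_in": "5.5", "text_out": "22", "audio_in": "17", "audio_out": "55"},
    "Scenario 2 (Pro)": {"text_in": "5.5", "text_out": "22", "audio_in": "44", "audio_out": "38"},
    "Scenario 3 (Basic)": {"text_in": "0.66", "text_out": "2.64", "audio_in": "11", "audio_out": "33"},
    "Scenario 4 (Lite)": {"text_in": "0.08", "text_out": "0.32", "audio_in": "4", "audio_out": "50"},
}


def token_charge(rates, usage):
    """Sum rate * tokens / 1M over every token type present in `usage`."""
    total = Decimal(0)
    for kind, tokens in usage.items():
        total += Decimal(rates[kind]) * Decimal(tokens) / MILLION
    return total


# Hypothetical usage, in tokens, applied to every scenario.
usage = {"text_in": 200_000, "text_out": 50_000, "audio_in": 1_000_000, "audio_out": 1_000_000}

for name, rates in SCENARIO_RATES.items():
    print(f"{name}: ${token_charge(rates, usage)}")
```

Avatar, custom voice training, and hosting charges are separate and are not included in these totals.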

Get started

The Voice Live API is transforming how developers build speech-to-speech systems by providing an integrated, scalable, and efficient solution. By combining speech recognition, generative AI, and text-to-speech functionalities into a unified interface, it addresses the challenges of traditional implementations, enabling faster development and superior user experiences. From streamlining customer service to enhancing education and public services, the opportunities are endless. The future of voice-first solutions is here—let’s build it together!

Voice Live API introduction

Try Voice Live in Azure AI Foundry

Voice Live API documents

Voice Live Agent code sample in GitHub