
July 1, 2025
At the //Build conference in May 2025, we announced the public preview of the Azure AI Voice Live API (Breakout Session 144). Today we are excited to share some updates to this API and the latest pricing.
Recap: What is the Voice Live API and why does it matter?
Voice is the next generation interface between humans and computers.
In the era of voice-driven technologies, creating smooth and intuitive speech-based systems has become a priority for developers. The Voice Live API simplifies the process by combining essential voice processing components into a unified interface. Whether you’re building conversational agents for customer support, automotive assistants, or educational tools, this API is designed to streamline workflows, reduce latency, and deliver high-quality, real-time voice interactions.
The Voice Live API integrates speech-to-text (STT), GenAI models, text-to-speech (TTS), avatar, and conversational enhancement features into a single interface. By eliminating the need to stitch together disparate components, the API offers an end-to-end solution for scalable voice-driven experiences.
The Voice Live API shines in scenarios where voice-driven interactions enhance user experiences. Here are some key applications:
- Contact Centers: Develop dynamic voice bots for tasks such as customer support, product catalog navigation, and self-service solutions. These bots can improve operational efficiency and provide 24/7 support, reducing wait times for customers.
- Automotive Assistants: Enable hands-free, in-car voice assistants for command execution, navigation assistance, and general inquiries. This ensures safer driving experiences while keeping users engaged.
- Education: Create voice-enabled learning companions and virtual tutors for interactive training sessions, personalized education experiences, language learning, and skill development. Voice-based systems can make learning more engaging and accessible for students of all ages.
- Public Services: Develop voice agents to assist citizens with administrative queries, public service information, appointment scheduling, and more. These agents can improve accessibility for individuals with limited digital literacy.
- Human Resources: Enhance HR processes using voice-enabled tools for employee support (e.g., FAQs about benefits or policies), career development (e.g., performance feedback or skill-building recommendations), training (e.g., interactive onboarding experiences), and more. Voice-driven HR tools can streamline operations, reduce workload for HR teams, and provide employees with faster resolutions to their queries.
The Voice Live API is packed with features designed to support diverse use cases and deliver superior voice interactions. Here’s a breakdown of its key capabilities:
- Broad locale coverage: Speech-to-Text (STT) supports more than 50 locales, with an option to use Azure’s multilingual model for 15 locales. Text-to-Speech (TTS) offers more than 600 out-of-the-box voices across 150+ locales, with access to 30+ highly natural conversational voices optimized with the neural HD models.
- Flexible GenAI model options: The API allows you to choose from multiple AI models tailored to conversational needs, including GPT-4o, GPT-4o Mini, and Phi.
- Advanced conversational enhancement features: Ensure smooth and natural interactions with:
o Noise Suppression: Reduces environmental noise, making conversations clearer even in busy settings.
o Echo Cancellation: Prevents the agent from picking up its own audio responses, avoiding feedback loops.
o Robust Interruption Detection: Accurately identifies interruptions during conversations.
o Advanced End-of-Turn Detection: Allows natural pauses without prematurely concluding interactions.
- Avatar integration: Provides avatars synchronized with audio output, offering a visual identity for voice agents.
- Customization: Design unique, brand-aligned voices for audio output and customized avatars to reinforce brand identity.
- Integration with Foundry Agents: Give your agents built in Azure AI Foundry a voice interface.
To get started, try Voice Live in Azure AI Foundry Playground, or learn more about how to use Voice Live API.
What’s new in June
During the past few weeks, we have released a few new features for Voice Live API to address customer requests.
- Support more GenAI models
o GPT-4.1 model series: GPT-4.1, GPT-4.1 Mini, and GPT-4.1 Nano are now natively supported.
o Phi series: Phi-4 mini and Phi-4 Multimodal models are now supported.
- Support more customization capabilities
Developers need customization to manage input and output for different use cases. In June, we added more features to support speech input and output customizations.
o Phrase list: Use a phrase list for lightweight, just-in-time customization of audio input. For example, define “Neo QLED TV” or “Surface Pro 12” as a single phrase so that it is recognized as a unit.
o Speaking rate control: The speaking rate parameter lets developers adjust the speaking speed of any standard Azure text-to-speech voice or custom voice.
o Custom lexicon: A custom lexicon lets developers customize pronunciation for both standard Azure text-to-speech voices and custom voices.
Learn more about how to use these features in this document.
- Azure Semantic VAD is extended to support GPT-4o-Realtime and GPT-4o-Mini-Realtime.
Azure Semantic VAD (voice activity detection) detects start and end of speech based on semantic meaning. It improves turn detection by removing filler words to reduce the false alarm rate. This feature is now extended to support Azure OpenAI GPT-4o realtime models.
- Create Call Center Voice Agents by combining the Voice Live API and Azure Communication Services
The blog post by the Azure Communication Services team and the corresponding sample on GitHub show how to use Azure Communication Services to access audio from live calls and connect it to the Voice Live API. The result is a Call Center Voice Agent that draws on Azure AI Speech’s advanced audio and voice capabilities.
- Availability in more regions
Newly supported regions: West US 2, Central India, and Southeast Asia. To check the features supported in each region, go to this document.
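To make the June customization features more concrete, here is a minimal sketch of how a session configuration carrying a phrase list, a speaking rate, a custom lexicon, and Azure Semantic VAD might be assembled. Every field name in it (`session.update`, `phrase_list`, `rate`, `custom_lexicon_url`, `azure_semantic_vad`, `remove_filler_words`), the voice name, and the lexicon URL are assumptions made for illustration; check them against the Voice Live API reference before relying on them.

```python
import json


def build_session_update(phrases, speaking_rate, lexicon_url):
    """Build a session-configuration message as a plain dict.

    All field names below are assumptions for illustration only,
    not a confirmed Voice Live API schema.
    """
    return {
        "type": "session.update",  # assumed event name
        "session": {
            # Phrase list: treat each entry as one phrase on audio input.
            "input_audio_transcription": {
                "model": "azure-speech",  # assumed model identifier
                "phrase_list": list(phrases),
            },
            # Azure Semantic VAD: detect start and end of speech by meaning.
            "turn_detection": {
                "type": "azure_semantic_vad",
                "remove_filler_words": True,
            },
            # Output voice with speaking rate and custom lexicon.
            "voice": {
                "name": "en-US-AvaNeural",  # placeholder voice name
                "type": "azure-standard",
                "rate": str(speaking_rate),
                "custom_lexicon_url": lexicon_url,
            },
        },
    }


message = build_session_update(
    phrases=["Neo QLED TV", "Surface Pro 12"],
    speaking_rate=1.2,
    lexicon_url="https://example.com/lexicon.xml",  # placeholder URL
)
print(json.dumps(message, indent=2))
```

Building the message as a plain dictionary keeps the sketch independent of any particular client library; the serialized JSON could then be sent over whatever connection the application already uses.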
Pricing note
Charges for the Voice Live API begin on July 1, 2025. The following pricing table shows the charges for each configuration you can choose for a voice agent application.
| Category | Feature | Price (per 1M tokens) |
| --- | --- | --- |
| Pro | Text | Input: $5.5; Cached Input: $2.75 |
| Pro | Audio with Azure AI Speech – Standard | Input: $17 |
| Pro | Audio with Azure AI Speech – Custom | Output: $55 |
| Pro | Native audio with GPT-4o-Realtime | Input: $44; Cached Input: $2.75 |
| Basic | Text | Input: $0.66 |
| Basic | Audio with Azure AI Speech – Standard | Input: $15 |
| Basic | Audio with Azure AI Speech – Custom | Output: $50 |
| Basic | Native audio with GPT-4o Mini-Realtime | Input: $11 |
| Lite | Text | Input: $0.08 |
| Lite | Audio with Azure AI Speech – Standard | Input: $13; Cached Input: $0.04 |
| Lite | Audio with Azure AI Speech – Custom | Output: $50 |
| Lite | Native audio with Phi-MM | Input: $4 |
With Voice Live Pro, developers can choose from LLMs such as GPT-4o-Realtime, GPT-4o and GPT-4.1 models. With Voice Live Basic, developers can choose from smaller LLMs such as GPT-4o-Mini-Realtime, GPT-4o Mini and GPT-4.1 Mini models. With Voice Live Lite, developers can choose from SLMs and equivalent models such as GPT-4.1 Nano and Phi models.
If you choose to use custom voice for your speech output, you will be charged separately for custom voice model training and hosting. Refer to the ‘Text to Speech – Custom Voice – Professional’ pricing for details. Custom voice is a limited access feature. Learn more about how to create custom voices.
Avatars are charged separately with the interactive avatar pricing published here.
For more details on how custom voice and avatar training are charged, refer to this pricing note.
Here are a few examples of different setups and their charges.
Scenario 1: a customer service agent built with standard Azure speech-to-text input, GPT-4.1, and custom Azure text-to-speech output, plus a custom avatar. This scenario falls under the ‘Voice Live Pro’ category, and the charges will include:
| Feature | Price (per 1M tokens) |
| --- | --- |
| Text | Input: $5.5; Cached Input: $2.75 |
| Audio with Azure AI Speech – Standard | Input: $17 |
| Audio with Azure AI Speech – Custom | Output: $55 |
Separate charges for custom voice and custom avatar:
| Feature | Price |
| --- | --- |
| Custom voice – professional | Voice model training: $52 per compute hour, up to $4,992 per training; Endpoint hosting: $4.04 per model per hour |
| Custom avatar | Avatar model training: $15 per compute hour; Endpoint hosting: $0.60 per model per hour |
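As a rough illustration of how the separate custom voice charges add up, the sketch below applies the rates quoted above. It assumes that "up to $4,992 per training" is a cap on a single training run and that hosting is billed for every hour the endpoint is deployed. The function names and the example hours are hypothetical.

```python
from decimal import Decimal

# Rates quoted for Custom voice – professional.
VOICE_TRAINING_RATE = Decimal("52")    # USD per compute hour
VOICE_TRAINING_CAP = Decimal("4992")   # USD per training (assumed to be a cap)
VOICE_HOSTING_RATE = Decimal("4.04")   # USD per model per hour


def voice_training_cost(compute_hours):
    """Cost of one training run; pass an int or Decimal number of hours."""
    return min(VOICE_TRAINING_RATE * compute_hours, VOICE_TRAINING_CAP)


def voice_hosting_cost(hours, models=1):
    """Cost of hosting `models` endpoints for `hours` hours each."""
    return VOICE_HOSTING_RATE * hours * models


print(voice_training_cost(40))       # below the cap: 52 * 40
print(voice_training_cost(120))      # above the cap: limited to 4992
print(voice_hosting_cost(24 * 30))   # one model hosted for a 30-day month
```

Under these assumptions the cap is reached at 96 compute hours (52 × 96 = 4,992), so longer training runs cost no more than that amount. `Decimal` is used instead of floats to avoid rounding drift in currency arithmetic.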
Scenario 2: a learning agent built with GPT-4o-Realtime native audio input and standard Azure Speech output. This scenario falls under ‘Voice Live Pro’, and the charges will include:
| Feature | Price (per 1M tokens) |
| --- | --- |
| Text | Input: $5.5; Cached Input: $2.75 |
| Native audio with GPT-4o-Realtime | Input: $44; Cached Input: $2.75 |
| Audio with Azure AI Speech – Standard | Output: $38 |
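To show how per-million-token prices translate into a bill, here is a minimal estimator that uses the Scenario 2 rates listed above. The meter names and the token counts are hypothetical, chosen only to illustrate the arithmetic. Actual usage depends on the conversation, and any meters not shown in the table are not covered.

```python
from decimal import Decimal

# Scenario 2 rates in USD per 1M tokens, copied from the table above.
# The dictionary keys are hypothetical labels, not official meter names.
RATES_PER_MILLION = {
    "text_input": Decimal("5.5"),
    "text_cached_input": Decimal("2.75"),
    "native_audio_input": Decimal("44"),
    "native_audio_cached_input": Decimal("2.75"),
    "speech_standard_output": Decimal("38"),
}

ONE_MILLION = Decimal("1000000")


def estimate_cost(token_usage):
    """Sum rate * tokens / 1M over every meter in `token_usage`."""
    total = Decimal("0")
    for meter, tokens in token_usage.items():
        total += RATES_PER_MILLION[meter] * tokens / ONE_MILLION
    return total


# Hypothetical monthly usage for a small learning agent.
usage = {
    "text_input": 200_000,
    "native_audio_input": 500_000,
    "speech_standard_output": 300_000,
}

cost = estimate_cost(usage)
print(f"Estimated token charges: ${cost:.2f}")
```

With these example counts the charges are 1.10 for text input, 22.00 for native audio input, and 11.40 for speech output. The same function can be reused for the other scenarios by swapping in their rates.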
Scenario 3: a talent interview agent built with GPT-4o-Mini-Realtime native audio input, standard Azure Speech output, and a standard avatar. This scenario falls under ‘Voice Live Basic’, and the charges will include:
| Feature | Price (per 1M tokens) |
| --- | --- |
| Text | Input: $0.66 |
| Native audio with GPT-4o Mini-Realtime | Input: $11 |
| Audio with Azure AI Speech – Standard | Output: $33 |
An additional charge applies for the standard avatar:
| Feature | Price |
| --- | --- |
| Text to speech avatar (standard) | Interactive avatar (real-time): $0.50 per minute |
Scenario 4: an in-car assistant built with the Phi multimodal model and an Azure custom voice. This scenario falls under ‘Voice Live Lite’, and the charges will include:
| Feature | Price (per 1M tokens) |
| --- | --- |
| Text | Input: $0.08 |
| Native audio with Phi-MM | Input: $4 |
| Audio with Azure AI Speech – Custom | Output: $50 |
Separate charges for custom voice:
| Category | Price |
| --- | --- |
| Custom voice – professional | Voice model training: $52 per compute hour, up to $4,992 per training; Endpoint hosting: $4.04 per model per hour |
Get started
The Voice Live API is transforming how developers build speech-to-speech systems by providing an integrated, scalable, and efficient solution. By combining speech recognition, generative AI, and text-to-speech functionalities into a unified interface, it addresses the challenges of traditional implementations, enabling faster development and superior user experiences. From streamlining customer service to enhancing education and public services, the opportunities are endless. The future of voice-first solutions is here—let’s build it together!
Try Voice Live in Azure AI Foundry