At the Build conference on May 21, 2024, we announced the general availability of Personal Voice, a feature designed to empower customers to build applications where users can easily create and use their own AI voices (see the blog).
Today we’re thrilled to announce that Azure AI Speech Service has been upgraded with a new zero-shot TTS (text-to-speech) model, named “DragonV2.1Neural”. This new model delivers more natural-sounding and expressive voices, offering improved pronunciation accuracy and greater controllability compared to the earlier zero-shot TTS model.
In this blog, we’ll present the new zero-shot TTS model’s audio quality, new features, and benchmark results. We’ll also share a guide for controlling pronunciation and accent using the Personal Voice API with the new zero-shot TTS model.
Personal Voice model upgrade
The Personal Voice feature in Azure AI Speech Service empowers users to craft highly personalized synthetic voices based on their own speech characteristics. By providing a speech sample of just a few seconds as the audio prompt, users can rapidly generate an AI voice replica, which can then synthesize speech in any of the supported output languages. This capability unlocks a wide range of applications, from customizing chatbot voices to dubbing video content in an actor’s original voice across multiple languages, enabling truly immersive and individualized audio experiences.
Our earlier Personal Voice Dragon TTS model can produce speech with exceptionally realistic prosody and high-fidelity audio quality, but it still encounters pronunciation challenges, especially with complex elements such as named entities. Pronunciation control therefore remains a crucial feature for delivering accurate and natural-sounding speech synthesis.
In addition, for scenarios involving speech or video translation, it is crucial for a zero-shot TTS model to accurately produce not only different languages but also specific accents. The ability to precisely control accent ensures that speakers can deliver natural speech in any target accent.
DragonV2.1 model card
| Attribute | Details |
|---|---|
| Architecture | Transformer model |
| Highlights | Multilingual |
| Context Length | 30 seconds of audio |
| Supported Languages | 100+ Azure TTS locales |
| SSML Support | Yes |
| Latency | < 300 ms |
| RTF (Real-Time Factor) | < 0.05 |
Prosody and pronunciation improvement
Compared with our previous Dragon TTS model (“DragonV1”), the new “DragonV2.1” model improves the naturalness of speech, offering more realistic and stable prosody along with better pronunciation accuracy.
Here are a few voice samples showing the prosody improvement over DragonV1; the prompt audio is the source speech from human speakers.
| Locale | Prompt audio | DragonV1 | DragonV2.1 |
|---|---|---|---|
| En-US | (audio sample) | (audio sample) | (audio sample) |
| Zh-CN | (audio sample) | (audio sample) | (audio sample) |
The new “DragonV2.1” model also shows pronunciation improvements. We compared WER (word error rate), which measures the intelligibility of the synthesized speech as transcribed by an automatic speech recognition (ASR) system. We evaluated WER (lower is better) on all supported locales, with each locale evaluated on more than 100 test cases. The new model achieves an average relative WER reduction of 12.8% compared to DragonV1.
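For reference, WER compares the ASR transcript of the synthesized speech against the input text by counting word-level errors; the standard definition is:

$$\mathrm{WER} = \frac{S + D + I}{N}$$

where $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted words in the transcript, and $N$ is the number of words in the reference text.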
Here are a few complicated cases showing the pronunciation improvement. Compared to DragonV1, the new DragonV2.1 model reads challenging cases correctly, such as Chinese polyphones, and reproduces the en-GB accent more faithfully:
| Locale | Text | Prompt audio | DragonV1 | DragonV2.1 |
|---|---|---|---|---|
| Zh-CN | 唐朝高僧玄奘受皇帝之命,前往天竺取回真经,途中收服了四位徒弟:机智勇敢的孙悟空、好吃懒做的猪八戒、忠诚踏实的沙和尚以及白龙马。他们一路历经九九八十一难,战胜了无数妖魔鬼怪,克服重重困难。 | (audio sample) | (audio sample) | (audio sample) |
| En-GB | [En-GB accent] Tomato, potato, and basil are in the salad. | (audio sample) | (audio sample) | (audio sample) |
Pronunciation control
The “DragonV2.1” model supports pronunciation control with SSML phoneme tags; you can use the IPA phoneme tag and a custom lexicon to specify how the speech is pronounced.
In the example below, the “ipa” value is used for the alphabet attribute of the phoneme element described here. The values ph=”tə.ˈmeɪ.toʊ” or ph=”təmeɪˈtoʊ” are specified to stress the syllable meɪ in the word “tomato”.
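A minimal SSML sketch of this usage; the mstts:ttsembedding wrapper and the placeholder speakerProfileId value follow the personal voice pattern shown later in this post:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="DragonV2.1Neural">
    <mstts:ttsembedding speakerProfileId="your-speaker-profile-id">
      <!-- alphabet="ipa" with a ph value that stresses the syllable "meɪ" -->
      <phoneme alphabet="ipa" ph="tə.ˈmeɪ.toʊ">tomato</phoneme>
    </mstts:ttsembedding>
  </voice>
</speak>
```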
You can define how single entities (such as a company name, a medical term, or an emoji) are read in SSML by using the phoneme element. To define how multiple entities are read, create an XML-structured custom lexicon file, as sketched below. Then upload the custom lexicon XML file and reference it with the SSML lexicon element.
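A sketch of such a custom lexicon file, following the W3C Pronunciation Lexicon Specification (PLS) format used by Azure custom lexicons; the two entries are illustrative and match the sample sentence below:

```xml
<?xml version="1.0" encoding="utf-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <!-- Expand an abbreviation to its spoken form -->
  <lexeme>
    <grapheme>BTW</grapheme>
    <alias>By the way</alias>
  </lexeme>
  <!-- Pin the pronunciation of a named entity with IPA -->
  <lexeme>
    <grapheme>Benigni</grapheme>
    <phoneme>bɛˈniːnji</phoneme>
  </lexeme>
</lexicon>
```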
After you publish your custom lexicon, you can reference it from your SSML. The following SSML example references a custom lexicon that was uploaded to https://www.example.com/customlexicon.xml.
BTW, we will be there probably at 8:00 tomorrow morning. Could you help leave a message to Robert Benigni for me?
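A minimal sketch of that SSML, assuming the same ttsembedding wrapper as the other examples in this post:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="DragonV2.1Neural">
    <mstts:ttsembedding speakerProfileId="your-speaker-profile-id">
      <!-- Pronunciations for "BTW" and "Benigni" come from the lexicon -->
      <lexicon uri="https://www.example.com/customlexicon.xml"/>
      BTW, we will be there probably at 8:00 tomorrow morning.
      Could you help leave a message to Robert Benigni for me?
    </mstts:ttsembedding>
  </voice>
</speak>
```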
Language and accent control
You can use the lang element to adjust the speaking language and accent for your voice, such as setting en-GB for British English. For information about the supported languages, and for a table showing the syntax and attribute definitions, see the lang element documentation. Using this element is recommended for better pronunciation accuracy.
The following example uses the lang element to set the accent to British English:
Tomato, potato, and basil are in the salad.
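A minimal sketch, again assuming the mstts:ttsembedding wrapper used in the other examples in this post:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="DragonV2.1Neural">
    <mstts:ttsembedding speakerProfileId="your-speaker-profile-id">
      <!-- Speak this sentence with a British English accent -->
      <lang xml:lang="en-GB">
        Tomato, potato, and basil are in the salad.
      </lang>
    </mstts:ttsembedding>
  </voice>
</speak>
```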
Benchmark evaluation
Benchmarking plays a key role in evaluating the performance of zero-shot TTS models. In this work, we compared our system with other top zero-shot text-to-speech providers, Company A and Company B, for English, and with Company A for Mandarin. This assessment allowed us to measure performance across both languages using a widely accepted subjective metric:
MOS (Mean Opinion Score) tests were conducted to assess perceptual quality. Listeners rated each audio sample carefully on four aspects: overall impression, naturalness, conversational quality, and audio quality. Each judge gives a score from 1 to 5 on each aspect; we show the average score below.
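In other words, assuming all aspects are weighted equally, the reported score for a system is simply the mean over judges, utterances, and aspects:

$$\mathrm{MOS} = \frac{1}{J \cdot U \cdot A}\sum_{j=1}^{J}\sum_{u=1}^{U}\sum_{a=1}^{A} s_{j,u,a},\qquad s_{j,u,a}\in\{1,\dots,5\}$$

where $J$ is the number of judges, $U$ the number of utterances, and $A = 4$ the number of rated aspects.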
English set:
Chinese set:
These results show that our zero-shot TTS model performs slightly better than Companies A and B on English (a score gap greater than 0.05) and on par with Company A on Mandarin.
Quick trial with prebuilt voice profiles
To facilitate testing of the new DragonV2.1 model, several prebuilt voice profiles have been made available. Built from a brief audio prompt for each voice and synthesized with the new zero-shot model, these prebuilt profiles offer more expressive prosody, high audio fidelity, and a natural tone while preserving each original voice persona. You can explore these profiles firsthand to experience the enhanced quality of our new model, without creating your own custom profiles.
We provide six prebuilt profiles: Andrew, Ava, Brian, Emma, Adam, and Jenny.
To use these prebuilt profiles for output, assign the appropriate profile name to the “speaker” attribute of the mstts:ttsembedding tag, as shown below.
I’m happy to hear that you find me amazing and that I have made your trip planning easier and more fun.
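A minimal sketch, assuming the “speaker” attribute described above with the prebuilt profile “Ava”:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="DragonV2.1Neural">
    <!-- "speaker" selects a prebuilt profile instead of a custom profile ID -->
    <mstts:ttsembedding speaker="Ava">
      I'm happy to hear that you find me amazing and that I have made
      your trip planning easier and more fun.
    </mstts:ttsembedding>
  </voice>
</speak>
```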
Here are DragonV2.1 audio samples of these prebuilt profiles.
| Profile name | DragonV2.1 |
|---|---|
| Ava | (audio sample) |
| Andrew | (audio sample) |
| Brian | (audio sample) |
| Emma | (audio sample) |
| Adam | (audio sample) |
| Jenny | (audio sample) |
Customer use case
This advanced, high-fidelity model can be used to enable dubbing scenarios, allowing video content to be voiced in the original actor’s tone and style across multiple languages. The new Personal Voice model has been integrated into Azure AI video translation, aiming to empower creators of short dramas to reach global markets effortlessly.
TopShort and JOWO.ai, a next-generation short-drama creator and a translation provider respectively, have partnered with the Azure Video Translation Service to deliver one-click AI translation.
Check out the demo from TopShort. More videos are available in this channel, owned by JOWO.ai.
Get started
The new zero-shot TTS model will be available in mid-August and will be exposed in the BaseModels_List operation of the custom voice API.
When the new model name “DragonV2.1Neural” appears in the base models list, follow these steps: register your use case and apply for access, create a speaker profile ID, and use the voice name “DragonV2.1Neural” to synthesize speech in any of the 100+ supported languages.
Below is an SSML example using DragonV2.1Neural to generate speech for your personal voice in different languages. More details are provided here.
I’m happy to hear that you find me amazing and that I have made your trip planning easier and more fun.
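A sketch of that SSML, following the personal voice pattern; the speakerProfileId value is a placeholder for the profile ID you created, and additional output languages can be produced by nesting lang elements as shown earlier:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="DragonV2.1Neural">
    <!-- speakerProfileId is the profile ID created via the custom voice API -->
    <mstts:ttsembedding speakerProfileId="your-speaker-profile-id">
      I'm happy to hear that you find me amazing and that I have made
      your trip planning easier and more fun.
    </mstts:ttsembedding>
  </voice>
</speak>
```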
Building personal voices responsibly
All customers must agree to our usage policies, which include requiring explicit consent from the original speaker, disclosing the synthetic nature of the content created, and prohibiting impersonation of any person or deceiving people using the personal voice service. The full code of conduct guides integrations of synthetic speech and personal voice to ensure consistency with our commitment to responsible AI.
Watermarks are automatically added to the speech output generated with personal voices. As the personal voice feature enters general availability, we have updated the watermark technology with enhanced robustness and stronger capabilities for identifying watermark existence. To measure the robustness of the new watermark, we evaluated the accuracy of watermark detection on audio samples generated using personal voice. Our results showed an average accuracy rate higher than 99.7% for detecting the existence of watermarks across various audio editing scenarios. This improvement provides stronger mitigations against potential misuse.
Try the personal voice feature on Speech Studio as a test, or apply for full access to the API for business use.
In addition to creating a personal voice, eligible customers can create a brand voice for their business with Custom Voice’s professional voice fine-tuning feature. Azure AI Speech also offers over 600 neural voices covering more than 150 languages and locales. With these pre-built Text-to-Speech voices, you can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots to provide a richer conversational experience to your users.