We are delighted to announce the availability of the Voice Conversion (VC) feature in the Azure AI Speech service, which is currently in preview.
What is Voice Conversion
Voice Conversion (also called voice changing or speech-to-speech conversion) transforms the voice characteristics of a given audio recording into those of a target speaker. The resulting audio preserves the linguistic content and prosody of the source audio, while the voice timbre sounds like the target speaker.
Below is a diagram of Voice Conversion.
The purpose of Voice Conversion
Users turn to Voice Conversion for three main reasons:
- Voice Conversion can replicate your content using a different voice identity while maintaining the original prosody and emotion. For instance, in education, teachers can record themselves reading stories, and Voice Conversion can deliver these stories using a pre-designed cartoon character’s voice. This method preserves the expressiveness of the teacher’s reading while incorporating the unique timbre of the cartoon character’s voice.
- Another application is multilingual dubbing. When localized content is read by different voices, Voice Conversion can transform the recordings into a single, uniform voice, giving a consistent experience across all languages while retaining the localized character of each reading.
- Voice Conversion enhances the control over the expressiveness of a voice. By transforming various speaking styles, such as adopting a unique tone or conveying exaggerated emotions, a voice gains greater versatility in expression and can be more dynamic in different scenarios.
Brief Introduction to Our Voice Conversion Technology
Voice Conversion is built on state-of-the-art generative models and delivers high-quality conversion. It offers the following core capabilities:
Key Capability | Description |
High Speaker Similarity | |
Prosody Preservation | |
High Audio Fidelity | |
Multilingual Support | |
Voice Conversion in Standard TTS voices
In this release, 28 standard TTS voices in the en-US locale have been enabled with Voice Conversion capabilities. These voices are available in the East US, West Europe, and Southeast Asia service regions.
Sample
Here are some of these voices with samples. The same content is delivered by both TTS and Voice Conversion, so you can listen and compare:
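Voice | Source Audio | TTS samples | Voice conversion samples |
AvaMultilingualNeural | (audio) | (audio) | (audio) |
AvaMultilingualNeural | (audio) | (audio) | (audio) |
AvaMultilingualNeural | (audio) | (audio) | (audio) |
AvaMultilingualNeural | (audio) | (audio) | (audio) |
AndrewMultilingualNeural | (audio) | (audio) | (audio) |
AndrewMultilingualNeural | (audio) | (audio) | (audio) |
AndrewMultilingualNeural | (audio) | (audio) | (audio) |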
How to Use
You can enable Voice Conversion by adding the mstts:voiceconversion tag to your SSML. The structure is nearly identical to a standard TTS request, except that you also specify a source audio URL and a target voice name.
Note: In voice conversion mode, the synthesized output follows the content and prosody of the provided source audio. Text input is therefore not required, and any text included in the SSML is ignored during rendering. In addition, all SSML elements related to prosody and pronunciation, such as <prosody> or <phoneme>, have no effect, because prosody is derived directly from the source audio.
SSML example
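As a minimal sketch of the request shape described above, the following Python builds an SSML document that wraps an mstts:voiceconversion element inside a voice element. The `url` attribute name, the full voice name `en-US-AvaMultilingualNeural`, the example audio URL, and the helper name `build_voice_conversion_ssml` are assumptions made for illustration; check the Azure AI Speech documentation for the exact schema before relying on it.

```python
import xml.etree.ElementTree as ET

# Namespace URIs commonly used in Azure SSML documents.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_voice_conversion_ssml(voice_name: str, source_audio_url: str,
                                lang: str = "en-US") -> str:
    """Return an SSML string asking for voice conversion (hypothetical helper).

    Assumption: the source audio is referenced through a `url` attribute
    on the mstts:voiceconversion element.
    """
    ET.register_namespace("", SSML_NS)
    ET.register_namespace("mstts", MSTTS_NS)

    speak = ET.Element(f"{{{SSML_NS}}}speak",
                       {"version": "1.0", f"{{{XML_NS}}}lang": lang})
    # The target voice is selected exactly as in a normal TTS request.
    voice = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice_name})
    # No text is added: in voice conversion mode, content and prosody
    # come from the source audio, and any text would be ignored.
    ET.SubElement(voice, f"{{{MSTTS_NS}}}voiceconversion",
                  {"url": source_audio_url})
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_voice_conversion_ssml(
        "en-US-AvaMultilingualNeural",          # assumed full voice name
        "https://example.com/source-audio.wav"  # placeholder URL
    ))
```

The resulting string can be sent wherever you would normally submit SSML for synthesis. Building the document with an XML library rather than string concatenation keeps the namespaces and attribute escaping consistent.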
Voice List
The following standard neural TTS voices support this feature:
- AdamMultilingualNeural
- AlloyTurboMultilingualNeural
- AmandaMultilingualNeural
- AndrewMultilingualNeural
- AvaMultilingualNeural
- BrandonMultilingualNeural
- BrianMultilingualNeural
- ChristopherMultilingualNeural
- CoraMultilingualNeural
- DavisMultilingualNeural
- DerekMultilingualNeural
- DustinMultilingualNeural
- EchoTurboMultilingualNeural
- EmmaMultilingualNeural
- EvelynMultilingualNeural
- FableTurboMultilingualNeural
- JennyMultilingualNeural
- LewisMultilingualNeural
- LolaMultilingualNeural
- NancyMultilingualNeural
- NovaTurboMultilingualNeural
- OnyxTurboMultilingualNeural
- PhoebeMultilingualNeural
- RyanMultilingualNeural
- SamuelMultilingualNeural
- SerenaMultilingualNeural
- ShimmerTurboMultilingualNeural
- SteffanMultilingualNeural
Voice Conversion in Custom Voice
Voice Conversion can also be applied to Custom Voice to enhance its expressiveness. This capability is currently in private preview. Because it requires only a small amount of target speaker data, it offers a quick route to dynamic voice customization. Customers who have built or plan to build a custom voice on Azure and have a suitable use case for Voice Conversion are invited to contact us at mstts@microsoft.com to preview this feature.
Here are some samples:
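Source Audio | Target speaker recording | Voice conversion samples |
(audio) | (audio) | (audio) |
(audio) | (audio) | (audio) |
(audio) | (audio) | (audio) |
(audio) | (audio) | (audio) |
(audio) | (audio) | (audio) |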
Benchmark Evaluation
Benchmarking plays a key role in evaluating the quality of Voice Conversion. We compared our solution against a leading Voice Conversion provider across a range of objective and subjective metrics.
Objective Evaluation
We evaluated our system and a leading Voice Conversion provider (Company A) on two language sets (English and Mandarin) using three widely accepted objective metrics:
- SIM (Speaker Similarity): measures how closely the converted voice matches the target speaker’s vocal characteristics (higher is better).
- WER (Word Error Rate): measures the intelligibility of the converted voice, based on how accurately an automatic speech recognition (ASR) system transcribes it (lower is better).
- Pitch Correlation: measures how well the pitch contour (intonation) of the converted voice aligns with the source (higher is better).
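The post does not state exactly how these metrics were computed. As a hedged illustration, the sketch below uses common textbook definitions: WER as word-level edit distance divided by the reference length, SIM as cosine similarity between two speaker-embedding vectors, and pitch correlation as the Pearson correlation between two time-aligned pitch (F0) contours. The function names and the choice of these definitions are assumptions, and extracting embeddings, transcripts, and pitch contours from audio is out of scope here.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("vectors must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / norm


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: cosine similarity of the mean-centered series."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine_similarity([v - mean_x for v in x], [v - mean_y for v in y])


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Toy "embeddings" pointing in the same direction.
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
    # Toy pitch contours (Hz) with the same shape at different registers.
    print(pearson_correlation([100, 120, 140, 130], [200, 240, 280, 260]))
```

Pearson correlation ignores constant offsets and scale, which suits a comparison between a source and a converted voice that may sit in different pitch registers while following the same intonation pattern.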
Solution | Test Set | SIM ↑ | WER ↓ | Pitch Correlation ↑ |
Ours | En-US set | 0.70 | 1.9% | 0.61 |
Company A | En-US set | 0.63 | 2.0% | 0.54 |
Ours | Zh-CN set | 0.66 | 6.94% | 0.47 |
Company A | Zh-CN set | 0.55 | 66.48% | 0.40 |
Our Voice Conversion outperforms Company A in speaker similarity and pitch correlation on both test sets. Its WER is slightly lower on English (1.9% vs. 2.0%) and much lower on Mandarin (6.94% vs. 66.48%).
Subjective Evaluation
CMOS (Comparison Mean Opinion Score) tests were conducted to assess perceptual quality. Listeners compared audio pairs and rated which sample sounded more natural. A positive score reflects a preference for one system over the other.
Test Set | CMOS (Company A vs Ours) |
En-US set | On par |
Zh-CN set | +0.75 in favor of ours |
These results show that our system achieves comparable perceptual quality in English and performs significantly better in Mandarin.
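To make the CMOS figure concrete, here is a minimal sketch of how such a score is typically aggregated. It assumes each listener rates a pair on a 7-point comparison scale from -3 to +3, where positive values mean the second system sounded more natural. The scale, the sign convention, the function name, and the ratings in the example are all hypothetical; the post does not describe the test protocol in this detail.

```python
from statistics import mean
from typing import Sequence


def cmos(ratings: Sequence[int], scale: int = 3) -> float:
    """Average pairwise comparison ratings into a single CMOS value.

    Assumption: each rating lies in [-scale, +scale]; positive values
    favor the second system of the pair, negative values the first.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(abs(r) > scale for r in ratings):
        raise ValueError(f"ratings must lie in [-{scale}, {scale}]")
    return mean(ratings)


if __name__ == "__main__":
    # Made-up ratings from four listeners for one audio pair.
    print(cmos([1, 0, 2, 0]))
```

A score near zero corresponds to an "on par" result, while a clearly positive or negative score indicates a consistent listener preference for one system.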
Conclusion
In the objective evaluation, our Voice Conversion outperforms the leading Voice Conversion provider in speaker similarity (SIM) and pitch correlation, and achieves a lower WER on both the English and Mandarin test sets.
In the subjective evaluation, our Voice Conversion is on par with the provider in English and holds a significant advantage in Mandarin, which demonstrates its strength in multilingual conversion.
Overall, these results show that our current Voice Conversion delivers state-of-the-art quality.