VALL-E TTS model has been unveiled by Microsoft

January 11, 2023

Microsoft recently announced VALL-E, a text-to-speech (TTS) AI model developed by its researchers that can convincingly imitate a human voice from just a 3-second audio clip. The model is claimed to simulate any type of voice, including its tone and emotion, so the generated speech retains the style, charisma and intonation of the original speaker. Advances like this are making TTS systems sound more natural.

Researchers at technology major Microsoft have unveiled their latest text-to-speech (TTS) generator, VALL-E that can be trained to mimic anybody's voice in just three seconds.

Find out more at https://t.co/kcY8VlqGM0 🚀#engineering #interestingengineering pic.twitter.com/0YDMVOOXlG
— Interesting Engineering (@IntEngineering) January 11, 2023

The tool was trained on a database of 60,000 hours of English speech. According to the team, that is roughly 100 times more data than existing TTS systems use, which helps the model overcome the zero-shot problem, that is, imitating a speaker it never encountered during training. The technology does, however, raise ethical concerns.

To mimic a voice, the target speaker’s voice needs to closely match voices in the training data. The AI can then “teach” itself to sound like the desired speaker when it reads a selected passage. The most obvious benefit of such innovations is cost savings: businesses can get the same work done for far less than it would cost to hire a human to do it.

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Presents VALL-E, a LM approach for TTS that significantly outperforms the SotA zero-shot TTS in terms of speech naturalness and speaker similarity.

proj: https://t.co/wdDdBoyV3b
abs: https://t.co/S4V4iTyTs2 pic.twitter.com/EODHKJJY78
— Aran Komatsuzaki (@arankomatsuzaki) January 6, 2023

As VALL-E and similar technologies become more reliable, the voices they generate will sound increasingly realistic. Spam calls that imitate real people’s voices could then sound genuine to potential victims. It is therefore important for the company to put measures in place to ensure that the VALL-E model is used responsibly rather than maliciously. The tool is currently not available to the general public because of safety concerns, since it could conceivably be made to say any kind of text in someone’s voice.