Microsoft recently announced VALL-E, a text-to-speech (TTS) AI model developed by its researchers that can convincingly imitate a human voice from just a 3-second clip. The company claims the model can simulate any type of voice, including its tone and emotion, and that the generated speech retains the style, charisma and intonation of the original speaker. Advances like this are making TTS systems sound more natural.
The tool was trained on a database of 60,000 hours of English speech. According to the team, that is 100 times more data than existing TTS systems use, which helps it overcome the zero-shot problem, that is, imitating a speaker it never encountered during training. There are, however, ethical concerns associated with this new technology.
To mimic a voice, the target speaker’s voice must closely match voices in the training data. The AI can then “teach” itself to sound like the desired speaker when it reads a selected passage. The most obvious benefit of such innovations is cost savings: businesses can complete the same work for far less than it would cost to hire a human to do it.
As VALL-E and similar technologies become more reliable, the voices they generate will sound increasingly realistic. Spam calls that imitate real people’s voices might sound genuine to potential victims. It is therefore important for the company to put measures in place to ensure the VALL-E model is used responsibly and not maliciously. The tool is not currently available to the general public because of safety concerns, since it could conceivably be used to generate speech for any kind of text.