Microsoft's new VALL-E AI can capture your voice in 3 seconds
Microsoft researchers have presented an impressive new text-to-speech AI model, called Vall-E, which can listen to a voice for just a few seconds, then mimic that voice – including the emotional tone and acoustics – to say whatever you like.
It's the latest of many AI algorithms that can harness a recording of a person's voice and make it say words and sentences that person never spoke – and it's remarkable for just how small a scrap of audio it needs in order to extrapolate an entire human voice. Where 2017's Lyrebird algorithm from the University of Montreal, for example, needed a full minute of speech to analyze, Vall-E needs just a three-second audio snippet.
The AI has been trained on some 60,000 hours of English speech, mainly, it seems, read by audiobook narrators. The researchers have presented a swag of samples in which Vall-E attempts to puppeteer a range of human voices. Some do a pretty extraordinary job of capturing the essence of the voice and building new sentences that sound natural – you'd struggle to tell which was the real voice and which was the synthesis. In others, the only giveaway is when the AI puts the emphasis in strange places in the sentence.
Vall-E does a particularly good job of recreating the audio environment of the original sample. If the sample sounds like it was recorded over a telephone, so does the synthesis. It's pretty good with accents, too – at least, American, British and a few European-sounding accents.
In terms of emotion, the results are less impressive. Using samples of speech marked as angry, sleepy, amused or disgusted seems to send things off the rails, and the synthesis comes out sounding weirdly distorted.
The implications of this sort of tech are pretty clear. On the positive side, at some point you'll be able to have Morgan Freeman narrate your shopping list as you ride a trolley down the supermarket aisle. If an actor dies halfway through a movie, their performance can be finished through deepfaked video and audio using systems like this. Apple has recently introduced a catalog of audiobooks read to you by an AI, and it stands to reason that you'll soon be able to flip between narrators on the fly.
On the negative side, well, it's not great news for voice actors and narrators. Or indeed for listeners; AI might be able to pump out narrations quickly and extremely cheaply, but don't expect much art to it. It won't interpret Douglas Adams the way Stephen Fry does.
The potential for scam artists is also sky-high. If a scammer can get you on the phone for three seconds, they can steal your voice and call your grandma with it. Or bypass any voice-recognition security devices. This is exactly the kind of thing Terminator robots will need to make phone calls.
And of course, everyone's still waiting for the moment when the first deepfaked speech from a political figure fools enough people to undermine the very notion of believing your eyes and ears – as if objective truth wasn't already a concept under assault in this strange age.
The Microsoft Vall-E team tacks a short ethics statement on the end of its demonstration page: "The experiments in this work were carried out under the assumption that the user of the model is the target speaker and has been approved by the speaker. However, when the model is generalized to unseen speakers, relevant components should be accompanied by speech editing models, including the protocol to ensure that the speaker agrees to execute the modification and the system to detect the edited speech."
The rise of creative AIs like DALL-E, ChatGPT, various deepfake algorithms and countless others feels like it has hit an inflection point in the last few months, as these tools begin to break out of laboratories and into the real world. As with all change, this brings opportunities and risks. We truly live in interesting times.
Check out all the audio samples at the Vall-E demo page.