Microsoft Research Asia has revealed an AI model that can generate frighteningly realistic deepfake videos from a single still image and an audio track. How will we be able to trust what we see and hear online from here on in?
As we noted earlier, artificial intelligence systems have bested us across key benchmarks over the past few years, and already have many folks very worried about being prematurely put out to pasture and replaced by algorithms.
We've recently witnessed fairly limited smart gadgets transformed into powerful everyday assistants and vital productivity tools. And then there are models that can generate realistic sound effects for silent video clips, and even create stunning footage from text prompts. Microsoft's VASA-1 framework seems like another huge leap.
After training the model on footage of around 6,000 real-life talking faces from the VoxCeleb2 dataset, the technology is able to generate scarily real video in which the newly animated subject not only lip-syncs accurately to a supplied voice audio track, but also sports varied facial expressions and natural head movements – all from a single static headshot photo.
It's quite similar to the Audio2Video Diffusion Model from Alibaba's Institute for Intelligent Computing that emerged a couple of months back, but even more photorealistic and accurate. VASA-1 is reportedly capable of generating synced videos at 512x512 pixels and up to 40 frames per second, "with negligible starting latency."
Though all of the reference photos used for the project demos were themselves AI-generated by StyleGAN2 or DALL-E, there is one stand-out real-world example used to show off the framework's prowess for stepping outside of its training set – a rapping Mona Lisa!
The project page has many examples of talking and singing videos generated from a still image and matched to an audio track, but the tool also has optional controls to set "facial dynamics and head poses" such as emotions, expressions, distance from the virtual videocam and gaze direction. Powerful stuff.
"The emergence of AI-generated talking faces offers a window into a future where technology amplifies the richness of human-human and human-AI interactions," reads the introduction to a paper detailing the achievement. "Such technology holds the promise of enriching digital communication, increasing accessibility for those with communicative impairments, transforming education methods with interactive AI tutoring, and providing therapeutic support and social interaction in healthcare."
All very laudable, but the researchers also acknowledge the potential for misuse. Though it already feels like an impossible task to separate fact from outright fabrication when digesting our daily dose of online news, imagine having a tool at your disposal that could make pretty much anyone appear to say whatever you want them to say.
That could range from harmlessly pranking a relative with a FaceTime call from a favorite Hollywood actor or pop star, to implicating an innocent person in a serious crime by posting an online confession, scamming someone out of money by taking on the persona of a treasured grandchild in trouble, having key politicians voice support for controversial agendas, and so on. Realistically and convincingly.
However, content generated by the VASA-1 model does "contain identifiable artifacts" and the researchers don't intend to make the platform publicly available "until we are certain that the technology will be used responsibly and in accordance with proper regulations."
A paper detailing the project has been published on the arXiv server.
Source: Microsoft Research