
Microsoft AI creates scary real talkie videos from a single photo

The VASA-1 AI model can generate realistic talking head video footage from a single reference photo, which is lip-synced to an audio track
The VASA-1 AI model is able to generate scary real video that not only lip-syncs to a supplied voice audio track, but also includes facial expressions and natural head movements – all from a single static head shot

Microsoft Research Asia has revealed an AI model that can generate frighteningly realistic deepfake videos from a single still image and an audio track. How will we be able to trust what we see and hear online from here on in?

As we noted earlier, artificial intelligence systems have bested us across key benchmarks over the past few years, and already have many folks very worried about being prematurely put out to pasture and replaced by algorithms.

We've recently witnessed fairly limited smart gadgets transformed into powerful everyday assistants and vital productivity tools. And then there are models that can generate realistic sound effects for silent video clips, and even create stunning footage from text prompts. Microsoft's VASA-1 framework seems like another huge leap.

After training the model on footage of around 6,000 real-life talking faces from the VoxCeleb2 dataset, the technology is able to generate scary real video where the newly animated subject is not only able to accurately lip-sync to a supplied voice audio track, but also sports varied facial expressions and natural head movements – all from a single static headshot photo.

It's quite similar to the Audio2Video Diffusion Model from Alibaba's Institute for Intelligent Computing that emerged a couple of months back, but even more photorealistic and accurate. VASA-1 is reportedly capable of generating synced videos at 512x512 pixels at 40 frames per second, "with negligible starting latency."


Though all of the reference photos used for the project demos were themselves AI-generated by StyleGAN2 or DALL-E, there is one stand-out real-world example used to show off the framework's prowess for stepping outside of its training set – a rapping Mona Lisa!

The project page has many examples of talking and singing videos generated from a still image and matched to an audio track, but the tool also has optional controls to set "facial dynamics and head poses" such as emotions, expressions, distance from the virtual videocam and gaze direction. Powerful stuff.

"The emergence of AI-generated talking faces offers a window into a future where technology amplifies the richness of human-human and human-AI interactions," reads the introduction to a paper detailing the achievement. "Such technology holds the promise of enriching digital communication, increasing accessibility for those with communicative impairments, transforming education methods with interactive AI tutoring, and providing therapeutic support and social interaction in healthcare."

All very laudable, but the researchers also acknowledge the potential for misuse. It already feels like an impossible task to separate fact from outright fabrication when digesting our daily dose of online news. Now imagine having a tool at your disposal that could make pretty much anyone appear to say whatever you want them to say.

That could range from harmlessly pranking a relative with a FaceTime call from a favorite Hollywood actor or pop star to implicating an innocent person in a serious crime by posting an online confession, scamming someone out of money by taking on the persona of a treasured grandchild in trouble, or having key politicians voice support for controversial agendas. Realistically and convincingly.

However, content generated by the VASA-1 model does "contain identifiable artifacts" and the researchers don't intend to make the platform publicly available "until we are certain that the technology will be used responsibly and in accordance with proper regulations."

A paper detailing the project has been published on the arXiv server.

Source: Microsoft Research

6 comments
FatherBartholomew
If I were conspiratorially minded, I would wonder if MS is working on getting rid of the artifacts to make it undetectable.
Paul, you are a youthful person. I cut my teeth on an IBM 1130 with 4K of hardwired LSI core memory chips and its associated punch card reader. It is true that I was in high school at the time.
Spud Murphy
Damn, checked out their website, and the human race is in deep trouble, there is no way to tell if you are looking at a real human or not now. They say they are not planning to release this, but that's unlikely, given the money they could make. And even if they don't, someone else will replicate the system.
solas
This article poses a question without giving an answer: how does one distinguish deep fake vs real? This requires authentication-at-record: easily done with current tech and cannot be forged, but distributing will take some time
dave be
We already realize people we know lie to us all the time. Why should it be different with people we don't know, whether those people are real or manufactured? Everything we take in has to be verified, first by our own logic and then with other sources.
Steve7734
I agree with dave be, it's about time everyone realised they cannot trust what they're told without corroborating the information from other sources. That would be a huge benefit to society. It would make it more difficult for random folk to pretend to be knowledgeable on a subject, and it might also encourage us to connect with real people face to face. Until robots become indistinguishable from humans ... but I don't think we have to worry about that just yet.
ArdisLille
I truly believe AI will take us down. I'm decorating my hand-basket now--it's a 2-seater.