New deepfake algorithm allows you to text-edit the words of a speaker in a video

New deepfake software lets you add, edit or delete words from the transcript of a video, and the changes are reflected seamlessly in the video
How the algorithm works

It is now possible to take a talking-head style video, and add, delete or edit the speaker's words as simply as you'd edit text in a word processor. A new deepfake algorithm can process the audio and video into a new file in which the speaker says more or less whatever you want them to.

It's the work of a collaborative team from Stanford University, Max Planck Institute for Informatics, Princeton University and Adobe Research, who say that in a perfect world the technology would be used to cut down on expensive re-shoots when an actor gets something wrong, or a script needs to be changed.

In order to learn the face movements of a speaker, the algorithm requires about 40 minutes of training video and a transcript of what's being said, so it's not something that can simply be thrown at a short video snippet if you want good results. Those 40 minutes of video give the algorithm the chance to work out exactly what face shape the subject makes for each phoneme in the original script.

From there, once you edit the script, the algorithm can create a 3D model of the face making the new shapes required. A machine learning technique called neural rendering then paints the 3D model over with photo-realistic textures to make it look all but indistinguishable from the real thing.
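To make that two-step process concrete, here's a heavily simplified Python sketch. The phoneme labels, viseme names and lookup table are all invented for illustration – the real system learns these mappings from roughly 40 minutes of footage and drives a full 3D face model, not a dictionary.

```python
# Hypothetical sketch of the pipeline described above. All phoneme and
# viseme labels here are invented for illustration only.

# Step 1 (training): pair each phoneme seen in the transcript with the
# mouth shape ("viseme") observed in the matching video frames.
learned_visemes = {
    "HH": "open",        # as in the start of "hello"
    "EH": "mid-open",
    "L":  "tongue-up",
    "OW": "rounded",
}

def visemes_for(phonemes):
    """Step 2 (editing): map an edited phoneme sequence to the face
    shapes the 3D model must produce; phonemes never seen in training
    fall back to a neutral shape."""
    return [learned_visemes.get(p, "neutral") for p in phonemes]

# An edited word ("hello") becomes a sequence of target mouth shapes,
# which the 3D face model animates and neural rendering then textures.
print(visemes_for(["HH", "EH", "L", "OW"]))
```

In the real system, of course, each viseme is a continuous 3D face pose rather than a label, and the transitions between them are modeled too.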


Other software, such as VoCo, can be used if you wish to generate the speaker's audio as well as the video. It takes the same approach, breaking a heap of training audio down into phonemes and then using that dataset to generate new words in a familiar voice.
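That phoneme-stitching idea can be sketched in the same spirit – a minimal Python illustration, with an invented snippet store standing in for VoCo's actual voice model:

```python
# Hedged sketch of the phoneme-stitching idea: break training audio into
# phoneme-labelled snippets, then assemble new words from those snippets.
# The snippet filenames and phoneme labels are invented for illustration.

snippet_store = {
    "N": "n.wav",
    "UW1": "oo.wav",
    "Z": "z.wav",
}

def synthesize(phonemes):
    """Concatenate stored snippets for the requested phonemes. A real
    system would also smooth the joins and match pitch and prosody."""
    missing = [p for p in phonemes if p not in snippet_store]
    if missing:
        raise ValueError(f"no training audio for phonemes: {missing}")
    return [snippet_store[p] for p in phonemes]

# The word "news" (N UW1 Z) is assembled from the speaker's own
# recorded sounds, producing a new word in a familiar voice.
print(synthesize(["N", "UW1", "Z"]))
```

The error path matters: a speaker's training audio can only yield words built from sounds they actually made, which is one reason the system needs so much footage.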

The team is aware of the potential its software has for unethical uses. The world has yet to be hit by its first great deepfake scandal – perhaps we'll see deepfakes becoming part of the battleground of the 2020 US elections – but it's easy to imagine them being incredibly effective tools of deception in front of an uneducated audience.

It's even more worrisome to realize that their mere existence will allow dishonest public figures to deny, or cast doubt on, genuine videos that show them in a bad light. As soon as a couple of decent-sized deepfake scandals have made it past the editors at CNN and been exposed, we'll be entering an era where people cannot – or, more to the point, will not – trust what they've seen in any video format.

The research team behind this software makes some feeble attempts to deal with its potential misuse, proffering a solution in which anyone who uses the software can optionally watermark the resulting video as a fake and provide "a full ledger of edits." This is obviously no barrier to misuse.

The team also suggests that other researchers develop "better forensics such as digital or non-digital fingerprinting techniques to determine whether a video has been manipulated for ulterior purposes." Indeed, there's some potential for blockchain-style permanent records to be used here, which would allow any piece of video to be compared back to its point of origin. But this isn't in place yet, and it's unclear how it could be implemented globally.

On the non-fingerprinting side of things, a great deal of deep learning research is already aimed at the problem of spotting fakes. Indeed, the generative adversarial network approach pits two networks against each other – one generating fake after fake, and another trying to pick the fakes out from real inputs. Over millions of generations, the discriminating network gets better at picking fakes, and the better it gets, the better the fake-generating network has to become to fool it.
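That adversarial arms race can be illustrated with a deliberately toy Python model – no neural networks at all, just a one-number "generator" and a threshold "discriminator", each adapting to the other in turn:

```python
# Toy illustration of the adversarial dynamic described above (not a
# real GAN): a "generator" emits a single value, a "discriminator" is a
# midpoint threshold, and each round both adapt to the other.

REAL_VALUE = 1.0      # stand-in for the distribution of real inputs
fake_value = 0.0      # the generator's initial, obviously fake output
history = []

for round_num in range(20):
    # Discriminator: place the decision threshold halfway between the
    # current fake output and the real data.
    threshold = (fake_value + REAL_VALUE) / 2
    # Generator: nudge the output toward the discriminator's threshold,
    # closing half the remaining gap each round.
    fake_value += 0.5 * (threshold - fake_value)
    history.append(fake_value)

# Each round the fakes creep closer to the real data, forcing the
# discriminator's threshold to move too – the arms race in miniature.
print(round(history[0], 3), round(history[-1], 3))
```

In a real GAN both sides are neural networks updated by gradient descent, but the feedback loop – each improvement by one side forcing the other to improve – is the same.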

So the better these systems get at automatically spotting fake videos, the better the fakes will become. Thus, the CNNs of this world will not be able to rely on a simple algorithm that lets them auto-scan incoming video looking for deepfakes. It's a complex and serious problem, and it's virtually guaranteed to have a major impact on news reporting over the coming decades, even if it's in an embryonic form right now.

The video below shows how easy it is to edit video using the new algorithm.

Source: Stanford University

Text-based Editing of Talking-head Video (SIGGRAPH 2019)

Mark K.
Ohh boy.
The genie won't be going back into the bottle. Fortunately the article explains the technology fairly well because the video's audio track is nearly incomprehensible.
I'm normally all for technological progress but this one should definitely not happen. The risk of abuse greatly outweighs the benefits. Currently, in this era of fake news, the main weapon against politicians going against their word is video testimony. With this tool it would be easy to allege a genuine video has been faked. In short, it would allow corrupt officials to cast doubt on anything that has been recorded as evidence. Stanford, do the right thing and reassess the need for this.
Quite frightening
There's not much anyone can do about it. At least they told us it's here. I think it's a blessing. Otherwise, we might not have known this was possible before it was too late to stop a catastrophe. Imagine the panic there would be if we didn't know this software existed and someone created a deepfake video showing the head of Iran telling others there were nuclear bombs planted in major cities throughout the US or Europe, ready to blow or set to go off at a certain date. Can you imagine the chaos that would cause, even if only 20% of the population believed it? That's just one example I can think of. No, I'm glad we know they can do this. The cat's out of the bag and I'm glad. And I promise you, if they didn't invent it, someone else would. The question now is going to be whether you can trust ANYTHING on video anymore. This is going to be big and will just add more fuel to the fake news that's already out there, at the cost of real news. We won't know which is which.