New deepfake algorithm allows you to text-edit the words of a speaker in a video
It is now possible to take a talking-head style video, and add, delete or edit the speaker's words as simply as you'd edit text in a word processor. A new deepfake algorithm can process the audio and video into a new file in which the speaker says more or less whatever you want them to.
It's the work of a collaborative team from Stanford University, Max Planck Institute for Informatics, Princeton University and Adobe Research, who say that in a perfect world the technology would be used to cut down on expensive re-shoots when an actor gets something wrong, or a script needs to be changed.
In order to learn the face movements of a speaker, the algorithm requires about 40 minutes of training video and a transcript of what's being said, so it's not something that can be thrown at a short video snippet if you want good results. That 40 minutes of video gives the algorithm the chance to work out exactly what face shapes the subject makes for each phoneme in the original script.
From there, once you edit the script, the algorithm can create a 3D model of the face making the new shapes required. A machine learning technique called neural rendering then paints the 3D model with photo-realistic textures, making it look all but indistinguishable from the real thing.
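As a rough illustration of that first mapping step (this is not the authors' code, and the table values and function names here are invented), the learned association can be thought of as a lookup from phonemes to face-model parameters, with blending between neighboring shapes to keep the motion smooth:

```python
# Illustrative sketch only: a hypothetical table mapping phonemes to the
# 3D face-model parameters ("visemes") learned from ~40 minutes of
# training video. Real models use many more parameters per phoneme.
viseme_table = {
    "HH": [0.12, 0.40], "EH": [0.55, 0.10],
    "L":  [0.30, 0.25], "OW": [0.70, 0.60],
}

def phonemes_to_face_params(phonemes):
    """Map an edited phoneme sequence to per-frame face parameters,
    linearly blending between neighboring visemes for smoothness."""
    frames = []
    for a, b in zip(phonemes, phonemes[1:] + phonemes[-1:]):
        pa, pb = viseme_table[a], viseme_table[b]
        for t in (0.0, 0.5):  # two interpolated frames per transition
            frames.append([x * (1 - t) + y * t for x, y in zip(pa, pb)])
    return frames

# An edited script, reduced to phonemes, becomes a face-shape sequence
# that the renderer can then texture frame by frame.
params = phonemes_to_face_params(["HH", "EH", "L", "OW"])
```

The point of the blending is that face shapes don't snap instantly from one phoneme to the next; any plausible editing pipeline has to interpolate between them.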
If you want to generate the speaker's audio as well as the video, other software such as VoCo can be used. It takes a similar approach, breaking a heap of training audio down into phonemes and then using that dataset to generate new words in a familiar voice.
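In its simplest form, that concatenative idea looks something like the sketch below (the snippet data and function names are invented for illustration; real systems operate on actual waveforms and smooth the joins):

```python
# Illustrative sketch of concatenative voice synthesis: training audio
# is segmented into phoneme-labelled snippets, and new words are
# assembled by replaying those snippets in a new order. The sample
# values here are placeholders, not real audio.
snippets = {
    "HH": [0.1, 0.2], "EH": [0.3], "L": [0.0, -0.1], "OW": [0.4, 0.5],
}

def synthesize(phonemes):
    """Concatenate per-phoneme waveform snippets into one signal."""
    audio = []
    for p in phonemes:
        audio.extend(snippets[p])
    return audio

wave = synthesize(["HH", "EH", "L", "OW"])  # a crude "hello"
```

Because the snippets come from the target speaker's own recordings, the assembled output keeps their voice, which is exactly what makes the technique useful and worrying in equal measure.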
The team is aware of the potential its software has for unethical uses. The world has yet to be hit by its first great deepfake scandal – perhaps we'll see deepfakes becoming part of the battleground of the 2020 US elections – but it's easy to imagine them being incredibly effective tools of deception in front of an uneducated audience.
It's even more worrisome to realize that their mere existence will allow dishonest public figures to deny or cast doubt on genuine videos that show them in a bad light. As soon as a couple of decent-sized deepfake scandals have made it past the editors at CNN and been exposed, we'll be entering an era where people cannot, or more to the point will not, trust what they've seen in any video format.
The research team behind this software makes some feeble attempts to deal with its potential misuse, proffering a solution in which anyone who uses the software can optionally watermark the output as a fake and provide "a full ledger of edits." This is obviously no barrier to misuse.
The team also suggests that other researchers develop "better forensics such as digital or non-digital fingerprinting techniques to determine whether a video has been manipulated for ulterior purposes." Indeed, there's some potential for blockchain-style permanent records to be used here, which would allow any piece of video to be compared back to its point of origin. But this isn't in place yet, and it's unclear how it could be implemented globally.
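To make the fingerprinting idea concrete, here's a minimal sketch of the simplest possible version: a cryptographic hash of the original file is registered at its point of origin, and any later copy can be checked against it. (All names here are invented, and a plain file hash is far weaker than real forensics would need to be, since it doesn't survive ordinary re-encoding.)

```python
import hashlib

# Hypothetical point-of-origin ledger: video_id -> fingerprint.
registry = {}

def fingerprint(data: bytes) -> str:
    """A SHA-256 hash of the raw file acts as the fingerprint;
    any edit to the bytes changes it completely."""
    return hashlib.sha256(data).hexdigest()

def register(video_id: str, data: bytes) -> None:
    registry[video_id] = fingerprint(data)

def verify(video_id: str, data: bytes) -> bool:
    """True only if the footage matches its registered original."""
    return registry.get(video_id) == fingerprint(data)

original = b"raw interview footage"
register("interview-001", original)
assert verify("interview-001", original)
assert not verify("interview-001", b"edited footage")
```

The hard part, as noted above, isn't the hashing, it's getting every camera, broadcaster and platform in the world to register and check against a common ledger.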
On the non-fingerprinting side of things, many, if not most, deep learning applications are already working on the problem of how to spot fakes. Indeed, with the Generative Adversarial Network approach, two networks compete against each other – one generating fake after fake, and another trying to pick the fakes from real inputs. Over millions of generations, the discerning network gets better at picking fakes, and the better it gets, the better the fake generating network has to become to fool it.
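The adversarial loop can be shown numerically on a toy problem (this is a deliberately tiny illustration, not anything from the paper): a one-parameter "generator" tries to imitate samples from a target distribution while a logistic "discriminator" learns to tell real from fake, each improving in response to the other.

```python
import math
import random

random.seed(0)
real = lambda: random.gauss(4.0, 0.5)  # "real" data: mean 4, std 0.5
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, x))))

a, b = 1.0, 0.0   # generator: fake = a*z + b, with noise z ~ N(0, 1)
w, c = 0.1, 0.0   # discriminator: D(x) = sigmoid(w*x + c)
lr = 0.05

for step in range(2000):
    z = random.gauss(0.0, 1.0)
    x_real, x_fake = real(), a * z + b

    # Discriminator update: push D(real) toward 1 and D(fake) toward 0.
    d_real, d_fake = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
    w += lr * ((1 - d_real) * x_real - d_fake * x_fake)
    c += lr * ((1 - d_real) - d_fake)

    # Generator update: push D(fake) toward 1 (fool the discriminator).
    d_fake = sigmoid(w * x_fake + c)
    a += lr * (1 - d_fake) * w * z
    b += lr * (1 - d_fake) * w

# After training, the generator's samples should have drifted toward
# the real distribution's mean.
fakes = [a * random.gauss(0.0, 1.0) + b for _ in range(1000)]
mean_fake = sum(fakes) / len(fakes)
```

Scaled up from two parameters to millions, this tug-of-war is why detector and forger improve in lockstep.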
So the better these systems get at automatically spotting fake videos, the better the fakes will become. Thus, the CNNs of this world will not be able to rely on a simple algorithm that lets them auto-scan incoming video looking for deepfakes. It's a complex and serious problem, and it's virtually guaranteed to have a major impact on news reporting over the coming decades, even if it's in an embryonic form right now.
The video below shows how easy it is to edit video using the new algorithm.
Source: Stanford University