It's already possible to create a digital copy of someone's voice, enabling users to produce an audio file of that person saying things they never actually said. Listeners still might not be fooled, though, as there wouldn't be footage of the person speaking those words. Well … University of Washington researchers have now created a system that converts audio clips into lip-synced videos of the speaker.
In order for the system to work, it needs to analyze approximately 14 hours of existing footage of the person speaking – the researchers are hoping to reduce that figure significantly, perhaps down to one hour. Utilizing a neural network, it learns which of their mouth shapes accompany which speech sounds.
When the system is subsequently provided with a "target video" of the person (in which they could be talking about anything), along with an audio file of them speaking the desired words, it pairs the two together. It does so by dropping the video's original audio, replacing it with the desired audio, and mapping a computer-animated version of the speaker's mouth in place of their mouth in the video.
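As a rough illustration of that last step only, the swap-and-overlay idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the researchers' implementation: frames are plain grids of grayscale values, the mouth region is a fixed rectangle, and the synthesized mouth patches are simply handed in, whereas the actual system generates them with a neural network. The names Frame, MouthBox, composite_mouth and resync are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumption: a frame is a grayscale image stored as rows of pixel values.
Frame = List[List[int]]


@dataclass
class MouthBox:
    """Hypothetical fixed rectangle covering the speaker's mouth."""
    top: int
    left: int
    height: int
    width: int


def composite_mouth(frame: Frame, mouth: Frame, box: MouthBox) -> Frame:
    """Return a copy of `frame` with the box region overwritten by `mouth`."""
    if len(mouth) != box.height or any(len(row) != box.width for row in mouth):
        raise ValueError("mouth patch does not match the box size")
    out = [row[:] for row in frame]  # copy so the target video is untouched
    for dy in range(box.height):
        out[box.top + dy][box.left:box.left + box.width] = mouth[dy]
    return out


def resync(target_frames: List[Frame], mouth_patches: List[Frame],
           box: MouthBox, new_audio: List[float]) -> Dict[str, object]:
    """Drop the original audio, attach the new audio, and overlay one mouth patch per frame."""
    if len(target_frames) != len(mouth_patches):
        raise ValueError("need exactly one mouth patch per video frame")
    frames = [composite_mouth(f, m, box)
              for f, m in zip(target_frames, mouth_patches)]
    # The original soundtrack is never passed in, so it is dropped by construction.
    return {"frames": frames, "audio": new_audio}
```

A real pipeline would also have to track the mouth as the head moves and blend the patch into the surrounding face, both of which this sketch leaves out.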
The end result is that viewers hear the person speaking the desired words and apparently see their mouth forming those words, too. Although there's certainly the potential for treachery, the researchers have developed the technology with other uses in mind.
"Realistic audio-to-video conversion has practical applications like improving video conferencing for meetings, as well as futuristic ones such as being able to hold a conversation with a historical figure in virtual reality by creating visuals just from audio," says assistant professor Ira Kemelmacher-Shlizerman. "This is the kind of breakthrough that will help enable those next steps."
You can see and hear the system in use in the following video.
Source: University of Washington