Machine learning makes mono soundtracks more immersive
A stereo sound system will give you a good feel for where instruments have been placed in the soundscape created by the production team. And there are headphones that can custom tune the listening experience for more immersion. Now researchers have developed a machine learning system that can work out the direction of sound in a mono recording by examining accompanying video footage.
A few mechanisms help us to determine where sounds are coming from. A noise to the left will hit our left ear first and the brain uses that to tell us that the sound is indeed coming from our left.
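To put a rough number on that timing cue, here is a minimal sketch that treats the two ears as points a fixed distance apart and ignores the way sound bends around the head. The 0.18 m ear spacing, the function name and the angle convention (0 degrees straight ahead, negative to the left) are assumptions made for illustration, not figures from the research.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at room temperature
EAR_SPACING_M = 0.18        # assumed ear-to-ear distance, for illustration only


def interaural_time_difference(azimuth_deg, ear_spacing_m=EAR_SPACING_M):
    """Estimate how much earlier a sound reaches one ear than the other.

    azimuth_deg: 0 is straight ahead, negative is to the left, positive to the right.
    Returns seconds; a negative value means the left ear hears the sound first.
    """
    # Extra distance the sound travels to reach the far ear (simple two-point model).
    path_difference_m = ear_spacing_m * math.sin(math.radians(azimuth_deg))
    return path_difference_m / SPEED_OF_SOUND_M_S


if __name__ == "__main__":
    for angle in (-90, -30, 0, 30, 90):
        itd_us = interaural_time_difference(angle) * 1e6
        print(f"{angle:>4} degrees: {itd_us:8.1f} microseconds")
```

Under these assumptions, even a sound directly to one side arrives only about half a millisecond earlier at the nearer ear, which gives a sense of how fine the timing differences are.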
A sound in front of us will reach the ear canal without obstruction, but a sound from behind gets distorted by the flappy bits of cartilage and skin that make up our outer ears. Again, the brain uses this to work out sound direction. And the shape of the ear can help place sounds from above, below and all around.
Volume can also be a useful indicator for sound placement in a three-dimensional space. A noise to the left will appear louder in the left ear than the right. And yet again, this information can help the brain work out where such a sound is coming from.
Precisely recreating this sound placement for a mono video soundtrack is quite a challenge. However, Ruohan Gao at the University of Texas at Austin and Kristen Grauman at Facebook Research have designed a system they're calling 2.5D sound, which approximates the direction of a sound using visual cues and then artificially adjusts the time and volume differences between the left and right channels.
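The basic idea of nudging time and volume between two channels can be sketched in a few lines. This is not the researchers' method, which learns its adjustments from video. It is a hand-tuned illustration in which the direction is supplied as an angle, and the function name, the 0.18 m ear spacing and the 6 dB maximum level drop are all assumptions chosen for the example.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at room temperature


def pan_mono(samples, sample_rate_hz, azimuth_deg,
             ear_spacing_m=0.18, max_level_drop_db=6.0):
    """Turn a mono signal into (left, right) using a delay and a volume drop.

    azimuth_deg: 0 is straight ahead, negative is to the left, positive to the right.
    ear_spacing_m and max_level_drop_db are illustrative assumptions, not measured values.
    """
    side = math.sin(math.radians(azimuth_deg))

    # Time cue: the far ear hears the sound a few samples later.
    delay_s = abs(ear_spacing_m * side) / SPEED_OF_SOUND_M_S
    delay_samples = round(delay_s * sample_rate_hz)

    # Volume cue: the far ear hears the sound a little quieter.
    far_gain = 10 ** (-max_level_drop_db * abs(side) / 20.0)

    # Pad both channels so they end up the same length.
    near = list(samples) + [0.0] * delay_samples
    far = [0.0] * delay_samples + [x * far_gain for x in samples]

    if side < 0:
        return near, far  # source on the left: the left ear is the near ear
    return far, near      # source on the right (or dead center: channels match)


if __name__ == "__main__":
    left, right = pan_mono([1.0, 0.5, 0.25], 48000, azimuth_deg=-90)
    print(len(left), len(right))
```

A real system would have to choose these adjustments separately for each sound in a scene and update them as sources move, which is the part the researchers hand over to a trained model.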
The researchers began by building a database of binaural recordings of more than 2,000 musical clips with accompanying video. The recorder used to capture the audio and video consisted of a pair of fake ears placed about a human head width apart, with a microphone in each ear canal picking up directional variations. The scene in front of the device was recorded using a GoPro action cam.
The audio tracks were then used to train a machine learning algorithm to identify where a sound was coming from in the recorded videos. The algorithm was then able to "watch" a video and use what it had learned to distort a mono recording to simulate sound direction.
"We call the resulting output 2.5D visual sound – the visual stream helps 'lift' the flat single channel audio into spatialized sound," said the researchers.
The video below compares the monaural soundtrack with the 2.5D manipulation. The system isn't perfect. It can't approximate sound direction if the sound source is not visible in the video, for example, and it won't recognize sounds that aren't in its learning database (though the latter could be solved by adding more source data to its sound library).
"We plan to explore ways to incorporate object localization and motion, and explicitly model scene sounds," the researchers added.
A paper detailing the research is available online.