Machine learning makes mono soundtrack more immersive

December 28, 2018

The 2.5D system is able to approximate sound direction in a monaural audio track by looking for visual cues in an accompanying video

Syda_Productions/Depositphotos

View 1 Image

1/1

The 2.5D system is able to approximate sound direction in a monaural audio track by looking for visual cues in an accompanying video

Syda_Productions/Depositphotos

A stereo sound system will give you a good feel for where instruments have been placed in the soundscape created by the production team. And there are headphones that can custom tune the listening experience for more immersion. Now researchers have developed a machine learning system that can work out the direction of sound in a mono recording by examining accompanying video footage.

A few mechanisms help us to determine where sounds are coming from. A noise to the left will hit our left ear first and the brain uses that to tell us that the sound is indeed coming from our left.

A sound in front of us will reach the ear canal without obstruction, but the sound behind will get distorted by our flappy bits of cartilage and skin. Again, the brain uses this to work out sound direction. And the shape of the ear can help place sounds from above, below and all around.

Volume can also be a useful indicator for sound placement in a three dimensional space. A noise to the left will appear louder in the left ear than the right. And yet again, this information can help the brain work out where such a sound is coming from.

Precisely recreating this sound placement for a mono video soundtrack is quite a challenge. However, Ruohan Gao at the University of Texas at Austin and Kristen Grauman at Facebook Research have designed a system they're calling 2.5D sound, which approximates the direction of a sound using visual clues and then artificially distorts the left/right time and volume differences.

The researchers began by building a database of binaural recordings of more than 2,000 musical clips with accompanying video. The recorder used to capture the audio and video consisted of a pair of fake ears placed about a human head width apart, with a microphone in each ear canal picking up directional variations. The scene in front of the device was recorded using a GoPro action cam.

The audio tracks were then used to train a machine learning algorithm to identify where a sound was coming from in the recorded videos. The algorithm was then able to "watch" a video and use what it had learned to distort a mono recording to simulate sound direction.

"We call the resulting output 2.5D visual sound – the visual stream helps 'lift' the flat single channel audio into spatialized sound," said the researchers.

The video below compares the monaural soundtrack with the 2.5D manipulation. The system isn't perfect. It can't approximate sound direction if the sound source is not visible in the video, for example, and it won't recognize sounds that aren't in its learning database (though the latter could be solved by adding more source data to its sound library).

"We plan to explore ways to incorporate object localization and motion, and explicitly model scene sounds," the researchers added.

A paper detailing the research is available online.

Source: University of Texas at Austin via MIT

2.5D Visual Sound (CVPR 2019)

1 comment

toyhouse December 29, 2018 04:58 AM

That is impressive. I wonder, can the system also learn to be controlled with icons/animations, etc.? That would help where the system can't see the object. Or be trained to follow non-live visuals, (ie; cartoons)? A lot of material out there recorded in mono. Maybe some best left mono, but a few might really benefit from this I would think.

Machine learning makes mono soundtrack more immersive

Tags

Most Viewed

Toyota and Lexus no longer most reliable carmakers, says Consumer Reports

France runs fusion reactor for record 22 minutes

Laser-wielding device is like an anti-aircraft system for mosquitoes

FREE NEWSLETTER