If you've ever had the feeling that you're being listened to by chip bags, spied on by your houseplants, and eavesdropped on by chocolate bar wrappers, you may not be paranoid; you may just be up on the latest technology. That's because a team of scientists led by MIT, with participants from Microsoft and Adobe, has created a "visual microphone" that uses a computer algorithm to analyze and reconstruct audio signals from objects in a video image.
According to MIT, the visual microphone is able to detect and reconstruct intelligible speech by analyzing the vibrations of items such as a potato chip bag sitting behind soundproof glass 15 ft (4.5 m) away from the camera. In addition to snack food packaging, the scientists also found that the algorithm could reconstruct audio signals from video of an aluminum foil wrapper, the surface of a glass of water, and the leaves of a potted plant.
The algorithm works by measuring the vibrations of an object that have been recorded on video. Just as video can store and reproduce how a person or thing moves across the frame, the MIT team discovered that video of sufficient bandwidth and frame rate can also record sound-induced vibrations, which the algorithm can then translate back into sound.
"When sound hits an object, it causes the object to vibrate," says Abe Davis, a graduate student in electrical engineering and computer science at MIT. "The motion of this vibration creates a very subtle visual signal that’s usually invisible to the naked eye. People didn’t realize that this information was there."
The algorithm passes the video frames through filters and looks for tiny changes in color at the edges of objects, at different orientations and scales. It uses these changes to measure the objects' motion as they respond to sound, then aligns the measurements so they don't cancel one another out. From this combined motion signal, the sound that caused the vibration can be detected, analyzed, and reconstructed.
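To make that idea concrete, here is a minimal sketch of how per-frame edge motion can be collapsed into a single audio-like signal. It is not the team's actual pipeline, which decomposes each frame with filters at many orientations and scales; instead it uses a simple, single-scale brightness-constancy estimate, and the `frames` array and its layout are hypothetical inputs.

```python
# Simplified sketch: estimate a tiny horizontal displacement for each frame
# relative to a reference frame, producing a 1-D signal that tracks the
# object's sound-induced vibration. This is an approximation of the idea
# described above, not the MIT team's algorithm.
import numpy as np

def motion_signal(frames):
    """frames: (T, H, W) grayscale video as floats.
    Returns a 1-D array of length T with the estimated sub-pixel
    displacement of each frame relative to the first frame."""
    ref = frames[0]
    # Spatial gradient of the reference frame; motion shows up mostly
    # where this gradient is large (object edges).
    gx = np.gradient(ref, axis=1)
    weight = gx * gx                      # emphasize strong edges
    signal = np.empty(len(frames))
    for t, frame in enumerate(frames):
        dt = frame - ref                  # temporal intensity change
        # Linearized brightness constancy: dt ~ -gx * displacement,
        # so a weighted least-squares estimate of the displacement is:
        signal[t] = -np.sum(gx * dt) / (np.sum(weight) + 1e-12)
    signal -= signal.mean()               # remove DC offset before playback
    return signal
```

In practice the per-pixel estimates are noisy, which is why averaging them over many edges and scales, as the team does, matters so much.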
The tricky bit is capturing enough information to reconstruct the sound. In practice, this means using video shot at well above the standard 24 frames per second of film or the 60 frames per second of a smartphone, because the frame rate has to be higher than the frequency of the sound being analyzed. In the MIT team's case, this meant 2,000 to 6,000 frames per second.
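As a rough sanity check on those numbers, the snippet below applies the standard Nyquist limit, under which a camera sampling the scene once per frame can only capture audio frequencies up to half its frame rate; the comparison to speech at the end is an illustrative assumption, not a figure from the paper.

```python
# Back-of-the-envelope check of the sampling constraint described above.
for fps in (24, 60, 2000, 6000):
    nyquist_hz = fps / 2          # highest recoverable audio frequency
    print(f"{fps:>5} fps -> frequencies up to ~{nyquist_hz:,.0f} Hz")

# 24 or 60 fps covers only the lowest rumble of a voice, while 2,000-6,000 fps
# spans most of the energy in human speech (roughly 100-3,000 Hz).
```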
If you’re worried that you could be bugged by a wrapper, you can breathe easy – a little bit. MIT says that even though a video taken with a phone might not allow a spy to listen to your conversations, it may provide enough data to reveal the sex of a person, the number of people speaking in a room, and even their identities.
According to MIT, even low-resolution video can provide a lot of information, because modern video cameras, even simple ones, are complex devices with millions of photodetectors. This array produces interesting side effects, such as the distortion seen when filming a helicopter in flight: on film, the spinning rotor is simply a blur, but on a digital camera the blades can twist into a strange spiral, because the sensor scans the image row by row and can't keep up with the whirling blades. For the algorithm, this rolling-shutter distortion is just more data to be analyzed, which allows it to recover audio even from phone videos.
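The back-of-the-envelope sketch below illustrates why that row-by-row readout helps: if each row is treated as its own time sample, the effective sampling rate can be far higher than the nominal frame rate. All of the numbers are assumed values for a typical phone camera, not figures from the MIT paper.

```python
# Rough illustration of the rolling-shutter advantage: every row of a frame
# is a slightly later snapshot of the scene, so rows can serve as extra
# time samples. Numbers below are hypothetical.
fps = 60            # nominal frame rate of a phone camera
rows = 1080         # rows read out per frame
line_delay = 15e-6  # assumed time between successive row readouts, seconds

frame_period = 1 / fps
readout_time = rows * line_delay          # time to scan one whole frame
effective_rate = rows / frame_period      # one sample per row, per frame

print(f"readout covers {readout_time*1e3:.1f} ms of each {frame_period*1e3:.1f} ms frame")
print(f"effective sampling rate ~ {effective_rate:,.0f} samples/s vs {fps} frames/s")
```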
MIT says that the exciting thing about the new algorithm is that it goes beyond potato chip paranoia and into a new way of imaging that extracts information from video with surprising precision: different objects respond to sound in different ways, with motions that can be measured to within a tenth of a micrometer, or about one five-thousandth of a pixel.
The question is, how does the MIT team get any information out of a video when the motions involved are smaller than a pixel? The team says the algorithm manages this by making inferences from changes within the pixels themselves. The example given is a screen that's half red and half blue. Where the two regions meet, the pixels read as purple as the red and blue mix. If the red encroaches on the blue by even a very small amount, the purple strip turns redder, so it's possible to work out what's going on in that strip even though the change is below the pixel level.
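Here is a toy version of that red-and-blue example: a single boundary pixel is modeled as a mixture of the two colors, and watching how the mixture changes recovers a boundary shift of one five-thousandth of a pixel. The colors, the mixing model, and the shift size are illustrative assumptions, not the team's actual measurement.

```python
# Toy sub-pixel example: a boundary pixel is a mixture of a red region
# covering a fraction f of the pixel and a blue region covering the rest.
import numpy as np

red = np.array([255.0, 0.0, 0.0])
blue = np.array([0.0, 0.0, 255.0])

def boundary_pixel(f):
    """Color of a pixel whose area is a fraction f red and (1 - f) blue."""
    return f * red + (1 - f) * blue

def recover_fraction(pixel):
    """Invert the mixture using the red channel alone."""
    return pixel[0] / red[0]

base = boundary_pixel(0.50)                 # boundary exactly mid-pixel
moved = boundary_pixel(0.50 + 1 / 5000)     # boundary shifted 1/5000 pixel
shift = recover_fraction(moved) - recover_fraction(base)
print(f"recovered shift: {shift:.6f} pixels")   # ~ 0.000200
```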
The ability to turn objects into visual microphones has a number of obvious applications in law enforcement, intelligence, and military operations, but the MIT team points out that its ability to extract information from video has other potential uses. For instance, it may have medical applications, such as a way to unobtrusively measure the pulse of a newborn infant from video of its wrist.
"This is new and refreshing. It’s the kind of stuff that no other group would do right now," says Alexei Efros, an associate professor of electrical engineering and computer science at the University of California at Berkeley. "We’re scientists, and sometimes we watch these movies, like James Bond, and we think, 'This is Hollywood theatrics. It’s not possible to do that. This is ridiculous.' And suddenly, there you have it. This is totally out of some Hollywood thriller. You know that the killer has admitted his guilt because there’s surveillance footage of his potato chip bag vibrating."
The team’s findings are described in a paper being presented at this year's Siggraph conference.
The video below outlines the experiment and provides examples of the reconstructed audio.
Source: MIT