This is one of those times when words like amazing and mind-blowing are not hyperbole. Researchers at MIT, Microsoft, and Adobe recently joined forces to do something that seems completely impossible: they've been able to extract audio from visual information alone -- meaning they have recovered sound from videos that have no audio whatsoever.
Here's the video showing their process and how they achieved their incredible results:
And here's a breakdown from the MIT article on the subject:
“When sound hits an object, it causes the object to vibrate,” says Abe Davis, a graduate student in electrical engineering and computer science at MIT and first author on the new paper. “The motion of this vibration creates a very subtle visual signal that’s usually invisible to the naked eye. People didn’t realize that this information was there.”
Reconstructing audio from video requires that the frequency of the video samples — the number of frames of video captured per second — be higher than the frequency of the audio signal. In some of their experiments, the researchers used a high-speed camera that captured 2,000 to 6,000 frames per second. That’s much faster than the 60 frames per second possible with some smartphones, but well below the frame rates of the best commercial high-speed cameras, which can top 100,000 frames per second.
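To put rough numbers on that frame-rate requirement, here's a minimal Python sketch. It leans on the standard sampling rule of thumb (the Nyquist limit: a sampled signal can only represent frequencies below half the sampling rate), which is a bit stricter than the "higher than" wording in the excerpt. The function names and the 440 Hz test tone are made up for illustration and are not taken from the researchers' code.

```python
def nyquist_limit_hz(frames_per_second: float) -> float:
    """Highest vibration frequency a given frame rate can represent unambiguously."""
    return frames_per_second / 2.0


def apparent_frequency_hz(tone_hz: float, frames_per_second: float) -> float:
    """Frequency a pure tone appears to have once sampled at the given frame rate.

    Tones above the Nyquist limit "fold" back into the representable range
    (aliasing), so they are recorded at the wrong pitch.
    """
    folded = tone_hz % frames_per_second
    return min(folded, frames_per_second - folded)


if __name__ == "__main__":
    # Hypothetical example: a 440 Hz tone filmed at the frame rates mentioned above.
    for fps in (60, 2000, 6000):
        print(
            f"{fps} fps: limit {nyquist_limit_hz(fps)} Hz, "
            f"440 Hz tone appears as {apparent_frequency_hz(440, fps)} Hz"
        )
```

Under these assumptions, 60 fps tops out at 30 Hz and a 440 Hz tone would show up as a 20 Hz wobble, while at 2,000 fps the ceiling is 1,000 Hz and the same tone comes through at its true pitch.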
In other experiments, however, they used an ordinary digital camera. Because of a quirk in the design of most cameras’ sensors, the researchers were able to infer information about high-frequency vibrations even from video recorded at a standard 60 frames per second. While this audio reconstruction wasn’t as faithful as it was with the high-speed camera, it may still be good enough to identify the gender of a speaker in a room; the number of speakers; and even, given accurate enough information about the acoustic properties of speakers’ voices, their identities.
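The excerpt doesn't name the quirk, but in the published research it is the rolling shutter: most sensors read a frame out one row at a time, so every row is captured at a slightly different moment. The sketch below is a back-of-the-envelope illustration of why that helps, not the researchers' method. It assumes a 1080-row frame and, by default, a readout spread evenly across the whole frame period (the hypothetical `readout_fraction` parameter); all names here are invented for illustration.

```python
def row_sample_times(frames_per_second, rows_per_frame, frame_index=0, readout_fraction=1.0):
    """Capture time (in seconds) of each row in one rolling-shutter frame.

    readout_fraction is an assumed knob: the share of the frame period spent
    reading rows (1.0 means no idle gap between frames).
    """
    frame_period = 1.0 / frames_per_second
    row_period = frame_period * readout_fraction / rows_per_frame
    frame_start = frame_index * frame_period
    return [frame_start + row * row_period for row in range(rows_per_frame)]


def row_sample_rate_hz(frames_per_second, rows_per_frame, readout_fraction=1.0):
    """Rate at which rows are read while the sensor is scanning a frame."""
    return frames_per_second * rows_per_frame / readout_fraction


if __name__ == "__main__":
    # Assumed numbers: 60 fps video with 1080 rows per frame.
    print(row_sample_rate_hz(60, 1080))
    print(row_sample_times(60, 1080)[:3])
```

With those assumed numbers, a 60 fps video yields 64,800 row readings per second instead of 60 frame readings. Real cameras do pause between frames, and a single row carries far less information than a full frame, which fits with the reconstruction being less faithful than the high-speed version.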
Obviously, needing a high-speed camera for the better-quality sound is kind of a deal-breaker for everyday usage, but the fact that they extracted audio from a regular DSLR -- albeit recording at 60fps -- is absolutely insane. There is some serious math going on in their algorithm, because the vibrations from the sound move the object by only a tenth of a micrometer (0.0001 millimeters), which is imperceptible to the naked eye. We're probably quite a ways off from having software like this as a plugin in Premiere, but being able to extract audio from footage that has none could be very useful, especially if you're recording slow motion in-camera, which typically doesn't capture any audio.
Even if this technique never becomes part of our day-to-day work, it could well prove significant in other fields. To read more about the research and how this technique could be used, check out the MIT post.