MIT's deep-learning software produces videos of the future


Currently, the software can generate videos showing what happens one and a half seconds into the future (Credit: Carl Vondrick/MIT CSAIL)


When you see a photo of a dog bounding across the lawn, it's pretty easy for us humans to imagine how the following moments played out. Well, scientists at MIT have just trained machines to do the same thing, with artificial intelligence software that can take a single image and use it to create a short video of the seconds that followed. The technology is still bare-bones, but it could one day make for smarter self-driving cars that are better prepared for the unexpected, among other applications.

The software uses a deep-learning algorithm trained on two million unlabeled videos, amounting to a year's worth of screen time. It actually consists of two separate neural networks that compete with one another. The first has been taught to separate the foreground from the background and to identify the objects in an image, which allows the model to determine what is moving and what isn't.

According to the scientists, this approach improves on other computer vision technologies under development that can also create video of the future. Those systems take the information available in existing video and extend it with computer-generated imagery, building each frame one at a time. The new software is claimed to be more accurate, producing clips of up to 32 frames and building out entire scenes in one go.

"Building up a scene frame-by-frame is like a big game of 'Telephone,' which means that the message falls apart by the time you go around the whole room," says Carl Vondrick, first author of the study. "By instead trying to predict all frames simultaneously, it's as if you're talking to everyone in the room at once."

Once the scene is produced, the job of the second neural network is to assess whether it is real video or has been produced by a computer. Through this form of "adversarial learning," the first network teaches itself to trick the second into thinking the clips are real, the result being stationary waves that now come crashing down and planted feet that shuffle across a golf course.
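The adversarial back-and-forth described above can be sketched with a toy example. This is not MIT's video model — just a hypothetical, minimal stand-in in which a one-parameter "generator" and a logistic "discriminator" play the same game: the discriminator learns to tell real data from generated data, and the generator learns to fool it, drifting toward realistic output in the process.

```python
# Toy illustration of adversarial learning. A generator emits a single
# value theta; a discriminator D(x) = sigmoid(w * (x - b)) scores how
# "real" a value looks. Real data is the constant 5.0. Each side takes
# gradient steps on its own objective, and the generator's output moves
# toward the real value purely by learning to fool its critic.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

REAL = 5.0          # the "real data" the generator must imitate
theta = 0.0         # generator parameter: the value it emits
w, b = 1.0, 0.0     # discriminator parameters
lr = 0.1
best = theta        # generator output closest to the real data so far

for _ in range(5000):
    d_real = sigmoid(w * (REAL - b))    # discriminator's score for real data
    d_fake = sigmoid(w * (theta - b))   # discriminator's score for the fake

    # Discriminator: gradient ascent on log D(real) + log(1 - D(fake))
    grad_w = (1 - d_real) * (REAL - b) - d_fake * (theta - b)
    grad_b = (d_fake - (1 - d_real)) * w
    # Generator: gradient ascent on log D(fake) -- "fool the critic"
    grad_theta = (1 - d_fake) * w

    w += lr * grad_w
    b += lr * grad_b
    theta += lr * grad_theta
    if abs(theta - REAL) < abs(best - REAL):
        best = theta

print(best)  # approaches 5.0: the generator learned to imitate real data
```

In the real system the generator is a deep network emitting whole video volumes and the discriminator is a video classifier, but the training signal is the same: the generator improves only by making clips the discriminator can't distinguish from real footage.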

And the technology appears to be more realistic than others: the team pitted its handiwork against a baseline of other computer-generated clips and quizzed 150 human subjects on which seemed more realistic. Across more than 13,000 cases, the subjects opted for the new videos 20 percent more often than the baseline.

But there are a few areas that could do with improvement, the team says. The videos are just one and a half seconds long, and extending the motion in a way that still makes sense would be a challenge, one that might require a human eye. In its current form, the model also renders objects and humans much bigger than they are in reality, and these often lack resolution and resemble "blobs."

With further development, the team imagines that the technology could be used in autonomous vehicles that are better at tracking the motion of the objects around them. Other possible applications include filling in gaps in security footage, adding animation to still images and helping to compress large video files.

The research paper can be accessed online, and the team will present its work at the Neural Information Processing Systems (NIPS) conference in Barcelona next week.

You can check out some of the moving images in the video below.

Source: MIT
