The Latest Google Algorithm Creates Video Based On a Few Still Images

By Kelsey Campbell-Dollaghan

Google, a company whose primary business is advertising, also does some pretty incredible things with the technology it has developed.

For example, this month we watched as the company’s “dream robot,” also known as its super-advanced artificial neural network, ran wild across the internet. These networks are being developed by Google’s engineering team for a bunch of practical reasons involving a computer being able to identify the contents of an image, which is a remarkably complex task for a dumb machine. The incredible artificial “brains” Google is teaching to recognise, say, animals or architecture, also happen to be able to “dream,” and the results have startled and awed us.

Here’s another pretty fascinating example of the company’s computer vision work.

This week, MIT’s Technology Review trained its eye on a new paper called DeepStereo: Learning to Predict New Views from the World’s Imagery, whose lead author is a Google engineer named John Flynn. Flynn and his three co-authors, all of whom work at Google, explain how they’ve developed a system called DeepStereo that can look at a series of images of a place and combine them into a seamless animation.

That might not seem very different from other similar projects, for example this SIGGRAPH project that mines images from the web to create time lapses. Well, it’s true that DeepStereo does create something like a time lapse. Except it actually creates new images to fill in the blanks, predicting parts of the image and perspectives it can’t see in any of the source photos. Rather than our eyes filling in the blanks between two disparate still images, DeepStereo itself can “imagine” what’s there, as the Register puts it. “Unlike this prior work, we learn to synthesise new views directly using a new deep architecture, and do not require known depth or disparity as training data,” Flynn and his co-authors write.

Obviously, the network architecture behind this thing is vastly complex and based on various precedents. But the authors do tell us a bit about what’s going on here: There are two separate “towers,” or network architectures, at work. One makes a prediction about the depth of the pixels based on the available 2D data. The other makes a prediction about the colour. Together, they make a prediction about the depth and the colours of the forms in the 2D images, ultimately synthesising a full video.
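To make the two-tower idea a little more concrete, here is a minimal Python sketch of one plausible way to combine a depth prediction with a colour prediction for a single output pixel. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed set of candidate depth planes, and the softmax weighting are all hypothetical choices made for this example, and the numbers are made up.

```python
import math


def softmax(scores):
    """Turn raw per-plane scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def synthesise_pixel(depth_scores, plane_colours):
    """Blend one candidate colour per depth plane, weighted by depth probability.

    depth_scores  -- assumed output of a "depth tower": one score per plane
    plane_colours -- assumed output of a "colour tower": one (r, g, b) per plane
    """
    if len(depth_scores) != len(plane_colours):
        raise ValueError("need exactly one colour per depth plane")
    probs = softmax(depth_scores)
    # Weighted sum of the candidate colours, one channel at a time.
    return tuple(
        sum(p * colour[ch] for p, colour in zip(probs, plane_colours))
        for ch in range(3)
    )


# Three hypothetical depth planes, each proposing a different colour.
colours = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# Depth scores strongly favour the middle plane.
confident = synthesise_pixel([0.0, 20.0, 0.0], colours)

# Depth scores show no preference between the planes.
unsure = synthesise_pixel([0.0, 0.0, 0.0], colours)
```

In this toy setup, `confident` comes out almost exactly equal to the middle plane's colour, while `unsure` is an even mix of all three candidates. A real system would run something like this for every pixel of every synthesised frame, with both sets of inputs produced by trained networks.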

Watch closely below, and you’ll see where DeepStereo gets tripped up: Moments where corners look blurry or pixelated. “[R]egions where the algorithm is not confident tend to be blurred out, rather than being filled with warped or distorted input pixels,” the team explains. The model even has a way of dealing with objects that were moving when the source images were taken. “Moving objects, which occur often in the training data, are handled gracefully by our model: They appear blurred in a manner that evokes motion blur.”

Of course, the final product, to eyes without knowledge of what it took to create it, doesn’t look all that different from a time lapse. But knowing that so much of these videos is created from scratch by a deep-learning algorithm makes a banal tour of Street View pretty extraordinary.