AI tools are now creating video, and matching sound effects
"No Lights, no cameras, all action." You knew it was coming. One of the key companies behind the Stable Diffusion image generator has launched a mind-blowing AI video creation and editing tool that operates something like DALL-E for moving pictures.
Runway AI is working on a number of extraordinary next-gen creative AI projects, but its freshly released Gen-1 video tool is a truly confronting snapshot of where this stuff is at, and how quickly it's advancing. Take a quick look at our wrap-up on the state of creative AI back in 2015 for some context.
And then take a look at what Gen-1 can do. It's not an outright text-to-video generator; you can't just ask it to go away and make a dog food commercial in the style of Hitchcock. Well, not yet. Instead, it asks you for an input video, and then creates different versions of that input video in response to text, image or video prompts.
So if you go and film something extremely roughly – just to get the basic angles, actions and camera movements down – you can ask Gen-1 to take that footage and recreate it in a completely different style. You can flat-out tell it "make this a film noir scene," or "make this an underwater scene set in Atlantis," or "put these characters on a moving bus in London." It's like you can now instantly design your own Snapchat filters.
Or you can find an image or video example that fits the style you're going for, and just upload it – Gen-1 will analyze it, work out what it is, and then do its best to recreate the key elements of your video in a similar context. Or you can get it to isolate and track a subject, and change it in some way. Or you can use a broader set of training data to improve the fidelity of your results. Check it out:
Yes, like Snapchat filters, it's a bit crude, flickery and fidgety right now – but even in its current form, it's already absolutely relevant to music videos, commercials and a broad range of other artsy video projects.
And it doesn't even matter if it's Gen-1 or something else; it should be clear enough where this will go. The pace of progress in creative AI is going gangbusters. Blink, and algorithms like this will be making whole movies in 4K 3D. Upload Pulp Fiction and see it performed entirely by dogs. Take a cartoon and generate a different live-action version of it for every region you're showing it in, changing the race of the cast, the setting, the backgrounds and the landmarks to let everyone feel at home. Give everyone in the movie a handlebar moustache. Auto-replace your product placements. Take Winnie the Pooh off the kids' toy shelf for the Chinese release. Put the buttholes back on the cats.
This will grow to become a super-fast, super-cheap visual effects studio in a box. And lest the sound effects guys feel smug, Runway's got audio jobs in its sights as well.
The company still appears to be at the research stage with another system, called Soundify. Soundify accepts a video input, analyzes it to work out what it is and what's likely happening, and then creates audio to match.
So let's say you upload a scene where somebody gets in a car parked in the countryside and drives away. It tries to match a background sound to the environment, then tries to identify subjects, and what they're doing, and the exact moments when their activity should cause sounds, and where in the stereo space those sounds should come from. Then it generates that sound, matched up to the video. There should be footsteps, door closing noises, engine noise, tire noise, whatever the scene demands. Here are some examples:
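To make the pipeline described above concrete, here's a purely illustrative sketch of how such a system might be structured. None of this is Runway's actual code or API: every function name, stage and label below is a hypothetical placeholder for the steps the article describes (pick an ambient bed for the environment, turn detected on-screen events into timed, stereo-placed sound cues, and assemble the result).

```python
# Hypothetical sketch of a Soundify-style pipeline -- not Runway's real code.
# A real system would analyze the video frames; here the scene label and
# event annotations stand in for that analysis.

def classify_environment(scene_label):
    # Map a recognized environment to an ambient background bed.
    beds = {"countryside": "birdsong_and_wind", "city": "traffic_hum"}
    return beds.get(scene_label, "room_tone")

def detect_events(annotations):
    # Turn (timestamp, subject, action, stereo_pan) tuples into sound cues.
    # pan runs from -1.0 (hard left) to 1.0 (hard right).
    return [
        {"time": t, "effect": f"{subject}_{action}", "pan": pan}
        for (t, subject, action, pan) in annotations
    ]

def soundify(scene_label, annotations):
    # Assemble an ambience track plus time-ordered, stereo-placed cues.
    return {
        "ambience": classify_environment(scene_label),
        "cues": sorted(detect_events(annotations), key=lambda c: c["time"]),
    }

# The car-in-the-countryside example from the text:
mix = soundify("countryside", [
    (2.5, "engine", "start", 0.0),
    (1.0, "door", "close", -0.3),
    (4.0, "tires", "roll", 0.2),
])
```

The interesting part is the cue list: each generated effect is tied to an exact moment and a position in the stereo field, which is what separates this from simply laying a stock ambience track under the footage.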
Again, like Gen-1, Soundify is an early iteration and it's not yet ready for prime time. But honestly, who's betting against AI tools at this point – particularly ones that'll let a director tweak their output with the same kinds of plain-language prompts they're currently giving to their sound effects team?
These tools are another bittersweet inflection point; they'll democratize moviemaking to an extent that would've been unimaginable a few years ago. They'll also vaporize entire careers – in this case, dream careers for creatives.
At some point soon, these tools will begin to converge. Text generators descended from godlike entities like ChatGPT will begin coming up with entire screenplays, from the concept to the art style and the script, based upon their encyclopedic knowledge of the entire history of the art form, combined with an unprecedented ability to follow current human trends, issues, concerns, language use and fashion.
They'll interface with a DALL-E style image generator to create a coherent visual style, drawing upon every significant piece of human art since cave paintings. And they'll interface with moviemaking tools like Gen-1 and Soundify, again trained on every significant piece of cinema humans have ever created, to pump out entire movies, ads, TikToks, custom Christmas greeting videos, propaganda ... You get the drift. Any style, any face, any voice, any tweaks, nothing will bother it.
Soundtracks? Have you checked out Google's MusicLM tool? Again in its infancy, it creates entire recordings, fully orchestrated and mixed, in nearly any style you can name, in response to text prompts. The music will rise and fall perfectly in response to the script and the action; it'll be trivial for tools like this to pinpoint the emotional climax of a scene and amplify or subvert it with perfectly timed music. The entire system will respond to change requests effortlessly, just as clients already seem to expect of today's video professionals.
Movie trailers, posters, merchandise ... it's hard to see which parts of the entire movie industry can't eventually be turned into lightning-fast algorithms. And looking at where this tech is at right now, we might legitimately be talking about a system that's feasible within 10 years.
On a smaller scale, how about making your own custom Snapchat filter for live video, just using image or text prompts? Three years, tops. Heck, it could drop next week and I don't think I'd bat an eyelid at this point.
Buckle up, folks – this could be a bumpy ride.
I'm reminded of that fleeting moment between 1945 and 1949, when the US was the only country with the A-Bomb. There was a serious debate as to whether we should just destroy our research and vow never to go down that path again. Of course, the argument always remained: What if the Soviets go ahead and develop one anyway after we've torched ours? And so we learned to live with a push-button Armageddon hanging over our heads.
How is AI any different? It's not. Almost more certainly than the A-Bomb, AI has the potential to wipe us out. Yet I think that's a less likely scenario than it simply rendering us obsolete.
Do we humans matter? If we do, why are we rushing headlong into oblivion? Why doesn't someone pull the emergency brake?
The rate of development of AI now seems to be (almost?) exponential.
A potential cause appears here:
I no longer think it will be five years before we arrive at artificial general intelligence. At this exponential rate, it looks like 2023.
We are about to find out what that means.