Google DeepMind's Genie turns images into playable video games in one step – but it's just the latest in a rapidly converging list of technologies that point to a bizarre sci-fi future of interactive entertainment, designed and run by real-time AIs.
DeepMind's Genie AI is a relatively small 11 billion-parameter model, trained on more than 200,000 hours of video of people playing 2D platformer-style games, without human supervision. These are fairly formulaic, so perhaps it's no surprise that Genie has figured out the mechanics and action physics involved – even though the video streams contained no information about when a button or control was pressed.
As a result, this model accepts a single image – be it a photo, sketch, or AI-generated picture – and turns it into a playable game, responsive to user controls. Image to rudimentary interactive environment in a single step.
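To make the "no button presses in the training data" part concrete, here's a minimal, purely illustrative Python sketch of the Genie-style idea: a latent action model infers a small discrete "action" code that best explains the change between consecutive frames, and a dynamics model rolls the world forward from a single prompt image given those codes. The class names and toy logic below are our own assumptions for illustration, not DeepMind's code or API.

```python
# Hypothetical sketch of a Genie-style pipeline (names and logic are ours, not DeepMind's).
from dataclasses import dataclass
from typing import List

Frame = List[int]  # stand-in for an image: a flat list of pixel tokens


@dataclass
class GenieSketch:
    n_actions: int = 8  # Genie learns a small discrete latent-action vocabulary

    def infer_latent_action(self, prev: Frame, nxt: Frame) -> int:
        # Real model: learned from video alone, so the "action" is whatever code
        # best explains the change between frames. Here: a toy hash of the diff.
        return sum(b - a for a, b in zip(prev, nxt)) % self.n_actions

    def predict_next(self, history: List[Frame], action: int) -> Frame:
        # Real model: an autoregressive dynamics model over video tokens.
        # Here: nudge the last frame by the action code so the loop does something.
        return [(t + action) % 256 for t in history[-1]]


def play(model: GenieSketch, prompt_frame: Frame, user_actions: List[int]) -> List[Frame]:
    """Turn a single prompt image into a short, controllable rollout."""
    frames = [prompt_frame]
    for a in user_actions:
        frames.append(model.predict_next(frames, a))
    return frames


if __name__ == "__main__":
    model = GenieSketch()
    # Training-time idea: pull pseudo-actions out of unlabeled video frame pairs.
    inferred = model.infer_latent_action([0] * 16, [2] * 16)
    # Play-time idea: one image in, a controllable rollout out.
    rollout = play(model, prompt_frame=[0] * 16, user_actions=[inferred, 3, 3, 7])
    print(f"Inferred latent action {inferred}; generated {len(rollout)} frames from one image")
```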
I am really excited to reveal what @GoogleDeepMind's Open Endedness Team has been up to 🚀. We introduce Genie 🧞, a foundation world model trained exclusively from Internet videos that can generate an endless variety of action-controllable 2D worlds given image prompts. pic.twitter.com/TnQ8uv81wc
— Tim Rocktäschel (@_rockt) February 26, 2024
Don't get too hung up on the quality of the 'games' you're seeing; Genie is a research project, not a final product. It was trained on videos at a minuscule 160 x 90-pixel resolution and just 10 frames per second, and it generates 'games' at similarly low resolution that run for just 16 seconds at a miserly one frame per second.
But with the basic idea now proven, every indication is that Genie will improve significantly with scale; throw in longer, higher-resolution video clips, sic a ton of compute on this system, and the results should leap in quality the way we're seeing across just about every nook and cranny of the AI space.
So in a sense, Genie is not the real story here. The story is much broader, and it can be summed up like this: Everything you're seeing from advanced text-to-video AIs like OpenAI's jaw-dropping Sora demo from last week is starting to converge with 3D interactive worlds, AI-generated characters and GPT-style natural language models, with VR hardware advancing at pace as well.
The repercussions will be absolutely colossal, a fundamental shift in not just gaming, but entertainment overall. Let me throw some building block videos into the pot here that point to where things are heading.
Take a look at this video from 2021. It shows an AI that, two-and-a-half years ago, had watched enough Grand Theft Auto V to be able to recreate a blurry, stripped-down facsimile of the game, complete with a drivable car, in real time.
Again, that was a couple of years ago, and we've all seen the berserk pace of progress here. The takeaway from this video is: AI game generation will certainly not stop at Genie's 2D platformers. It's long had the ability to do this kind of thing in 3D, and essentially it's just a matter of where the focus is pointed at a given time. Gaming is heading toward a place where everything you see, hear and do will be generated by an AI in real time.
Secondly – and this is also perhaps old news, but it's an important building block here. We've written before about AI-generated video game NPCs, whose looks, personalities, goals and knowledge you can tweak using natural language, and with whom players can converse either verbally or through text with no limits on conversation topics.
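For a sense of how simple the plumbing behind those NPCs can be, here's a hedged sketch: the character's personality, goals and knowledge live in an editable plain-text persona that a chat model turns into open-ended dialogue. The persona, the `call_llm` stub and every name here are hypothetical placeholders so the example runs standalone; a real build would wire in an actual chat endpoint.

```python
# Hedged sketch of a language-configured NPC; all names are hypothetical and the
# LLM call is stubbed out so the example runs offline.

PERSONA = """You are Mireille, a dockside smuggler in a rain-soaked port city.
Goals: protect your crew; never reveal where the hidden cargo is.
Knowledge: the city's gangs, shipping routes, the player's last three jobs.
Tone: wry and guarded; warms up if the player mentions old jazz records."""


def call_llm(messages):
    # Placeholder for whatever chat-completion service a studio actually uses.
    return "You ask a lot of questions for someone with no umbrella."


def npc_reply(history, player_line):
    # Build the conversation: designer-authored persona, then the running dialogue.
    messages = [{"role": "system", "content": PERSONA}] + history
    messages.append({"role": "user", "content": player_line})
    reply = call_llm(messages)
    history += [{"role": "user", "content": player_line},
                {"role": "assistant", "content": reply}]
    return reply


if __name__ == "__main__":
    history = []
    print("Mireille:", npc_reply(history, "Heard you know a way into the old docks."))
```

Tweaking the NPC is then just a matter of editing the persona text, which is exactly the kind of natural-language control described above.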
If you haven't seen this stuff in action, it's getting faster, more responsive and better all the time. Check out what Alystria AI has done using Cyberpunk 2077, Ghost of Tsushima, Red Dead Redemption 2 and other open-world titles as a baseline, making some of the world's most iconic characters fully AI-interactive within the context of the game.
In the above examples, the original character actors' voices have not been preserved, but that's frankly trivial now from a tech standpoint if contract arrangements allow it. There are apps you can download right now to clone your own voice, or anyone else's – it's a good time to start setting up code words with your older relatives, because bad-faith actors need very little of your voice to start cloning it and ringing them asking for money.
Given the hundreds of hours of high-def voice recordings that go into video game production, there are massive opportunities for game studios to train voice models. We wouldn't be surprised to see a flood of AI-enhanced re-releases of older games in which players can hold boundless natural conversations with iconic NPCs as they play.
Now let's take a quick refresher on OpenAI's Sora, which as of this minute strikes us as the world's most advanced text-to-video generator – although by the time we hit publish, it may well have been eclipsed. Here's one of many more recent videos released since Sora's debut last week.
You are not ready for this.
New Sora videos just dropped and they are wild.
100% AI (minus the sound).
10 new videos: 🧵👇
1. Scuba diver discovering a futuristic shipwreck pic.twitter.com/A2Itlehvl4
— Min Choi (@minchoi) February 24, 2024
Sora isn't just generating the most staggeringly photorealistic videos we've ever seen come out of an AI; it's also capable of creating persistent characters, styles and environments. That is, scenes in which the camera might look around, then look back, and the objects are still there. Characters that stay consistent between different scenes. That sort of thing.
And it's also developing, simply by ingesting so much video from the world around it, a staggering understanding of how physics works in the real world, and how objects, surfaces and substances relate to and interact with one another. Here's Sora's attempt at creating a helmet-cam view of a Formula 1 race set in San Francisco.
Mind blow with this new example of video generated by #Sora.
Filmmakers have nothing to worry about
"an f1 driver races through the streets of san francisco during the day, the driver's pov is captured from a helmet cam. the golden gate bridge and the cityscape can be seen… pic.twitter.com/zQZgdQjENq
— Paul Æ Blundell (@PAUL__BLUNDELL) February 25, 2024
Look closely, and it's janky as hell, with silly mistakes everywhere. But we're not talking about what's here now; we're looking at the near-future point toward which all of this stuff is converging. Sora shows us the shocking level of quality you can generate given enough training data and compute, and videos like the above are simply what it's capable of in 2024.
Next we can quickly pull in audio and sound effects, which we saw last week, again in a relatively early and janky form, from ElevenLabs.
We were blown away by the Sora announcement but felt it needed something...
What if you could describe a sound and generate it with AI? pic.twitter.com/HcUxQ7Wndg
— ElevenLabs (@elevenlabsio) February 18, 2024
So basically, whatever you're generating visually, another AI can take and put an audio track onto. Easy.
And of course, if you want a soundtrack, AI music generation is also moving at a shocking pace. Here's a random example I found – it's pop music, not a soundtrack, but it shows how easy it is now to throw some lyrics into a pot and generate an entire song, complete with vocals.
AI Music Newsletter #12
Suno V3 dropped.
So I put in some lyrics, made a bunch of variations, and cried a bit.
That's it, that's the newsletter. 🧵👇👇 pic.twitter.com/EAJmdRWClc
— AmliArt (@amli_art) February 23, 2024
In the broader interactive entertainment scenario we're building, you can take it as read: soundtracks can absolutely be generated in near-real time, in a way that's responsive to action. And there's no reason why NPC characters won't soon be composing songs about what you've been up to in the game and singing them to you, again in a way that's totally interactive.
So let's look at the building blocks we've got here:
- AI-generated playable games with responsive controls
- Real-time neural generation of interactive game worlds
- Language-based generation and tuning of fully-interactive NPC characters
- Text-to-video generation of super high quality visuals, in just about any style, with persistent styles, characters and environments
- Video-and-text-to-audio foley and sound effects generation
- AI soundtrack generation
Throw those together with rapidly improving language models like GPT, with their ability to create and respond to narratives while also driving a range of other AI technologies, and you get a very different picture of what video-game design will be in the not-too-distant future.
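If you squint, the architecture implied by that list is an orchestration loop: a language-model "director" interprets what the player says or does, then drives a world model, an NPC system and an audio engine each tick. The sketch below is speculative and purely illustrative; every class is a stub and every name is our own invention, not any real engine's API.

```python
# Speculative sketch of an LLM-directed game loop; all components are stubs and
# all names are hypothetical. The point is the shape of the loop, not a real API.

class Director:
    def plan(self, player_input, world_state):
        # Real version: an LLM turns free-form player intent into structured events.
        return {"event": f"player wants to {player_input}", "mood": "tense"}


class WorldModel:
    def step(self, state, event):
        # Real version: real-time neural generation of the next interactive frame.
        return state + [event["event"]]


class NPCSystem:
    def react(self, event):
        return f"NPC reacts: '{event['event']}'"


class AudioEngine:
    def score(self, mood):
        return f"soundtrack cue: {mood}"


def game_tick(player_input, state, director, world, npcs, audio):
    # One pass through the loop: interpret intent, update the world, cue NPCs and music.
    event = director.plan(player_input, state)
    return world.step(state, event), npcs.react(event), audio.score(event["mood"])


if __name__ == "__main__":
    state, npc_line, cue = game_tick("open the castle gate", [],
                                     Director(), WorldModel(), NPCSystem(), AudioEngine())
    print(npc_line, "|", cue)
```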
This video was generated using text-to-video AI model Sora by OpenAI.
Prompt: Epic gaming pic.twitter.com/pCql7zIl8n
— kokaha@休暇中 (@kokahashinda) February 26, 2024
You'll be able to start with nothing, or with a sketch or two, and have AI generate an interactive world, which even to begin with will probably be extraordinarily beautiful.
Then, like a digital God, you'll be able to say, "Let there be tree," and there will be tree, and if it is not good, you'll be able to request a different tree. You'll be able to create your characters just by painting a verbal picture: "I want a talking donkey with a Mexican accent and a chip on his shoulder. No, more sassy. Let's give him an air of danger and a penchant for epic storytelling about his shady past as a merchant sailor. Lose the sombrero, let's go with a cowboy-style handkerchief. His hidden motive in this story is that he's looking for his sister, who he believes may be held by ninjas in the castle on top of that hill."
The term 'gaming' hardly covers what we're talking about here; you'll be able to verbally design an experience, then play through and interact with it, adjusting things like a director instead of like a programmer. Given enough computing resources, you'll be able to generate entire games this way; shareable single- or multi-player expressions of your own individual imagination that others can enjoy and potentially iterate on with their own touches.
Place this in a VR context with real-time neural generation capabilities, and a GPT-X level ability to manage the overall experience and generate narratives, and ... well, you've got the Holodeck from Star Trek, or for that matter, "the simulation." Entire interactive worlds, populated with interactive characters of your choosing, where anything you desire can happen in response to real-time requests. Who's turning on Netflix or the PS7 when an interactive version of whatever you can think of is available?
"Cinematic trailer for a group of adventurous puppies exploring ruins in the sky"
— Tim Brooks (@_tim_brooks) February 21, 2024
Video generated by #Sora pic.twitter.com/FNZmvstONj
One shudders to think what happens when this stuff is controlled by corporations or advertisers, who will have an unprecedented ability to steer your experiences in ways that benefit them.
This won't all happen overnight. Hardware is probably the main limiting factor at this point: there are only so many GPUs in the world to train and run this stuff on, although new chips are being designed and put into production specifically to drive the AI industry's push toward artificial general intelligence and beyond.
So that need is being addressed as fast as human commerce is capable of doing it – but we're probably not within 12 months of seeing Sora-quality video creation in real time, so there's a little room to breathe there.
Major leaps in hardware, connectivity and energy storage would be needed to run this stuff through a compact VR headset, as well as further work around haptic feedback mechanisms that'll embody players even more within these experiences.
Taking things toward the limits of where we can see things going, maybe the best way to get these extraordinary visions, sounds and sensations into our brains is directly through wires, skipping our fallible sense organs altogether.
Check out our latest video to learn more about our PRIME Study! 🧠📱 pic.twitter.com/7zTMFzdZsF
— Neuralink (@neuralink) November 22, 2023
Brain-computer interface technology is already further advanced than many people realize, and while most of it is currently targeted at medical use, Elon Musk has been clear from the beginning that the eventual point of Neuralink is to create a connection between humans and AIs, one that can move far more information back and forth than the low-bandwidth bottlenecks of keyboards, voice recognition and even language itself allow. The goal? Brain-to-AI communication both ways, at the speed of thought.
And we're seeing other tech coming through that's focused on monitoring and responding to humans at an even deeper level than thought: emotionally responsive technology that makes your real-time feelings another input a system can react to, pushing excitement to its peak or playing your heartstrings to perfection, then knowing exactly when a moment is dying so the pace of an experience can be tuned precisely to the user.
As for the AIs themselves ... No matter how much we at New Atlas do our best to keep up with what's happening in this space, I don't think we, or the vast majority of people, have any idea how quickly these things are really advancing. Sora is a good example; we get the impression OpenAI had that one in the bag for several months before it decided to make an announcement, and it chose to drop it just to stomp all over Google's Gemini 1.5 release.
It worked; with our limited human resources and our policy only to use human writers, we had to choose which to cover, and Gemini didn't get a guernsey.
Ok, I've been waiting a while before saying it confidently, but I now understand that the Gemini-1.5 Pro 🌟 had its deserved spotlight stolen last week by Sora.
After experimenting with it for a while, I believe it represents the most significant advancement in LLM capabilities… pic.twitter.com/tlesXGWoAR
— Itamar Golan 🤓 (@ItakGol) February 18, 2024
Gemini 1.5 is its own game-changer, and we couldn't even get to it. The rate of world-changing progress in AI is absolutely unprecedented, not just in our lifetime but probably in the history of humanity.
So when we see Google's Genie, embryonic and low-res as it is today, it's all part of a giant, building tsunami of disruption and convergence that's bringing science fiction into fact at a dizzying rate. I keep saying it: Buckle up, folks – these head-spinning concepts will keep coming at an accelerating rate.
This is not just DeepMind and OpenAI; it's an entire nascent industry with massive investment pouring into it that hasn't yet begun to hit its stride. Different sides of AI are crashing together more and more often, and starting to converge with a range of other technologies that are themselves advancing, even if at a slower rate.
Every little piece of the world that these things learn to understand and replicate for our amusement is a step towards embodied intelligences in humanoids as well as other types of robots. Each is also a step toward artificial general intelligence – and very soon thereafter, artificial super-intelligence. These two concepts seemed ludicrously far off in the future just a year or two ago, but I wouldn't bet against either being announced in the next 12 months.
The world of 2030, just six years away, is becoming a complete mystery to me. I have no idea what skills I should be teaching my five- and 10-year-old kids to prepare them. Do you? Honest question; I'll be checking the comments section!
Source: DeepMind and many others.