The AI behavior models controlling how robots interact with the physical world haven't been advancing at the crazy pace that GPT-style language models have – but new multiverse 'world simulators' from Nvidia and Google could change that rapidly.
There's a chicken-and-egg issue slowing things down for AI robotics: large language model (LLM) AIs have enjoyed massive troves of data to train from, since the Internet already holds an extraordinary wealth of text, image, video and audio data.
But there's far less data for large behavior model (LBM) AIs to train on. Robots and autonomous vehicles are expensive and annoyingly physical, so data around 3D representations of real-world physical situations is taking a lot longer to collect and incorporate into AI models.
This is one of the reasons why Tesla was so keen to get self-driving hardware into as many of its cars as possible, as early as possible, to give the company a head start on data collection that could position it as the leader in autonomous vehicles.
But recent announcements from Nvidia and Google DeepMind suggest this data bottleneck will soon be eliminated, unlocking a massive acceleration of physical AI development.
Multiversal AI acceleration through real-world data simulation
The idea is to generate enormous amounts of reliable training data through the use of multiverse-style world simulators that can take a single real-world situation – or even just a text prompt – create a virtual model of it, and then split that model into a theoretically infinite number of slightly different situations.
So if, for example, you've got six cameras' worth of data from an autonomous car driving down a street on a nice summer's day, you could take that data, virtualize it to create a 3D world representation, and then use it to generate a huge number of slightly different situations. You could recreate the same situation at 100 different times of the day and night, under 100 different weather conditions that might include rain, snow, heavy wind or dense fog.
You could then split out virtual worlds for each of these time and weather scenarios, in which other vehicles on the road, or pedestrians, or animals, or objects, act slightly differently, creating an entirely new situation for your autonomous car to react to. If something drops, you can simulate it bouncing away in 100 different directions. You can simulate all sorts of edge cases that are incredibly unlikely in the real world.
And of course, you can split out different worlds from each of these, in which the autonomous car itself reacts and chooses different courses of action.
You can then take that simulated 3D world representation, and work backwards to generate high-quality simulated video feeds for all six of your original car's cameras – and data feeds for whatever other sensors your robotic system might have.
And hey presto: your single original chunk of data can turn into thousands, or even millions, of similar but slightly different training scenarios, all generated using advanced physics and materials simulators.
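To make that pipeline concrete, here's a minimal Python sketch of how such a data multiplier might hang together. Everything in it is illustrative: the WorldModel class, its methods and the log file name are hypothetical stand-ins, not Nvidia's or DeepMind's actual APIs.

```python
import itertools
import random

# Hypothetical stand-ins for a world-simulator API -- these names are
# illustrative, not Nvidia Cosmos' or DeepMind's actual interfaces.

class WorldModel:
    """A 3D world reconstructed from one real-world recording."""

    def __init__(self, sensor_logs, conditions=("midday", "clear")):
        self.sensor_logs = sensor_logs      # e.g. six synchronized camera feeds
        self.conditions = conditions

    def branch(self, time_of_day, weather, agent_seed):
        """Fork a variant world: same street, different conditions and behaviors."""
        variant = WorldModel(self.sensor_logs, (time_of_day, weather))
        variant.rng = random.Random(agent_seed)  # drives pedestrian/vehicle behavior
        return variant

    def render_sensors(self):
        """Work backwards from the 3D world to simulated camera feeds."""
        return {"cameras": [f"cam_{i}" for i in range(6)],
                "conditions": self.conditions}

def multiply_dataset(seed_world, times, weathers, behavior_seeds):
    """One real drive in, thousands of physically plausible variants out."""
    for t, w, s in itertools.product(times, weathers, behavior_seeds):
        yield seed_world.branch(t, w, s).render_sensors()

seed = WorldModel(sensor_logs="summer_day_drive.log")   # hypothetical recording
times = [f"{h:02d}:00" for h in range(24)]              # 24 times of day and night
weathers = ["clear", "rain", "snow", "heavy_wind", "dense_fog"]
behavior_seeds = range(100)                             # 100 behavior rollouts each

dataset = list(multiply_dataset(seed, times, weathers, behavior_seeds))
print(len(dataset))  # 24 * 5 * 100 = 12,000 scenarios from one drive
```

The numbers are arbitrary, but the multiplier is the point: three cheap axes of variation turn a single recording into 12,000 distinct training scenarios.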
"The ChatGPT moment for robotics is coming," said Jensen Huang, founder and CEO of Nvidia, announcing the launch of the company's Cosmos world simulation model during his keynote at CES. "Like large language models, world foundation models are fundamental to advancing robot and AV development, yet not all developers have the expertise and resources to train their own. We created Cosmos to democratize physical AI and put general robotics in reach of every developer."
The Cosmos model can also operate in real time, according to the video below, "bringing the power of foresight and multiverse simulation to AI models, generating every possible future to help the model select the right path."
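That "generating every possible future" idea is what roboticists often implement as sampling-based planning: roll each candidate action forward through the world model, score the predicted outcome, and act on the winner. Here's a toy sketch in that spirit – the random-walk dynamics function is an invented stand-in for a learned simulator, not anything Nvidia has published:

```python
import random

def simulate_future(state, action, steps=10):
    """Stand-in for a learned world model: rolls one candidate future forward.
    Here it's a toy 1D drift model; in a real system this is the expensive,
    learned part."""
    rng = random.Random(hash((state, action)))  # deterministic per rollout
    for _ in range(steps):
        state += action + rng.uniform(-0.1, 0.1)  # dynamics plus noise
    return state

def score(final_state, goal=0.0):
    """Higher is better: how close the rollout ends to where we want to be."""
    return -abs(final_state - goal)

def choose_action(state, candidate_actions):
    """'Multiverse' foresight: simulate every candidate future, pick the best."""
    return max(candidate_actions, key=lambda a: score(simulate_future(state, a)))

# e.g. a robot drifting at position 1.0 picks the correction most likely
# to bring it back to the goal position of 0.0
print(choose_action(1.0, [-0.2, -0.1, 0.0, 0.1, 0.2]))  # almost surely -0.1
```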
Obviously, the data and processing requirements for this sort of thing will be absolutely epic, and Nvidia has attempted to help address this with its own Cosmos Tokenizer, which can turn images and videos into tokens that AI models can process using about 1/8th the amount of data required by today's leading tokenizers, unlocking a 12x speed boost in processing.
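The leverage here is easy to see with back-of-the-envelope numbers: token count sets the sequence length every downstream model has to process, and transformer attention cost grows roughly with the square of that length. A quick illustration, with invented per-frame token counts (only the 8x and 12x ratios come from Nvidia's claims):

```python
# Back-of-the-envelope only: the per-frame token counts are invented for
# illustration; just the 8x and 12x ratios come from Nvidia's claims.

frames = 300                        # ~10 seconds of 30 fps video
baseline_tokens_per_frame = 2048    # hypothetical conventional tokenizer
compression_gain = 8                # Cosmos Tokenizer: ~1/8th the data

baseline_len = frames * baseline_tokens_per_frame
cosmos_len = baseline_len // compression_gain

print(baseline_len, cosmos_len)           # 614400 vs 76800 tokens
print((baseline_len / cosmos_len) ** 2)   # ~64x: theoretical attention saving

# Nvidia's quoted 12x speed-up applies to the tokenizer itself; the 64x
# above is the separate, theoretical win that shorter sequences hand to
# quadratic-cost attention in downstream models.
```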
As the world's leading AI hardware provider, Nvidia already has a solid chunk of the emerging robotics industry on board with the Cosmos initiative. Companies like 1X, Figure AI, Fourier and Agility are adopting Cosmos to accelerate the training of humanoid robots, and Xpeng, Uber, Waabi and Wayve are among the autonomous car companies that are getting involved.
Meanwhile, Google DeepMind is launching its own similar initiative – albeit apparently a decent step behind Nvidia. Former OpenAI Sora lead Tim Brooks, who now leads DeepMind's video generation and world simulation team, made the following post on X yesterday:
DeepMind has ambitious plans to make massive generative models that simulate the world. I'm hiring for a new team with this mission. Come build with us! https://t.co/pqvALtAvLs https://t.co/vtwgeXl9Dl
— Tim Brooks (@_tim_brooks) January 6, 2025
In the job descriptions linked, the Google team points out that this kind of physical world simulation will be a critical step on the path to artificial general intelligence (AGI): "We believe scaling pre-training on video and multimodal data is on the critical path to artificial general intelligence. World models will power numerous domains, such as visual reasoning and simulation, planning for embodied agents, and real-time interactive entertainment."
Friends, it can be hard to know what's significant in the firehose of announcements around AI progress, and nigh-on impossible to keep track of everything that's going on. But to put this stuff in context, where LLMs like GPT are rapidly coming for white-collar jobs, LBMs embodied in robots – be they humanoid, vehicle-oid or in some other shape designed for a specific environment – are coming for anything more blue-collar, or that involves more interaction with the physical world.
The technology in this sector is already absolutely incredible, barely distinguishable from magic, and it promises to fundamentally and profoundly change the world over the coming years and decades. This multiverse simulation gear looks like it'll significantly accelerate progress toward the utopian vision of the post-labor economy... Or whatever less palatable outcome we might get instead.
Source: Nvidia / Google DeepMind