Imagine a next-gen VR experience that lets you speak realistic scenes, intelligent characters and complex situations into being, then interact with them in real time. It's coming, due to a convergence of tech like this advance in real-time 3D video.
I've been banging on about this for months now; between AI video generators like Sora, AI character and narrative building tools, AI music and sound effects creation tools, and projects like Google Genie, dedicated to live-creating entire interactive games and experiences in real time, most of the major ingredients are already here – in embryonic form.
Sure, there's no proper hologram generator as yet, but if you're willing to accept a VR headset on your noggin, speed, latency and convergence strike me as the only barriers between where we are now and a fully functioning Holodeck experience, in which you simply say where you want to be, who else is there and what should be happening, and then have a version of that appear before your eyes as a fully interactive experience.
Every now and then, in the flurry of progress across the AI field, something catches my eye that seems to bring this kind of experience another step closer. And today's is a research paper titled Representing Long Volumetric Video with Temporal Gaussian Hierarchy.
[Embedded tweet: real-time rendering demo on the SelfCap dataset, Min Choi (@minchoi), December 15, 2024, pic.twitter.com/5qmhpVE2dS]
Volumetric video is by its nature more complex than regular video. Instead of a 2D array of square pixels changing over time, volumetric video generates cubic 'voxels' in a 3D space, a much more useful representation of a scene if you want to be able to walk around it and change your perspective. When you're playing a video game that represents the world in 3D, you're looking at volumetric video.
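To put some rough numbers on why that's so much harder, here's a minimal Python sketch (my own illustration, nothing to do with the paper's code) comparing the memory footprint of one 1080p frame of pixels with one naive volumetric frame stored as a dense voxel grid. The 512-cube grid size is an arbitrary assumption, chosen just to show how quickly the data balloons.

```python
import numpy as np

# A single 1080p video frame: a 2D grid of RGB pixels.
frame_2d = np.zeros((1080, 1920, 3), dtype=np.uint8)

# A naive volumetric frame: a dense 3D grid of RGBA voxels
# at a fairly modest 512 x 512 x 512 resolution (assumed for illustration).
frame_3d = np.zeros((512, 512, 512, 4), dtype=np.uint8)

print(f"2D frame: {frame_2d.nbytes / 1e6:7.1f} MB")  # ~6 MB
print(f"3D frame: {frame_3d.nbytes / 1e6:7.1f} MB")  # ~537 MB, per frame
```

Roughly 6 MB versus half a gigabyte, per frame, before you even get to motion. That's the gap clever representations like this one have to close.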
This paper details an advance in volumetric video representation that radically reduces the video RAM and data storage needed to render photo-realistic video from 3D video assets. It can render highly detailed scenes in 1080p resolution, at 450 frames per second, for a full 10 minutes or more, using a standard Nvidia RTX 4090 GPU – and it can do it in real time, allowing interactive camera movement and whatnot.
The technique involved – Temporal Gaussian Hierarchy – essentially looks at the scene and works out which areas in the scene are changing quickly, and which are moving more slowly or not at all, and creates a hierarchy of representation so it can dedicate more time to rendering the complex, fast-moving bits and save time by dedicating less processing to the slow or static bits.
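The paper's actual machinery is built on dynamic 3D Gaussian splatting, but the scheduling idea at its heart is easy enough to sketch. The toy Python below is my own illustration, not the authors' implementation: the element count, the "change rate" scores and the refresh intervals are all made-up numbers, there purely to show how updating slow-moving parts of a scene less often slashes the total work per second of video.

```python
import numpy as np

rng = np.random.default_rng(0)
num_elements = 10_000
# Hypothetical per-element "rate of change" scores, as if measured by
# comparing the scene across frames (0 = static, 1 = changing rapidly).
change_rate = rng.random(num_elements)

# Bucket elements into a crude hierarchy: 0 = slow/static, 1 = medium, 2 = fast.
levels = np.digitize(change_rate, bins=[0.33, 0.66])

# Fast-changing elements get refreshed every frame; slow ones only rarely.
refresh_interval = {0: 16, 1: 4, 2: 1}

num_frames = 450  # roughly one second at the reported frame rate
updates_per_level = np.zeros(3, dtype=int)

for frame in range(num_frames):
    for level in range(3):
        if frame % refresh_interval[level] == 0:
            # In a real renderer this is where that level's scene elements
            # would be re-fit or re-rendered; here we just count the work.
            updates_per_level[level] += np.count_nonzero(levels == level)

naive_work = num_elements * num_frames
hierarchical_work = updates_per_level.sum()
print(f"naive updates:        {naive_work:,}")
print(f"hierarchical updates: {hierarchical_work:,} "
      f"({hierarchical_work / naive_work:.0%} of naive)")
```

Run it and the hierarchical schedule does a little over a third of the naive per-frame work, and that's with only three levels and invented numbers; the real system's savings come from doing this kind of triage far more cleverly, across both memory and compute.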
Boy does it do a good job, too. The researchers, a multinational team from Zhejiang University, Stanford University and the Hong Kong University of Science and Technology, say the technique generated 18,000 frames of video using just 17.2 GB of VRAM and 2.2 GB of storage – a 30X and 26X reduction, respectively, compared to the previous state-of-the-art 4K4D method.
Check out a more detailed explanation in the video below, if you've got the noggin for this kind of thing!
Whatever the witchcraft behind it, the results are extraordinary, as you'll have seen in the videos embedded throughout this piece. Just the way the hair is rendered blows my tiny mind. Again, that's in real time on a standard, if high-end, consumer-grade video card.
This kind of efficient, instantaneous rendering of complex 3D worlds could well become a crucial part of that Holodeck VR experience; if you can generate 450 frames of volumetric video per second, well, you can generate 225 frames of stereo 1080p vision per second for a VR headset, as shown below with an Apple Vision Pro.
[Embedded tweet: real-time VR demos on Apple Vision Pro, Min Choi (@minchoi), December 15, 2024, pic.twitter.com/dYS29m60Ei]
It's pretty crazy stuff, and yet another reminder of the wild acceleration we're seeing across multiple fields in 2024. Very neat!