Terra: Explorable Native 3D World Model with Point Latents
- URL: http://arxiv.org/abs/2510.14977v1
- Date: Thu, 16 Oct 2025 17:59:56 GMT
- Title: Terra: Explorable Native 3D World Model with Point Latents
- Authors: Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Xin Tao, Pengfei Wan, Jie Zhou, Jiwen Lu
- Abstract summary: We present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents.
- Score: 74.90179419859415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Our Terra enables exact multi-view consistency with native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on the challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.
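To make the two components concrete, below is a minimal PyTorch sketch of the P2G-VAE idea: each input point is encoded into a latent point (its position plus a sampled feature vector), and each latent point is decoded into the parameters of one 3D Gaussian primitive (position, scale, rotation, opacity, color). All class names, dimensions, and activation choices here are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of a point-to-Gaussian VAE forward pass. Everything below
# (module names, shapes, offsets) is assumed for illustration only.
import torch
import torch.nn as nn

class P2GVAESketch(nn.Module):
    """Encodes a point cloud into latent points, decodes them to 3D Gaussians."""

    def __init__(self, in_dim=6, latent_dim=32, hidden=128):
        super().__init__()
        # Per-point encoder: (xyz + rgb) -> mean / log-variance of a latent feature.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),
        )
        # Per-point decoder: latent feature -> Gaussian primitive parameters:
        # position offset (3) + scale (3) + rotation quaternion (4)
        # + opacity (1) + RGB color (3) = 14 values per Gaussian.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 14),
        )

    def forward(self, points):                                   # points: (B, N, 6)
        xyz = points[..., :3]
        mu, logvar = self.encoder(points).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # reparameterize
        out = self.decoder(z)
        gaussians = {
            "position": xyz + 0.05 * torch.tanh(out[..., 0:3]),  # small offset
            "scale": out[..., 3:6].exp(),                        # positive scales
            "rotation": nn.functional.normalize(out[..., 6:10], dim=-1),
            "opacity": out[..., 10:11].sigmoid(),
            "color": out[..., 11:14].sigmoid(),
        }
        return gaussians, mu, logvar   # (mu, logvar) feed the usual KL term
```

In a full system the decoded Gaussians would be rendered with a differentiable splatting rasterizer and supervised with image-space and KL losses; a companion sketch of flow-matching generation in the point latent space follows the related-papers list below.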
Related papers
- Beyond Pixel Histories: World Models with Persistent 3D State [50.4601060508243]
PERSIST is a new world-model paradigm that simulates the evolution of a latent 3D scene. We show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods.
arXiv Detail & Related papers (2026-03-03T19:58:31Z)
- WorldGrow: Generating Infinite 3D World [75.81531067447203]
We tackle the challenge of generating infinitely extendable 3D worlds -- large, continuous environments with coherent geometry and realistic appearance. We propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis. Our method features three core components: (1) a data curation pipeline that extracts high-quality scene blocks for training, making the 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context-aware scene extension; and (3) a coarse-to-fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity.
arXiv Detail & Related papers (2025-10-24T17:39:52Z)
- SPATIALGEN: Layout-guided 3D Indoor Scene Generation [37.30623176278608]
We present SpatialGen, a novel multi-view, multi-modal diffusion model that generates realistic and semantically consistent 3D indoor scenes. Given a 3D layout and a reference image, our model synthesizes appearance (color image), geometry (scene coordinate map), and semantics (semantic segmentation map) from arbitrary viewpoints. We are open-sourcing our data and models to empower the community and advance the field of indoor scene understanding and generation.
arXiv Detail & Related papers (2025-09-18T14:12:32Z)
- From 2D to 3D Cognition: A Brief Survey of General World Models [16.121071388463694]
3D-aware generative world models have demonstrated the ability to synthesize geometrically consistent, interactive 3D environments. Despite rapid progress, the field lacks a systematic analysis to categorize emerging techniques and clarify their roles in advancing 3D cognitive world models. This survey addresses that need by introducing a conceptual framework, providing a structured and forward-looking review of world models transitioning from 2D perception to 3D cognition.
arXiv Detail & Related papers (2025-06-25T05:05:09Z)
- Seeing World Dynamics in a Nutshell [132.79736435144403]
NutWorld is a framework that transforms monocular videos into dynamic 3D representations in a single forward pass. We demonstrate that NutWorld achieves high-fidelity video reconstruction quality while enabling real-time downstream applications.
arXiv Detail & Related papers (2025-02-05T18:59:52Z)
- GaussianAnything: Interactive Point Cloud Flow Matching For 3D Object Generation [75.39457097832113]
This paper introduces a novel 3D generation framework, offering scalable, high-quality 3D generation with an interactive point-cloud-structured latent space. The framework employs a variational autoencoder with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information. The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing point cloud, caption, and single-image inputs (the generic flow-matching recipe is sketched in code after this list).
arXiv Detail & Related papers (2024-11-12T18:59:32Z)
- 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation [51.64796781728106]
We propose a generative refinement network that synthesizes new content at higher quality by exploiting the natural image prior of 2D diffusion models and the global 3D information of the current scene.
Our approach supports a wide variety of scene generation tasks and arbitrary camera trajectories with improved visual quality and 3D consistency.
arXiv Detail & Related papers (2024-03-14T14:31:22Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Deep Generative Models on 3D Representations: A Survey [81.73385191402419]
Generative models aim to learn the distribution of observed data by generating new instances. Recently, researchers have started to shift focus from 2D to 3D space, although representing 3D data poses significantly greater challenges.
arXiv Detail & Related papers (2022-10-27T17:59:50Z)
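Terra's SPFlow and GaussianAnything (above) share a common recipe: generate in a point-structured latent space with flow matching, denoising point positions and features jointly. The sketch below illustrates that generic recipe under assumed interfaces; `velocity_net` stands in for any network mapping a noisy latent point set and a timestep to a velocity field, and none of the names or shapes come from either paper's code.

```python
# Hedged sketch of flow matching over point latents: positions (3) and
# features (feat_dim) are concatenated and denoised jointly.
# `velocity_net(x, t)` is an assumed interface returning a tensor shaped like x.
import torch

def flow_matching_loss(velocity_net, x1):
    """Linear-path flow-matching loss: regress v_theta(x_t, t) onto x1 - x0."""
    x0 = torch.randn_like(x1)                      # noise endpoint (t = 0)
    t = torch.rand(x1.shape[0], device=x1.device)  # random time in [0, 1]
    tt = t.view(-1, 1, 1)
    xt = (1.0 - tt) * x0 + tt * x1                 # point on the straight path
    return ((velocity_net(xt, t) - (x1 - x0)) ** 2).mean()

@torch.no_grad()
def sample_point_latents(velocity_net, num_points=2048, feat_dim=32, steps=50,
                         device="cpu"):
    """Euler integration of the learned flow from noise (t=0) to data (t=1)."""
    x = torch.randn(1, num_points, 3 + feat_dim, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt, device=device)
        x = x + velocity_net(x, t) * dt            # one Euler step
    return x[..., :3], x[..., 3:]                  # latent positions, features
```

The sampled positions and features would then be decoded into Gaussian primitives by the VAE decoder, so a single generation pass supports rendering from any viewpoint.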