FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction
- URL: http://arxiv.org/abs/2509.21657v1
- Date: Thu, 25 Sep 2025 22:24:23 GMT
- Title: FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction
- Authors: Yixiang Dai, Fan Jiang, Chiyu Wang, Mu Xu, Yonggang Qi
- Abstract summary: We present FantasyWorld, a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch. Our approach introduces cross-branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction. Experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency.
- Score: 13.098585993121722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality 3D world models are pivotal for embodied intelligence and Artificial General Intelligence (AGI), underpinning applications such as AR/VR content creation and robotic navigation. Despite their strong imaginative priors, current video foundation models lack explicit 3D grounding, which limits both their spatial consistency and their utility for downstream 3D reasoning tasks. In this work, we present FantasyWorld, a geometry-enhanced framework that augments frozen video foundation models with a trainable geometric branch, enabling joint modeling of video latents and an implicit 3D field in a single forward pass. Our approach introduces cross-branch supervision, where geometry cues guide video generation and video priors regularize 3D prediction, thus yielding consistent and generalizable 3D-aware video representations. Notably, the resulting latents from the geometric branch can potentially serve as versatile representations for downstream 3D tasks such as novel view synthesis and navigation, without requiring per-scene optimization or fine-tuning. Extensive experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency. Ablation studies further confirm that these gains stem from the unified backbone and cross-branch information exchange.
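As a rough illustration of the dual-branch design described above (a minimal sketch under assumptions, not the paper's released code: the module names, the additive cross-branch exchange, and the tensor shapes are all hypothetical):

```python
import torch
import torch.nn as nn

class GeometryEnhancedModel(nn.Module):
    """Hypothetical sketch: a frozen video backbone plus a trainable
    geometric branch, jointly producing video latents and implicit
    3D-field tokens in one forward pass."""

    def __init__(self, video_backbone: nn.Module, latent_dim: int = 768):
        super().__init__()
        self.backbone = video_backbone
        for p in self.backbone.parameters():  # keep the video prior frozen
            p.requires_grad = False
        # Trainable geometric branch: maps video latents to 3D-field tokens.
        self.geo_branch = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # Cross-branch projection: geometry cues fed back into video latents.
        self.geo_to_video = nn.Linear(latent_dim, latent_dim)

    def forward(self, frames: torch.Tensor):
        video_latents = self.backbone(frames)         # e.g. (B, T, N, C)
        geo_latents = self.geo_branch(video_latents)  # implicit 3D field tokens
        # Cross-branch exchange: geometry guides video generation.
        guided_video = video_latents + self.geo_to_video(geo_latents)
        return guided_video, geo_latents
```

Training would then pair a geometry loss on `geo_latents` with a generation loss on `guided_video`, so geometry cues guide the video branch while video priors regularize the 3D prediction, in the spirit of the cross-branch supervision the abstract describes.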
Related papers
- Beyond Pixel Histories: World Models with Persistent 3D State [50.4601060508243]
PERSIST is a new world-model paradigm that simulates the evolution of a latent 3D scene. We show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods.
arXiv Detail & Related papers (2026-03-03T19:58:31Z)
- Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video [76.32954467706581]
We propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. We use a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision. Experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks.
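For reference, the Chamfer Distance cited above is a standard point-set reconstruction metric; a direct (O(N·M)-memory) PyTorch implementation looks like this:

```python
import torch

def chamfer_distance(p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer Distance between point sets p1 (N, 3) and p2 (M, 3):
    mean nearest-neighbor squared distance, averaged in both directions."""
    d = torch.cdist(p1, p2) ** 2  # (N, M) pairwise squared distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```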
arXiv Detail & Related papers (2026-02-08T09:53:21Z)
- EA3D: Online Open-World 3D Object Extraction from Streaming Videos [55.48835711373918]
We present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding.
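Schematically, such an online loop might look like the sketch below, where `vlm_encoder`, `vision_encoder`, and the `scene` object are hypothetical stand-ins for the paper's components, not EA3D's actual interface:

```python
def extract_objects_online(video_stream, vlm_encoder, vision_encoder, scene):
    """Hypothetical online loop: each incoming frame is interpreted by
    foundation encoders, and a recurrent step jointly refines geometry
    and semantics in place (a sketch, not EA3D's released code)."""
    for frame in video_stream:
        semantics = vlm_encoder(frame)    # open-vocabulary object evidence
        features = vision_encoder(frame)  # 2D visual features
        # Recurrent joint optimization: update reconstruction and labels.
        scene.update(features, semantics)
    return scene.objects()
```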
arXiv Detail & Related papers (2025-10-29T03:56:41Z)
- UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding [57.86758122195093]
We introduce UniUGG, the first unified understanding and generation framework for 3D modalities. Our framework employs an LLM to comprehend and decode sentences and 3D representations. We propose a spatial decoder leveraging a latent diffusion model to generate high-quality 3D representations.
arXiv Detail & Related papers (2025-08-16T07:27:31Z)
- Geometry-aware 4D Video Generation for Robot Manipulation [28.709339959536106]
We propose a 4D video generation model that enforces multi-view 3D consistency of videos by supervising the model with cross-view pointmap alignment during training. This geometric supervision enables the model to learn a shared 3D representation of the scene, allowing it to predict future video sequences from novel viewpoints. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets.
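One plausible form of such cross-view pointmap supervision (a sketch; the paper's exact loss is not reproduced here, and pixel correspondences are assumed given) is to lift each view's predicted pointmap into a shared world frame and penalize disagreement:

```python
import torch
import torch.nn.functional as F

def pointmap_alignment_loss(pts_a, pts_b, pose_a, pose_b):
    """Cross-view pointmap alignment sketch: pts_a and pts_b are (N, 3)
    predictions of the *same* scene points in two camera frames; each
    4x4 pose maps camera coordinates to world coordinates. Both sets
    are lifted to the world frame and penalized for disagreeing."""
    def to_world(pts, pose):
        homo = torch.cat([pts, torch.ones_like(pts[:, :1])], dim=1)  # (N, 4)
        return (homo @ pose.T)[:, :3]
    return F.l1_loss(to_world(pts_a, pose_a), to_world(pts_b, pose_b))
```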
arXiv Detail & Related papers (2025-07-01T18:01:41Z)
- Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach [54.559847511280545]
We present a novel video generation framework that integrates 3-dimensional geometry and dynamic awareness. To achieve this, we augment 2D videos with 3D point trajectories and align them in pixel space. The resulting 3D-aware video dataset, PointVid, is then used to fine-tune a latent diffusion model.
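Aligning 3D point trajectories with a 2D video in pixel space typically means projecting each 3D point into the image with the camera model; a minimal pinhole-projection sketch (standard geometry assumed here, not PointVid's code):

```python
import torch

def project_points(points_3d: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Project camera-frame 3D points (N, 3) to pixel coordinates (N, 2)
    with a pinhole intrinsic matrix K (3, 3)."""
    uvw = points_3d @ K.T        # (N, 3) homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide by depth
```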
arXiv Detail & Related papers (2025-02-05T21:49:06Z)
- Diffusion Models in 3D Vision: A Survey [18.805222552728225]
3D vision has become a crucial field within computer vision, powering a range of applications such as autonomous driving, robotics, augmented reality, and medical imaging. We review the state-of-the-art methods that use diffusion models for 3D visual tasks, including but not limited to 3D object generation, shape completion, point-cloud reconstruction, and scene construction. We discuss potential solutions, including improving computational efficiency, enhancing multimodal fusion, and exploring the use of large-scale pretraining for better generalization across 3D tasks.
arXiv Detail & Related papers (2024-10-07T04:12:23Z)
- How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach [46.85336335756483]
Learned 3D Evaluation (L3DE) is a method for assessing AI-generated videos' ability to simulate the real world in terms of 3D visual quality and consistency. Confidence scores quantify the gap between real and synthetic videos in terms of 3D visual coherence. L3DE extends to broader applications: benchmarking video generation models, serving as a deepfake detector, and enhancing video synthesis by inpainting flagged inconsistencies.
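Conceptually, such a learned evaluator reduces to a classifier whose real-vs-synthetic confidence becomes the coherence score; a toy sketch with a hypothetical `coherence_net` standing in for L3DE's trained model:

```python
import torch

@torch.no_grad()
def coherence_score(video: torch.Tensor, coherence_net) -> float:
    """Toy version of learned 3D evaluation: a network trained to separate
    real from generated videos emits a logit, whose sigmoid serves as the
    score (coherence_net is a hypothetical stand-in, not L3DE's model)."""
    logit = coherence_net(video.unsqueeze(0))  # (1, 1) real-vs-synthetic logit
    return torch.sigmoid(logit).item()         # closer to 1 => more coherent
```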
arXiv Detail & Related papers (2024-06-27T23:03:58Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Large-Vocabulary 3D Diffusion Model with Transformer [57.076986347047]
We introduce a diffusion-based feed-forward framework for synthesizing massive categories of real-world 3D objects with a single generative model.
We propose DiffTF, a novel triplane-based 3D-aware Diffusion model with TransFormer, to address these challenges from three aspects.
Experiments on ShapeNet and OmniObject3D convincingly demonstrate that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation performance.
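The triplane representation named above is well documented in prior work: a 3D point is decoded by projecting it onto three axis-aligned feature planes and fusing the bilinearly sampled features. A minimal sketch of that lookup (an illustration of the general technique, not DiffTF's code):

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
    """Query a triplane (3, C, H, W) at points xyz (N, 3) in [-1, 1]^3 by
    bilinear sampling on the XY, XZ, and YZ planes and summing features."""
    coords = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]
    feats = []
    for plane, uv in zip(planes, coords):
        grid = uv.view(1, -1, 1, 2)                        # (1, N, 1, 2)
        sampled = F.grid_sample(plane.unsqueeze(0), grid,  # (1, C, N, 1)
                                align_corners=True)
        feats.append(sampled.squeeze(0).squeeze(-1).T)     # (N, C)
    return sum(feats)  # fused feature vector per query point
```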
arXiv Detail & Related papers (2023-09-14T17:59:53Z)
- 3D-Aware Video Generation [149.5230191060692]
We explore 4D generative adversarial networks (GANs) that learn to generate 3D-aware videos.
By combining neural implicit representations with a time-aware discriminator, we develop a GAN framework that synthesizes 3D video supervised only with monocular videos.
arXiv Detail & Related papers (2022-06-29T17:56:03Z)