Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation
- URL: http://arxiv.org/abs/2506.04225v1
- Date: Wed, 04 Jun 2025 17:59:04 GMT
- Title: Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation
- Authors: Tianyu Huang, Wangguandong Zheng, Tengfei Wang, Yuhao Liu, Zhenwei Wang, Junta Wu, Jie Jiang, Hui Li, Rynson W. H. Lau, Wangmeng Zuo, Chunchao Guo
- Abstract summary: We present Voyager, a novel video diffusion framework that generates world-consistent 3D point-cloud sequences from a single image. Unlike existing approaches, Voyager achieves end-to-end scene generation and reconstruction with inherent consistency across frames.
- Score: 66.95956271144982
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-world applications like video gaming and virtual reality often demand the ability to model 3D scenes that users can explore along custom camera trajectories. While significant progress has been made in generating 3D objects from text or images, creating long-range, 3D-consistent, explorable 3D scenes remains a complex and challenging problem. In this work, we present Voyager, a novel video diffusion framework that generates world-consistent 3D point-cloud sequences from a single image with a user-defined camera path. Unlike existing approaches, Voyager achieves end-to-end scene generation and reconstruction with inherent consistency across frames, eliminating the need for 3D reconstruction pipelines (e.g., structure-from-motion or multi-view stereo). Our method integrates three key components: 1) World-Consistent Video Diffusion: a unified architecture that jointly generates aligned RGB and depth video sequences, conditioned on existing world observations to ensure global coherence; 2) Long-Range World Exploration: an efficient world cache with point culling and auto-regressive inference with smooth video sampling, which iteratively extends the scene with context-aware consistency; and 3) Scalable Data Engine: a video reconstruction pipeline that automates camera pose estimation and metric depth prediction for arbitrary videos, enabling large-scale, diverse training-data curation without manual 3D annotations. Collectively, these designs yield a clear improvement over existing methods in visual quality and geometric accuracy, with versatile applications.
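The abstract's second component, a point-cloud world cache driving auto-regressive extension, is concrete enough to sketch. The Python below is a minimal illustration under stated assumptions, not Voyager's implementation: `WorldCache`, `render_cache`, and `generate_rgbd` are hypothetical names, and the voxel-based dedup is just one plausible reading of "point culling".

```python
# Minimal sketch (not Voyager's code) of an auto-regressive exploration
# loop with a point-cloud world cache, as described in the abstract.
# `model.render_cache` and `model.generate_rgbd` are hypothetical
# stand-ins for the world-consistent RGB-D diffusion model.
import numpy as np

class WorldCache:
    """Accumulated world-space point cloud with per-point color."""

    def __init__(self):
        self.points = np.empty((0, 3))  # XYZ in world coordinates
        self.colors = np.empty((0, 3))  # RGB in [0, 1]

    def add_frame(self, rgb, depth, K, cam_to_world):
        """Unproject one generated RGB-D frame into the cache."""
        h, w = depth.shape
        v, u = np.mgrid[0:h, 0:w]
        z = depth.reshape(-1)
        valid = z > 0
        x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
        y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
        pts_cam = np.stack([x, y, z], axis=-1)[valid]
        pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
        self.points = np.vstack([self.points, (cam_to_world @ pts_h.T).T[:, :3]])
        self.colors = np.vstack([self.colors, rgb.reshape(-1, 3)[valid]])

    def cull(self, voxel=0.05):
        """'Point culling' read as voxel dedup: keep one point per voxel."""
        keys = np.floor(self.points / voxel).astype(np.int64)
        _, keep = np.unique(keys, axis=0, return_index=True)
        self.points, self.colors = self.points[keep], self.colors[keep]

def explore(rgb0, depth0, K, pose0, camera_path, model, clip_len=16):
    """Extend the scene clip by clip along a user-defined camera path."""
    cache = WorldCache()
    cache.add_frame(rgb0, depth0, K, pose0)
    frames = []
    for start in range(0, len(camera_path), clip_len):
        poses = camera_path[start:start + clip_len]
        # Project the cache into the upcoming views; the partial renders
        # condition the diffusion model, which fills in unseen regions.
        cond = [model.render_cache(cache, K, p) for p in poses]
        clip = model.generate_rgbd(cond, poses)  # aligned RGB + depth frames
        for (rgb, depth), pose in zip(clip, poses):
            cache.add_frame(rgb, depth, K, pose)  # write new geometry back
        cache.cull()  # bound memory so the loop scales to long trajectories
        frames.extend(clip)
    return frames, cache
```

The write-back step is what makes the loop context-aware: each new clip is conditioned on all geometry generated so far rather than on the previous frames alone, which is how such a design can avoid a separate structure-from-motion or multi-view-stereo stage.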
Related papers
- WorldExplorer: Towards Generating Fully Navigable 3D Scenes [49.21733308718443]
WorldExplorer builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. We generate multiple videos along short, pre-defined trajectories that explore the scene in depth. Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results.
arXiv Detail & Related papers (2025-06-02T15:41:31Z)
- SpatialCrafter: Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations [44.53106180688135]
This work takes on the challenge of reconstructing 3D scenes from sparse or single-view inputs. We introduce SpatialCrafter, a framework that leverages the rich knowledge in video diffusion models to generate plausible additional observations. Through a trainable camera encoder and an epipolar attention mechanism for explicit geometric constraints, we achieve precise camera control and 3D consistency.
arXiv Detail & Related papers (2025-05-17T13:05:13Z)
- WonderVerse: Extendable 3D Scene Generation with Video Generative Models [28.002645364066005]
We introduce WonderVerse, a framework for generating extendable 3D scenes. WonderVerse leverages the powerful world-level priors embedded within video generative foundation models. It is compatible with various 3D reconstruction methods, allowing both efficient and high-quality generation.
arXiv Detail & Related papers (2025-03-12T08:44:51Z)
- Wonderland: Navigating 3D Scenes from a Single Image [43.99037613068823]
We introduce a large-scale reconstruction model that leverages latents from a video diffusion model to predict 3D Gaussian Splattings of scenes in a feed-forward manner. We train the 3D reconstruction model to operate on the video latent space with a progressive learning strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes.
arXiv Detail & Related papers (2024-12-16T18:58:17Z)
- Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos [76.07894127235058]
We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. We use this method to generate large-scale data in the form of world-consistent, pseudo-metric 3D point clouds. We demonstrate the utility of this data by training a variant of DUSt3R to predict structure and 3D motion from real-world image pairs.
arXiv Detail & Related papers (2024-12-12T18:59:54Z)
- Zero-Shot Multi-Object Scene Completion [59.325611678171974]
We present a 3D scene completion method that recovers the complete geometry of multiple unseen objects in complex scenes from a single RGB-D image.
Our method outperforms the current state-of-the-art on both synthetic and real-world datasets.
arXiv Detail & Related papers (2024-03-21T17:59:59Z)
- Persistent Nature: A Generative Model of Unbounded 3D Worlds [74.51149070418002]
We present an extendable, planar scene layout grid that can be rendered from arbitrary camera poses via a 3D decoder and volume rendering.
Based on this representation, we learn a generative world model solely from single-view internet photos.
Our approach enables scene extrapolation beyond the fixed bounds of current 3D generative models, while also supporting a persistent, camera-independent world representation.
arXiv Detail & Related papers (2023-03-23T17:59:40Z)
- Video Autoencoder: self-supervised disentanglement of static 3D structure and motion [60.58836145375273]
A video autoencoder is proposed for learning disentangled representations of 3D structure and camera pose from videos.
The representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following.
arXiv Detail & Related papers (2021-10-06T17:57:42Z)