Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context
- URL: http://arxiv.org/abs/2602.21929v1
- Date: Wed, 25 Feb 2026 14:09:03 GMT
- Title: Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context
- Authors: JiaKui Hu, Jialun Liu, Liying Yang, Xinliang Zhang, Kaiwen Li, Shuang Zeng, Yuanwei Li, Haibin Huang, Chi Zhang, Yanye Lu
- Abstract summary: Scene-consistent video generation aims to create videos that explore 3D scenes based on a camera trajectory. Previous methods rely on video generation models with external memory for consistency. We introduce "geometry-as-context" to overcome these limitations.
- Score: 33.99324999592141
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene-consistent video generation aims to create videos that explore 3D scenes based on a camera trajectory. Previous methods rely on video generation models with external memory for consistency, or iterative 3D reconstruction and inpainting, which accumulate errors during inference due to incorrect intermediary outputs, non-differentiable processes, and separate models. To overcome these limitations, we introduce "geometry-as-context". It iteratively completes the following steps using an autoregressive camera-controlled video generation model: (1) estimates the geometry of the current view necessary for 3D reconstruction, and (2) simulates and restores novel view images rendered by the 3D scene. Under this multi-task framework, we develop the camera gated attention module to enhance the model's capability to effectively leverage camera poses. During the training phase, text contexts are utilized to ascertain whether geometric or RGB images should be generated. To ensure that the model can generate RGB-only outputs during inference, the geometry context is randomly dropped from the interleaved text-image-geometry training sequence. The method has been tested on scene video generation with one-direction and forth-and-back trajectories. The results show its superiority over previous approaches in maintaining scene consistency and camera control.
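The random dropping of geometry context described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the token-type markers, per-view drop decision, and function names below are illustrative assumptions about how an interleaved text-image-geometry sequence might be assembled during training.

```python
import random

# Hypothetical token-type markers for an interleaved training sequence;
# the paper's actual tokenization scheme is not specified in the abstract.
TEXT, IMAGE, GEOMETRY = "text", "image", "geometry"

def build_training_sequence(num_views, drop_prob=0.5, rng=None):
    """Sketch: interleave text, geometry, and RGB entries per view,
    independently dropping each view's geometry context with
    probability `drop_prob` so the model also learns to produce
    RGB-only outputs at inference time."""
    rng = rng or random.Random()
    seq = []
    for view in range(num_views):
        seq.append((TEXT, view))       # text context: which modality to generate
        if rng.random() >= drop_prob:  # keep geometry context for this view
            seq.append((GEOMETRY, view))
        seq.append((IMAGE, view))      # RGB frame for the current view
    return seq

# drop_prob=1.0 yields RGB-only sequences; drop_prob=0.0 keeps
# geometry context before every frame.
rgb_only = build_training_sequence(3, drop_prob=1.0)
full = build_training_sequence(3, drop_prob=0.0)
```

Dropping per view (rather than per sequence) would expose the model to mixed sequences where some frames have geometry context and others do not, matching the interleaved structure the abstract describes.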
Related papers
- Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians [7.051077403685518]
Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation.
arXiv Detail & Related papers (2026-01-02T13:04:47Z) - LSD-3D: Large-Scale 3D Driving Scene Generation with Geometry Grounding [34.74478301165912]
We present a method that directly generates large-scale 3D driving scenes with accurate geometry. The proposed method combines the generation of a proxy geometry and environment representation with score distillation from learned 2D image priors. We find that this approach allows for high controllability, enabling prompt-guided geometry and high-fidelity texture and structure.
arXiv Detail & Related papers (2025-08-26T17:04:49Z) - Towards Geometric and Textural Consistency 3D Scene Generation via Single Image-guided Model Generation and Layout Optimization [14.673302810271219]
We propose a novel three-stage framework for 3D scene generation with explicit geometric representations and high-quality textural details. Our approach not only outperforms state-of-the-art methods in terms of geometric accuracy and texture fidelity of individual generated 3D models, but also has significant advantages in scene layout synthesis.
arXiv Detail & Related papers (2025-07-20T06:59:42Z) - GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering [54.489285024494855]
Video stabilization is pivotal for video processing, as it removes unwanted shakiness while preserving the original user motion intent. Existing approaches, depending on the domain they operate, suffer from several issues that degrade the user experience. We introduce GaVS, a novel 3D-grounded approach that reformulates video stabilization as a 'temporally-consistent local reconstruction and rendering' paradigm.
arXiv Detail & Related papers (2025-06-30T15:24:27Z) - CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image [44.8172828045897]
Current methods often struggle with domain-specific limitations or low-quality object generation. We propose CAST, a novel method for 3D scene reconstruction and recovery.
arXiv Detail & Related papers (2025-02-18T14:29:52Z) - FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views [100.45129752375658]
We present FLARE, a feed-forward model designed to infer high-quality camera poses and 3D geometry from uncalibrated sparse-view images. Our solution features a cascaded learning paradigm with camera pose serving as the critical bridge, recognizing its essential role in mapping 3D structures onto 2D image planes.
arXiv Detail & Related papers (2025-02-17T18:54:05Z) - Generating 3D-Consistent Videos from Unposed Internet Photos [68.944029293283]
We train a scalable, 3D-aware video model without any 3D annotations such as camera parameters.
Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
arXiv Detail & Related papers (2024-11-20T18:58:31Z) - Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting [75.7154104065613]
We introduce a novel depth completion model, trained via teacher distillation and self-training to learn the 3D fusion process.
We also introduce a new benchmarking scheme for scene generation methods that is based on ground truth geometry.
arXiv Detail & Related papers (2024-04-30T17:59:40Z) - FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models [67.96827539201071]
We propose a novel test-time optimization approach for 3D scene reconstruction.
Our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
arXiv Detail & Related papers (2023-08-10T17:55:02Z) - Online Adaptation for Consistent Mesh Reconstruction in the Wild [147.22708151409765]
We pose video-based reconstruction as a self-supervised online adaptation problem applied to any incoming test video.
We demonstrate that our algorithm recovers temporally consistent and reliable 3D structures from videos of non-rigid objects including those of animals captured in the wild.
arXiv Detail & Related papers (2020-12-06T07:22:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.