4D Driving Scene Generation With Stereo Forcing
- URL: http://arxiv.org/abs/2509.20251v1
- Date: Wed, 24 Sep 2025 15:37:17 GMT
- Title: 4D Driving Scene Generation With Stereo Forcing
- Authors: Hao Lu, Zhuang Ma, Guangfeng Jiang, Wenhang Ge, Bohan Li, Yuzhan Cai, Wenzhao Zheng, Yunpeng Zhang, Yingcong Chen
- Abstract summary: Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency.
- Score: 62.47705572424127
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. Bridging generation and novel view synthesis remains a major challenge. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency. Given multi-view image sequences and camera parameters, PhiGenesis produces temporally continuous 4D Gaussian splatting representations along target 3D trajectories. In its first stage, PhiGenesis leverages a pre-trained video VAE with a novel range-view adapter to enable feed-forward 4D reconstruction from multi-view images. This architecture supports single-frame or video inputs and outputs complete 4D scenes including geometry, semantics, and motion. In the second stage, PhiGenesis introduces a geometric-guided video diffusion model, using rendered historical 4D scenes as priors to generate future views conditioned on trajectories. To address geometric exposure bias in novel views, we propose Stereo Forcing, a novel conditioning strategy that integrates geometric uncertainty during denoising. This method enhances temporal coherence by dynamically adjusting generative influence based on uncertainty-aware perturbations. Our experiments demonstrate that our method achieves state-of-the-art performance in appearance and geometric reconstruction, temporal generation, and novel view synthesis (NVS), while delivering competitive performance in downstream evaluations. Homepage: https://jiangxb98.github.io/PhiGensis
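As described, Stereo Forcing modulates how strongly the rendered geometric prior constrains each denoising step according to its uncertainty. The sketch below illustrates that conditioning pattern; it is a minimal reading of the abstract, not the authors' implementation, and `denoiser`, the latent shapes, and the exact blending rule are all illustrative assumptions.

```python
import torch

def stereo_forcing_step(denoiser, x_t, t, rendered_prior, uncertainty):
    """One denoising step with uncertainty-aware geometric conditioning (sketch).

    Assumptions (not from the paper):
      - denoiser(x, t, cond) predicts the clean latent x0 from noisy x at step t,
      - rendered_prior is the historical 4D scene rendered at the target view,
        encoded into the same latent space as x_t,
      - uncertainty in [0, 1] is a per-pixel geometric uncertainty map
        (1 = unreliable geometry, e.g. disocclusions in the novel view).
    """
    # Perturb the rendered prior in proportion to its geometric uncertainty,
    # so unreliable regions do not over-constrain generation.
    noise = torch.randn_like(rendered_prior)
    cond = rendered_prior * (1.0 - uncertainty) + noise * uncertainty

    # Predict the clean latent, conditioned on the perturbed geometric prior.
    x0_pred = denoiser(x_t, t, cond)

    # Where geometry is reliable, pull the prediction toward the rendered
    # history; where it is uncertain, trust the generative model instead.
    return uncertainty * x0_pred + (1.0 - uncertainty) * rendered_prior
```

Under this reading, fully uncertain regions (uncertainty near 1) fall back to unconstrained generation, while confident regions stay pinned to the rendered history, which is one way to realize "dynamically adjusting generative influence based on uncertainty-aware perturbations."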
Related papers
- Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation [21.075786141331974]
We present Track4DGen, a framework for generating dynamic 4D objects from sparse inputs. In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator. In Stage Two, we reconstruct a dynamic 4D-GS using a hybrid motion encoding.
arXiv Detail & Related papers (2025-12-05T21:13:04Z) - Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image [88.71287865590273]
We introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories. We propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D trajectories. We then propose a 4D View Synthesis Module (4D-Vi) to render videos with arbitrary camera trajectories from 4D point track representations.
arXiv Detail & Related papers (2025-12-04T17:59:10Z) - Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models [79.06910348413861]
We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion.
arXiv Detail & Related papers (2025-11-01T11:16:25Z) - ShapeGen4D: Towards High Quality 4D Shape Generation from Videos [85.45517487721257]
We introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from the video. Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization.
arXiv Detail & Related papers (2025-10-07T17:58:11Z) - Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation [3.1852855132066673]
Current approaches often struggle to maintain view consistency while handling complex scene dynamics. This framework is the first to leverage both the rich temporal priors of video diffusion models and the geometric awareness of reconstruction models. It significantly facilitates 4D generation and achieves higher quality (e.g., mPSNR, mSSIM) than existing methods.
arXiv Detail & Related papers (2025-08-11T08:55:47Z) - Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models [83.76517697509156]
This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. We propose a novel iterative sliding denoising process to enhance view-temporal consistency of the 4D diffusion model. Our method is able to synthesize high-quality and consistent novel-view videos and significantly outperforms the existing approaches.
arXiv Detail & Related papers (2025-07-17T17:59:17Z) - BulletGen: Improving 4D Reconstruction with Bullet-Time Generation [15.225127596594582]
We introduce BulletGen, an approach that takes advantage of generative models to correct errors and complete missing information in a dynamic scene representation. Our method seamlessly blends generative content with both static and dynamic scene components, achieving state-of-the-art results on both novel-view synthesis and 2D/3D tracking tasks.
arXiv Detail & Related papers (2025-06-23T13:03:42Z) - 4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation [66.20991603309054]
We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training.
arXiv Detail & Related papers (2025-06-18T23:44:59Z) - Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency [49.875459658889355]
Free4D is a tuning-free framework for 4D scene generation from a single image. Our key insight is to distill pre-trained foundation models for consistent 4D scene representation. The resulting 4D representation enables real-time, controllable rendering.
arXiv Detail & Related papers (2025-03-26T17:59:44Z) - DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation [50.01520547454224]
Current generative models struggle to synthesize 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS). We propose DiST-4D, which disentangles the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.
arXiv Detail & Related papers (2025-03-19T13:49:48Z) - 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives [115.67081491747943]
Dynamic 3D scene representation and novel view synthesis are crucial for enabling AR/VR and metaverse applications. We reformulate the reconstruction of a time-varying 3D scene as approximating its underlying 4D volume. We derive several compact variants that effectively reduce the memory footprint to address its storage bottleneck.
arXiv Detail & Related papers (2024-12-30T05:30:26Z) - Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels [35.27805034331218]
We present Vidu4D, a novel reconstruction model that excels in accurately reconstructing 4D representations from single generated videos.
At the core of Vidu4D is our proposed Dynamic Gaussian Surfels (DGS) technique.
arXiv Detail & Related papers (2024-05-27T04:43:44Z)