AutoScape: Geometry-Consistent Long-Horizon Scene Generation
- URL: http://arxiv.org/abs/2510.20726v1
- Date: Thu, 23 Oct 2025 16:44:34 GMT
- Title: AutoScape: Geometry-Consistent Long-Horizon Scene Generation
- Authors: Jiacheng Chen, Ziyu Jiang, Mingfu Liang, Bingbing Zhuang, Jong-Chyi Su, Sparsh Garg, Ying Wu, Manmohan Chandraker
- Abstract summary: AutoScape is a long-horizon driving scene generation framework. It generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state of the art by 48.6% and 43.0%, respectively.
- Score: 69.2451355181344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, which serve as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state of the art by 48.6% and 43.0%, respectively.
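The geometric anchoring described in the abstract rests on a standard operation: warping an existing RGB-D keyframe into a new camera view so the generator can be conditioned on (and checked against) previously committed geometry. The following is a minimal numpy sketch of that warp under a pinhole camera model; the function name, signature, and splatting strategy are illustrative assumptions, not the AutoScape implementation.

```python
import numpy as np

def warp_keyframe(rgb, depth, K, T_rel):
    """Forward-warp an RGB-D keyframe into a new camera view.

    rgb:   (H, W, 3) source image
    depth: (H, W) per-pixel depth in the source camera
    K:     (3, 3) pinhole intrinsics
    T_rel: (4, 4) rigid transform mapping source-camera points
           into the target camera frame
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)
    pix = pix.reshape(-1, 3).astype(np.float64)

    # Unproject: X = depth * K^{-1} [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    pts = rays * depth.reshape(-1, 1)

    # Move points into the target camera frame
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    pts_tgt = (pts_h @ T_rel.T)[:, :3]

    # Reproject and keep points in front of the camera
    proj = pts_tgt @ K.T
    z = proj[:, 2]
    valid = z > 1e-6
    u = np.round(proj[valid, 0] / z[valid]).astype(int)
    v = np.round(proj[valid, 1] / z[valid]).astype(int)
    inb = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v = u[inb], v[inb]
    colors = rgb.reshape(-1, 3)[valid][inb]
    zs = z[valid][inb]

    # Nearest-pixel splat, far-to-near so near points overwrite (z-buffering)
    out = np.zeros_like(rgb)
    for i in np.argsort(-zs):
        out[v[i], u[i]] = colors[i]
    return out
```

In a pipeline like the one described, the warped image (and the holes it leaves) would serve as the rendered-geometry condition for the diffusion model, and the discrepancy between a generated keyframe and the warp of its predecessor is the natural quantity for a warp-consistency guidance term to penalize during sampling.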
Related papers
- Geometry-Aware Rotary Position Embedding for Consistent Video World Model [48.914346802616414]
ViewRope is a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. Geometry-Aware Frame-Sparse Attention exploits these geometric cues to selectively attend to relevant historical frames. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational costs.
arXiv Detail & Related papers (2026-02-08T08:01:16Z) - GeoVideo: Introducing Geometric Regularization into Video Generation Model [46.38507581500745]
We introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction. Our method bridges the gap between appearance generation and 3D structure modeling, leading to improved structural coherence, temporal shape consistency, and physical plausibility.
arXiv Detail & Related papers (2025-12-03T05:11:57Z) - DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion [62.589889759543446]
DriveGen3D is a novel framework for generating high-quality, highly controllable dynamic 3D driving scenes. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction.
arXiv Detail & Related papers (2025-10-17T03:00:08Z) - 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation [55.29423122177883]
3DScenePrompt is a framework that generates the next video chunk from arbitrary-length input while enabling camera control and preserving scene consistency. Our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality.
arXiv Detail & Related papers (2025-10-16T17:55:25Z) - ShapeGen4D: Towards High Quality 4D Shape Generation from Videos [85.45517487721257]
We introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from video. Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization.
arXiv Detail & Related papers (2025-10-07T17:58:11Z) - 4D Driving Scene Generation With Stereo Forcing [62.47705572424127]
Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency.
arXiv Detail & Related papers (2025-09-24T15:37:17Z) - MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion [45.67028461223564]
Mesh models have become increasingly accessible for numerous cities; however, the lack of realistic textures restricts their application in virtual urban navigation and autonomous driving. This paper proposes MeSS (Mesh-based Scene Synthesis) for generating high-quality, style-consistent outdoor scenes with city mesh models serving as the geometric prior.
arXiv Detail & Related papers (2025-08-21T02:16:15Z) - D-FCGS: Feedforward Compression of Dynamic Gaussian Splatting for Free-Viewpoint Videos [12.24209693552492]
Free-viewpoint video (FVV) enables immersive 3D experiences, but efficient compression of dynamic 3D representations remains a major challenge. This paper presents Feedforward Compression of Dynamic Gaussian Splatting (D-FCGS), a novel feedforward framework for compressing temporally correlated Gaussian point cloud sequences. Experiments show that it matches the rate-distortion performance of optimization-based methods, achieving over 40 times compression in under 2 seconds.
arXiv Detail & Related papers (2025-07-08T10:39:32Z) - Can Video Diffusion Model Reconstruct 4D Geometry? [66.5454886982702]
Sora3R is a novel framework that taps into the rich spatiotemporal priors of large dynamic video diffusion models to infer 4D pointmaps from casual videos. Experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction.
arXiv Detail & Related papers (2025-03-27T01:44:46Z) - DoubleTake: Geometry Guided Depth Estimation [17.464549832122714]
Estimating depth from a sequence of posed RGB images is a fundamental computer vision task.
We introduce a reconstruction that combines volume features with a hint of the prior geometry, rendered as a depth map from the current camera location.
We demonstrate that our method runs at interactive speeds while producing state-of-the-art estimates of depth and 3D scene reconstruction in both offline and incremental evaluation scenarios.
arXiv Detail & Related papers (2024-06-26T14:29:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.