AutoScape: Geometry-Consistent Long-Horizon Scene Generation
- URL: http://arxiv.org/abs/2510.20726v1
- Date: Thu, 23 Oct 2025 16:44:34 GMT
- Title: AutoScape: Geometry-Consistent Long-Horizon Scene Generation
- Authors: Jiacheng Chen, Ziyu Jiang, Mingfu Liang, Bingbing Zhuang, Jong-Chyi Su, Sparsh Garg, Ying Wu, Manmohan Chandraker
- Abstract summary: AutoScape is a long-horizon driving scene generation framework. It generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state of the art by 48.6% and 43.0%, respectively.
- Score: 69.2451355181344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, which serve as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state of the art by 48.6% and 43.0%, respectively.
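The geometric anchoring described in the abstract rests on a standard operation: warping an existing RGB-D keyframe into a new camera view so the generator can be conditioned on (and checked against) previously committed geometry. The following is a minimal numpy sketch of that warp under a pinhole camera model; the function name, signature, and splatting strategy are illustrative assumptions, not the AutoScape implementation.

```python
import numpy as np

def warp_keyframe(rgb, depth, K, T_rel):
    """Forward-warp an RGB-D keyframe into a new camera view.

    rgb:   (H, W, 3) source image
    depth: (H, W) per-pixel depth in the source camera
    K:     (3, 3) pinhole intrinsics
    T_rel: (4, 4) rigid transform mapping source-camera points
           into the target camera frame
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)
    pix = pix.reshape(-1, 3).astype(np.float64)

    # Unproject: X = depth * K^{-1} [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    pts = rays * depth.reshape(-1, 1)

    # Move points into the target camera frame
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    pts_tgt = (pts_h @ T_rel.T)[:, :3]

    # Reproject and keep points in front of the camera
    proj = pts_tgt @ K.T
    z = proj[:, 2]
    valid = z > 1e-6
    u = np.round(proj[valid, 0] / z[valid]).astype(int)
    v = np.round(proj[valid, 1] / z[valid]).astype(int)
    inb = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v = u[inb], v[inb]
    colors = rgb.reshape(-1, 3)[valid][inb]
    zs = z[valid][inb]

    # Nearest-pixel splat, far-to-near so near points overwrite (z-buffering)
    out = np.zeros_like(rgb)
    for i in np.argsort(-zs):
        out[v[i], u[i]] = colors[i]
    return out
```

In a pipeline like the one described, the warped image (and the holes it leaves) would serve as the rendered-geometry condition for the diffusion model, and the discrepancy between a generated keyframe and the warp of its predecessor is the natural quantity for a warp-consistency guidance term to penalize during sampling.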
Related papers
- Geometry-Aware Rotary Position Embedding for Consistent Video World Model [48.914346802616414]
ViewRope is a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. Geometry-Aware Frame-Sparse Attention exploits these geometric cues to selectively attend to relevant historical frames. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational costs.
arXiv Detail & Related papers (2026-02-08T08:01:16Z) - GeoVideo: Introducing Geometric Regularization into Video Generation Model [46.38507581500745]
We introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction. Our method bridges the gap between appearance generation and 3D structure modeling, leading to improved structural coherence, temporal shape consistency, and physical plausibility.
arXiv Detail & Related papers (2025-12-03T05:11:57Z) - DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion [62.589889759543446]
DriveGen3D is a novel framework for generating high-quality, highly controllable dynamic 3D driving scenes. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction.
arXiv Detail & Related papers (2025-10-17T03:00:08Z) - 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation [55.29423122177883]
3DScenePrompt is a framework that generates the next video chunk from arbitrary-length input while enabling camera control and preserving scene consistency. Our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality.
arXiv Detail & Related papers (2025-10-16T17:55:25Z) - ShapeGen4D: Towards High Quality 4D Shape Generation from Videos [85.45517487721257]
We introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from video. Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization.
arXiv Detail & Related papers (2025-10-07T17:58:11Z) - 4D Driving Scene Generation With Stereo Forcing [62.47705572424127]
Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency.
arXiv Detail & Related papers (2025-09-24T15:37:17Z) - MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion [45.67028461223564]
Mesh models have become increasingly accessible for numerous cities; however, the lack of realistic textures restricts their application in virtual urban navigation and autonomous driving. This paper proposes MeSS (Mesh-based Scene Synthesis) for generating high-quality, style-consistent outdoor scenes with city mesh models serving as the geometric prior.
arXiv Detail & Related papers (2025-08-21T02:16:15Z) - D-FCGS: Feedforward Compression of Dynamic Gaussian Splatting for Free-Viewpoint Videos [12.24209693552492]
Free-viewpoint video (FVV) enables immersive 3D experiences, but efficient compression of dynamic 3D representations remains a major challenge. This paper presents Feedforward Compression of Dynamic Gaussian Splatting (D-FCGS), a novel feedforward framework for compressing temporally correlated Gaussian point cloud sequences. Experiments show that it matches the rate-distortion performance of optimization-based methods, achieving over 40 times compression in under 2 seconds.
arXiv Detail & Related papers (2025-07-08T10:39:32Z) - Can Video Diffusion Model Reconstruct 4D Geometry? [66.5454886982702]
Sora3R is a novel framework that taps into the rich spatiotemporal priors of large dynamic video diffusion models to infer 4D pointmaps from casual videos. Experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction.
arXiv Detail & Related papers (2025-03-27T01:44:46Z) - DoubleTake: Geometry Guided Depth Estimation [17.464549832122714]
Estimating depth from a sequence of posed RGB images is a fundamental computer vision task.
We introduce a reconstruction that combines volume features with a hint of the prior geometry, rendered as a depth map from the current camera location.
We demonstrate that our method runs at interactive speeds while producing state-of-the-art estimates of depth and 3D scene reconstruction in both offline and incremental evaluation scenarios.
arXiv Detail & Related papers (2024-06-26T14:29:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.