Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion
- URL: http://arxiv.org/abs/2407.13759v2
- Date: Thu, 25 Jul 2024 17:47:20 GMT
- Title: Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion
- Authors: Boyang Deng, Richard Tucker, Zhengqi Li, Leonidas Guibas, Noah Snavely, Gordon Wetzstein
- Abstract summary: We present a method for generating Streetscapes, long sequences of views through an on-the-fly synthesized city-scale scene.
Our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency.
- Score: 61.929653153389964
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present a method for generating Streetscapes, long sequences of views through an on-the-fly synthesized city-scale scene. Our generation is conditioned by language input (e.g., city name, weather), as well as an underlying map/layout hosting the desired trajectory. Compared to recent models for video generation or 3D view synthesis, our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency. To achieve this goal, we build on recent work on video diffusion, used within an autoregressive framework that can easily scale to long sequences. In particular, we introduce a new temporal imputation method that prevents our autoregressive approach from drifting from the distribution of realistic city imagery. We train our Streetscapes system on a compelling source of data: posed imagery from Google Street View, along with contextual map data, which allows users to generate city views conditioned on any desired city layout, with controllable camera poses. Please see more results at our project page at https://boyangdeng.com/streetscapes.
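The abstract describes an autoregressive use of video diffusion in which each new chunk of frames is sampled while previously committed frames are held fixed during denoising (the temporal imputation idea). The sketch below illustrates only that control flow; the chunk size, overlap, and `denoise_chunk` stand-in are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

FRAMES_PER_CHUNK = 8   # frames produced per diffusion call (assumed)
OVERLAP = 2            # frames shared with the previous chunk (assumed)
H, W, C = 64, 64, 3    # toy resolution

def denoise_chunk(noisy, known, known_mask, steps=10):
    """Hypothetical sampler: iteratively denoises a chunk while re-imposing
    (imputing) the already-generated overlap frames at every step."""
    x = noisy
    for _ in range(steps):
        x = 0.9 * x  # placeholder for one reverse-diffusion update
        x = np.where(known_mask, known, x)  # temporal imputation
    return x

def generate_streetscape(num_frames, rng=np.random.default_rng(0)):
    frames, prev_tail = [], None
    while len(frames) < num_frames:
        noisy = rng.normal(size=(FRAMES_PER_CHUNK, H, W, C))
        known = np.zeros_like(noisy)
        mask = np.zeros_like(noisy, dtype=bool)
        if prev_tail is not None:
            known[:OVERLAP] = prev_tail  # condition on the previous chunk's tail
            mask[:OVERLAP] = True
        chunk = denoise_chunk(noisy, known, mask)
        frames.extend(chunk if prev_tail is None else chunk[OVERLAP:])
        prev_tail = chunk[-OVERLAP:]  # carry context into the next chunk
    return np.stack(frames[:num_frames])

video = generate_streetscape(32)
print(video.shape)  # (32, 64, 64, 3)
```

The point of the imputation step is that the already-generated overlap frames are re-imposed at every denoising iteration, which is what keeps consecutive chunks from drifting apart over long trajectories.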
Related papers
- DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes [11.761871622954214]
A diffusion-based autoregressive video generation model designed for long-term generation of 3D-controllable video.
DreamForge supports flexible conditions such as text descriptions, camera poses, 3D bounding boxes, and road layouts.
We ensure inter-view consistency through cross-view attention and temporal coherence via an autoregressive architecture enhanced with motion cues.
arXiv Detail & Related papers (2024-09-06T03:09:58Z)
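The DreamForge summary above attributes inter-view consistency to cross-view attention. Below is a minimal sketch of plain scaled dot-product attention applied jointly across camera views; the learned query/key/value projections and the model's actual module structure are omitted, and all shapes and names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(feats):
    """feats: (views, tokens, dim). Every token attends over the tokens of all
    views, one simple way to share information across cameras."""
    n_views, n_tokens, dim = feats.shape
    q = feats.reshape(n_views * n_tokens, dim)   # queries from every view
    kv = q                                       # keys/values pooled across views
    attn = softmax(q @ kv.T / np.sqrt(dim))      # (all tokens, all tokens)
    return (attn @ kv).reshape(n_views, n_tokens, dim)

views = np.random.default_rng(0).normal(size=(6, 16, 32))  # 6 surround cameras
print(cross_view_attention(views).shape)  # (6, 16, 32)
```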
- Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion [77.34078223594686]
We propose a novel architecture for direct 3D scene generation by introducing diffusion models into 3D sparse representations and combining them with neural rendering techniques.
Specifically, our approach generates texture colors at the point level for a given geometry using a 3D diffusion model first, which is then transformed into a scene representation in a feed-forward manner.
Experiments in two city-scale datasets show that our model demonstrates proficiency in generating photo-realistic street-view image sequences and cross-view urban scenes from satellite imagery.
arXiv Detail & Related papers (2024-01-19T16:15:37Z)
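The Sat2Scene summary above describes a two-stage pipeline: a 3D diffusion model assigns colors at the point level for a given geometry, and a feed-forward step turns the colored points into a renderable scene. The toy sketch below mirrors only that structure; the placeholder sampler and the orthographic splat stand in for the actual diffusion model and neural renderer.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffuse_point_colors(points, steps=10):
    """Stage 1 (sketched): sample a color per 3D point with a toy
    reverse-diffusion loop; the real model conditions on satellite imagery."""
    colors = rng.normal(size=(points.shape[0], 3))
    for _ in range(steps):
        colors = 0.9 * colors  # placeholder reverse-diffusion update
    return 1.0 / (1.0 + np.exp(-colors))  # squash into [0, 1]

def splat_to_image(points, colors, hw=(64, 64)):
    """Stage 2 (sketched): feed-forward conversion of the colored point set
    into an image; a trivial orthographic splat replaces neural rendering."""
    img = np.zeros((*hw, 3))
    xy = ((points[:, :2] + 1.0) / 2.0 * (np.array(hw) - 1)).astype(int)
    img[xy[:, 1], xy[:, 0]] = colors
    return img

geometry = rng.uniform(-1.0, 1.0, size=(5000, 3))  # given scene geometry
frame = splat_to_image(geometry, diffuse_point_colors(geometry))
print(frame.shape)  # (64, 64, 3)
```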
- ProSGNeRF: Progressive Dynamic Neural Scene Graph with Frequency Modulated Auto-Encoder in Urban Scenes [16.037300340326368]
Implicit neural representation has demonstrated promising results in view synthesis for large and complex scenes.
Existing approaches either fail to capture fast-moving objects or need to build the scene graph without camera ego-motion.
We aim to jointly solve the view synthesis problem of large-scale urban scenes and fast-moving vehicles.
arXiv Detail & Related papers (2023-12-14T16:11:42Z)
- PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking [90.29143475328506]
We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework.
Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion.
We animate deformable characters using real-world motion capture data, we build 3D scenes to match the motion capture environments, and we render camera viewpoints using trajectories mined via structure-from-motion on real videos.
arXiv Detail & Related papers (2023-07-27T17:58:11Z)
- DynIBaR: Neural Dynamic Image-Based Rendering [79.44655794967741]
We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene.
We adopt a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views.
We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets.
arXiv Detail & Related papers (2022-11-20T20:57:02Z)
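The DynIBaR summary above describes synthesizing new viewpoints by aggregating features from nearby views. A heavily simplified sketch of that aggregation step is given below; the toy projection, the distance-based weights, and the camera format are assumptions, not the paper's formulation.

```python
import numpy as np

def aggregate_nearby_views(point, source_feats, source_cams):
    """A 3D sample point gathers features from nearby source views and blends
    them; projection and weighting are heavily simplified."""
    gathered, weights = [], []
    for feat_map, cam in zip(source_feats, source_cams):
        rel = point - cam["center"]  # toy pinhole projection into this view
        if rel[2] <= 0:
            continue  # point is behind this camera
        u = int(feat_map.shape[1] * (0.5 + 0.5 * rel[0] / rel[2]))
        v = int(feat_map.shape[0] * (0.5 + 0.5 * rel[1] / rel[2]))
        if 0 <= v < feat_map.shape[0] and 0 <= u < feat_map.shape[1]:
            gathered.append(feat_map[v, u])
            weights.append(1.0 / np.linalg.norm(rel))  # nearer views count more
    if not gathered:
        return np.zeros(source_feats[0].shape[-1])
    w = np.array(weights) / np.sum(weights)
    return (np.stack(gathered) * w[:, None]).sum(axis=0)

rng = np.random.default_rng(0)
feats = [rng.normal(size=(32, 32, 8)) for _ in range(4)]   # 4 nearby views
cams = [{"center": np.array([i * 0.1, 0.0, -1.0])} for i in range(4)]
print(aggregate_nearby_views(np.array([0.0, 0.0, 1.0]), feats, cams).shape)  # (8,)
```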
- Urban Radiance Fields [77.43604458481637]
We perform 3D reconstruction and novel view synthesis from data captured by scanning platforms commonly deployed for world mapping in urban outdoor environments.
Our approach extends Neural Radiance Fields, which has been demonstrated to synthesize realistic novel images for small scenes in controlled settings, with three extensions; each provides significant performance improvements in experiments on Street View data.
arXiv Detail & Related papers (2021-11-29T15:58:16Z)
- Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [73.56631858393148]
We introduce the problem of perpetual view generation -- long-range generation of novel views corresponding to an arbitrarily long camera trajectory given a single image.
We take a hybrid approach that integrates both geometry and image synthesis in an iterative render, refine, and repeat framework.
Our approach can be trained from a set of monocular video sequences without any manual annotation.
arXiv Detail & Related papers (2020-12-17T18:59:57Z)
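The Infinite Nature summary above frames perpetual view generation as an iterative render, refine, and repeat loop. The sketch below shows only that loop shape, with toy stand-ins: a horizontal shift in place of geometry-based re-rendering and noise fill in place of the learned refinement network.

```python
import numpy as np

rng = np.random.default_rng(0)

def render_step(frame, camera_motion):
    """Warp the current frame toward the next camera pose; here a horizontal
    shift stands in for geometry-based re-rendering."""
    shift = int(camera_motion)
    warped = np.roll(frame, -shift, axis=1)
    warped[:, -shift:] = 0.0  # disoccluded region with no content yet
    return warped

def refine_step(warped):
    """Fill the missing region; a learned refinement network would go here.
    The noise fill below is only a placeholder."""
    hole = warped == 0.0
    warped[hole] = rng.uniform(size=hole.sum())
    return warped

def perpetual_view_generation(first_frame, num_steps, camera_motion=4):
    """Render, refine, repeat: each output becomes the input to the next step,
    so the camera trajectory can be extended indefinitely."""
    frames, frame = [first_frame], first_frame
    for _ in range(num_steps):
        frame = refine_step(render_step(frame, camera_motion))
        frames.append(frame)
    return frames

video = perpetual_view_generation(rng.uniform(size=(64, 64)), num_steps=10)
print(len(video), video[-1].shape)  # 11 (64, 64)
```

Feeding each refined output back in as the next input is what allows the trajectory to grow without a fixed horizon, at the cost of accumulating refinement errors over time.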
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.