DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes
- URL: http://arxiv.org/abs/2409.04003v1
- Date: Fri, 6 Sep 2024 03:09:58 GMT
- Title: DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes
- Authors: Jianbiao Mei, Yukai Ma, Xuemeng Yang, Licheng Wen, Tiantian Wei, Min Dou, Botian Shi, Yong Liu
- Abstract summary: DreamForge is a diffusion-based autoregressive video generation model designed for the long-term generation of 3D-controllable and extensible video.
DreamForge supports flexible conditions such as text descriptions, camera poses, 3D bounding boxes, and road layouts.
For consistency, we ensure inter-view consistency through cross-view attention and temporal coherence via an autoregressive architecture enhanced with motion cues.
- Score: 11.761871622954214
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in diffusion models have significantly enhanced the controllable generation of streetscapes and facilitated downstream perception and planning tasks. However, challenges such as maintaining temporal coherence, generating long videos, and accurately modeling driving scenes persist. Accordingly, we propose DreamForge, an advanced diffusion-based autoregressive video generation model designed for the long-term generation of 3D-controllable and extensible video. In terms of controllability, DreamForge supports flexible conditions such as text descriptions, camera poses, 3D bounding boxes, and road layouts, while also providing perspective guidance to produce driving scenes that are both geometrically and contextually accurate. For consistency, we ensure inter-view consistency through cross-view attention and temporal coherence via an autoregressive architecture enhanced with motion cues. Code will be available at https://github.com/PJLab-ADG/DriveArena.
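The abstract names two mechanisms: cross-view attention for inter-view consistency and an autoregressive loop with motion cues for temporal coherence. As a rough illustration of the first, a minimal PyTorch sketch of cross-view attention follows; all module names, tensor shapes, and the six-camera setup are illustrative assumptions, not DreamForge's released implementation.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Minimal sketch: tokens of each camera view attend to the tokens
    of all views. Assumed feature layout: (batch, views, tokens, channels).
    Illustrative stand-in, not DreamForge's actual module."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, v, n, c = x.shape
        # Queries: tokens of one view; keys/values: tokens of all views.
        q = x.reshape(b * v, n, c)
        kv = x.reshape(b, v * n, c).repeat_interleave(v, dim=0)
        out, _ = self.attn(self.norm(q), self.norm(kv), self.norm(kv))
        return (q + out).reshape(b, v, n, c)

# Example: 6 surround cameras, 256 tokens per view, 128-dim features.
feats = torch.randn(2, 6, 256, 128)
print(CrossViewAttention(128)(feats).shape)  # torch.Size([2, 6, 256, 128])
```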
Related papers
- Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention [62.2447324481159]
Cavia is a novel framework for camera-controllable, multi-view video generation.
Our framework extends the spatial and temporal attention modules, improving both viewpoint and temporal consistency.
Cavia is the first of its kind that allows the user to specify distinct camera motion while obtaining object motion.
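The summary above mentions extended spatial and temporal attention modules. A common factorized pattern, sketched below under assumed tensor shapes, alternates attention within each frame and across frames; this is a generic illustration, not Cavia's actual view-integrated attention.

```python
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    """Illustrative factorized attention: spatial within each frame, then
    temporal across frames at each spatial location. Not Cavia's module."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, c = x.shape  # (batch, frames, tokens, channels)
        s = x.reshape(b * t, n, c)
        s = s + self.spatial(s, s, s)[0]       # mix tokens within a frame
        u = s.reshape(b, t, n, c).transpose(1, 2).reshape(b * n, t, c)
        u = u + self.temporal(u, u, u)[0]      # mix each token across frames
        return u.reshape(b, n, t, c).transpose(1, 2)

x = torch.randn(1, 8, 64, 96)  # 8 frames, 64 tokens, 96 channels
print(FactorizedSTBlock(96)(x).shape)  # torch.Size([1, 8, 64, 96])
```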
arXiv Detail & Related papers (2024-10-14T17:46:32Z)
- MyGo: Consistent and Controllable Multi-View Driving Video Generation with Camera Control [4.556249147612401]
MyGo is an end-to-end framework for driving video generation.
MyGo introduces the motion of onboard cameras as a condition to improve camera controllability and multi-view consistency.
Results show that MyGo has achieved state-of-the-art results in both general camera-controlled video generation and multi-view driving video generation tasks.
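How onboard-camera motion enters the model as a condition is not detailed in this summary. One standard recipe, shown as an assumption-laden sketch below, flattens per-frame relative camera extrinsics and projects them to the model width so they can be added to per-frame features.

```python
import torch
import torch.nn as nn

class CameraPoseEmbed(nn.Module):
    """Illustrative camera conditioning: flatten a 3x4 relative extrinsic
    matrix per frame and project it to the model width. Not MyGo's module."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(12, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, extrinsics: torch.Tensor) -> torch.Tensor:
        # extrinsics: (batch, frames, 3, 4) camera-to-world matrices
        b, t = extrinsics.shape[:2]
        return self.proj(extrinsics.reshape(b, t, 12))

poses = torch.randn(2, 8, 3, 4)
cond = CameraPoseEmbed(128)(poses)  # (2, 8, 128), added to per-frame features
print(cond.shape)
```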
arXiv Detail & Related papers (2024-09-10T03:39:08Z)
- DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation [10.296670127024045]
DriveScape is an end-to-end framework for multi-view, 3D condition-guided video generation.
Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information.
DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39.
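The BiMot architecture is not specified in this summary. A generic way to modulate transformer features with a 3D-condition embedding is FiLM-style scale-and-shift, sketched below purely as an illustration of the modulation idea, not as DriveScape's actual block.

```python
import torch
import torch.nn as nn

class ModulatedBlock(nn.Module):
    """FiLM-style modulation sketch: a pooled 3D-condition embedding yields
    per-channel scale/shift for the video features. Not the actual BiMot."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); cond: (batch, dim) 3D-layout embedding
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + self.mlp(h)

x, cond = torch.randn(2, 100, 64), torch.randn(2, 64)
print(ModulatedBlock(64)(x, cond).shape)  # torch.Size([2, 100, 64])
```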
arXiv Detail & Related papers (2024-09-09T09:43:17Z)
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [62.72540590546812]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
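The division of labor described above, an autoregressive model for long-range structure and a diffusion model for rendering, can be expressed as a simple two-stage loop. The sketch below uses a tiny GRU and a placeholder renderer as stand-ins; MovieDreamer's actual components are of course far larger.

```python
import torch
import torch.nn as nn

# Stand-ins: a tiny GRU plays the autoregressive keyframe predictor and
# `render` plays the diffusion renderer. Purely illustrative.
ar_model = nn.GRU(input_size=32, hidden_size=32, batch_first=True)
render = lambda key_embed: torch.rand(3, 64, 64)  # placeholder decoded frame

tokens = torch.zeros(1, 1, 32)  # start token
hidden, frames = None, []
for _ in range(4):                             # 4 keyframes of the "movie"
    out, hidden = ar_model(tokens, hidden)     # long-range narrative state
    frames.append(render(out[0, -1]))          # diffusion renders each keyframe
    tokens = out[:, -1:, :]                    # feed the prediction back in
print(len(frames), frames[0].shape)            # 4 torch.Size([3, 64, 64])
```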
arXiv Detail & Related papers (2024-07-23T17:17:05Z)
- Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion [61.929653153389964]
We present a method for generating Streetscapes: long sequences of views through an on-the-fly synthesized city-scale scene.
Our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency.
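Autoregressive long-horizon generation of this kind typically proceeds chunk by chunk, conditioning each new chunk on the tail of the previous one. The schematic below uses a placeholder sampler; it illustrates the looping pattern, not the paper's system.

```python
import torch

def generate_chunk(context: torch.Tensor, length: int) -> torch.Tensor:
    """Placeholder for one diffusion sampling pass producing `length` new
    frames conditioned on `context` frames. Illustrative only."""
    return torch.rand(length, *context.shape[1:])

video = torch.rand(4, 3, 64, 64)         # 4 seed frames
for _ in range(10):                       # extend by 10 chunks of 8 frames
    new = generate_chunk(video[-2:], 8)   # condition on the last 2 frames
    video = torch.cat([video, new], dim=0)
print(video.shape)                        # torch.Size([84, 3, 64, 64])
```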
arXiv Detail & Related papers (2024-07-18T17:56:30Z)
- OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving [62.54220021308464]
We propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving.
OccSora can generate 16-second videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes.
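The 4D occupancy representation OccSora generates is a time-indexed voxel grid of semantic classes. The snippet below shows the data structure with illustrative sizes (not OccSora's actual grid resolution or class count).

```python
import torch

# A 4D semantic occupancy sequence: (time, x, y, z) voxels, each holding
# a class id (0 = free). Sizes are illustrative, not OccSora's actual grid.
T, X, Y, Z, NUM_CLASSES = 16, 200, 200, 16, 18
occ = torch.randint(0, NUM_CLASSES, (T, X, Y, Z), dtype=torch.uint8)

# Count occupied voxels per frame to inspect how the scene evolves in time.
occupied_per_frame = (occ != 0).flatten(1).sum(dim=1)
print(occ.shape, occupied_per_frame[:4])
```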
arXiv Detail & Related papers (2024-05-30T17:59:42Z)
- MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes [72.02827211293736]
We introduce MagicDrive3D, a novel pipeline for controllable 3D street scene generation.
Unlike previous methods that reconstruct before training the generative models, MagicDrive3D first trains a video generation model and then reconstructs from the generated data.
Our results demonstrate the framework's superior performance, showcasing its transformative potential for autonomous driving simulation and beyond.
arXiv Detail & Related papers (2024-05-23T12:04:51Z)
- DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation [32.30436679335912]
We propose DriveDreamer-2, which builds upon the framework of DriveDreamer to generate user-defined driving videos.
Ultimately, we propose the Unified Multi-View Model to enhance temporal and spatial coherence in the generated driving videos.
arXiv Detail & Related papers (2024-03-11T16:03:35Z)
- Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models [40.71940056121056]
We present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models.
We demonstrate our approach on various examples where motion can be obtained by animating rigged assets or changing the camera path.
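Conditioning a 2D diffusion model on an animated mesh starts with projecting the mesh's animated vertices through the camera each frame to obtain per-frame correspondences. The bare-bones pinhole projection below is a stand-in for the paper's renderer; the intrinsics and toy animation are made up.

```python
import torch

def project(vertices: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Pinhole projection of (N, 3) camera-space vertices to pixel coords."""
    uvw = vertices @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

K = torch.tensor([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])
verts = torch.rand(1000, 3) + torch.tensor([0.0, 0.0, 2.0])  # in front of camera
for t in range(3):                        # a tiny "animation": drift along x
    frame_verts = verts + torch.tensor([0.1 * t, 0.0, 0.0])
    pix = project(frame_verts, K)         # 2D tracks that would condition diffusion
    print(pix.mean(dim=0))
```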
arXiv Detail & Related papers (2023-12-03T14:17:11Z)
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
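The stated objective, residual vectors between consecutive frames as a motion reference, translates almost directly into code. The sketch below uses an MSE alignment as a placeholder; VMC's exact distillation loss may differ (e.g., cosine alignment of the residuals).

```python
import torch
import torch.nn.functional as F

def motion_residual_loss(pred: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Align frame-to-frame residuals of the prediction with those of the
    reference video. pred/ref: (batch, frames, C, H, W). Sketch of the
    stated idea only; the paper's exact objective may differ."""
    pred_motion = pred[:, 1:] - pred[:, :-1]  # residuals between frames
    ref_motion = ref[:, 1:] - ref[:, :-1]
    return F.mse_loss(pred_motion, ref_motion)

pred = torch.rand(1, 8, 4, 32, 32, requires_grad=True)
ref = torch.rand(1, 8, 4, 32, 32)
print(motion_residual_loss(pred, ref))
```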
arXiv Detail & Related papers (2023-12-01T06:50:11Z)