MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation
- URL: http://arxiv.org/abs/2508.14327v1
- Date: Wed, 20 Aug 2025 00:51:36 GMT
- Title: MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation
- Authors: Guile Wu, David Huang, Dongfeng Bai, Bingbing Liu
- Abstract summary: We propose a novel multi-modal multi-view video generation approach to autonomous driving. Our approach is capable of generating multi-modal multi-view driving scene videos in a unified framework. Our experiments on the challenging real-world autonomous driving dataset, nuScenes, show that our approach can generate multi-modal multi-view urban scene videos with high fidelity and controllability.
- Score: 20.943599420478105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video generation has recently shown superiority in urban scene synthesis for autonomous driving. Existing video generation approaches to autonomous driving primarily focus on RGB video generation and lack the ability to support multi-modal video generation. However, multi-modal data, such as depth maps and semantic maps, are crucial for holistic urban scene understanding in autonomous driving. Although it is feasible to use multiple models to generate different modalities, this increases the difficulty of model deployment and does not leverage complementary cues for multi-modal data generation. To address this problem, in this work, we propose a novel multi-modal multi-view video generation approach to autonomous driving. Specifically, we construct a unified diffusion transformer model composed of modal-shared components and modal-specific components. Then, we leverage diverse conditioning inputs to encode controllable scene structure and content cues into the unified diffusion model for multi-modal multi-view video generation. In this way, our approach is capable of generating multi-modal multi-view driving scene videos in a unified framework. Our experiments on the challenging real-world autonomous driving dataset, nuScenes, show that our approach can generate multi-modal multi-view urban scene videos with high fidelity and controllability, surpassing the state-of-the-art methods.
Related papers
- Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing [93.8111348452324]
Tele-Omni is a unified framework for video generation and editing that follows multimodal instructions. It supports text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing.
arXiv Detail & Related papers (2026-02-10T10:01:16Z) - MultiCOIN: Multi-Modal COntrollable Video INbetweening [46.37499813275259]
We introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse representation. We also propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly.
arXiv Detail & Related papers (2025-10-09T17:59:27Z) - Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z) - DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation [54.30327187663316]
DiTCtrl is the first training-free multi-prompt video generation method for MM-DiT architectures. We analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to that of the cross/self-attention blocks in the UNet-like diffusion models. Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts.
arXiv Detail & Related papers (2024-12-24T18:51:19Z) - UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving [18.189392365510848]
UniMLVG is a unified framework designed to generate extended street multi-perspective videos. Our framework achieves improvements of 48.2% in FID and 35.2% in FVD.
arXiv Detail & Related papers (2024-12-06T08:27:53Z) - DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation [10.296670127024045]
DriveScape is an end-to-end framework for multi-view, 3D condition-guided video generation.
Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information.
DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39.
arXiv Detail & Related papers (2024-09-09T09:43:17Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions [94.03133100056372]
Moonshot is a new video generation model that conditions simultaneously on multimodal inputs of image and text.
The model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing.
arXiv Detail & Related papers (2024-01-03T16:43:47Z) - Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an approach to apply end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z) - DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model [19.288610627281102]
We propose DrivingDiffusion to generate realistic multi-view videos controlled by 3D layout.
Our model can generate large-scale realistic multi-camera driving videos in complex urban scenes.
arXiv Detail & Related papers (2023-10-11T18:00:08Z) - Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)