Related papers: DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving

DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving

URL: http://arxiv.org/abs/2408.16647v1
Date: Thu, 29 Aug 2024 15:52:56 GMT
Title: DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving
Authors: Yongjie Fu, Anmol Jain, Xuan Di, Xu Chen, Zhaobin Mo,
Abstract summary: Vision language models (VLMs) are emerging as revolutionary tools with significant potential to influence autonomous driving. We propose the DriveGenVLM framework to generate driving videos and use VLMs to understand them.
Score: 12.004604110512421
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The advancement of autonomous driving technologies necessitates increasingly sophisticated methods for understanding and predicting real-world scenarios. Vision language models (VLMs) are emerging as revolutionary tools with significant potential to influence autonomous driving. In this paper, we propose the DriveGenVLM framework to generate driving videos and use VLMs to understand them. To achieve this, we employ a video generation framework grounded in denoising diffusion probabilistic models (DDPM) aimed at predicting real-world video sequences. We then explore the adequacy of our generated videos for use in VLMs by employing a pre-trained model known as Efficient In-context Learning on Egocentric Videos (EILEV). The diffusion model is trained with the Waymo open dataset and evaluated using the Fr\'echet Video Distance (FVD) score to ensure the quality and realism of the generated videos. Corresponding narrations are provided by EILEV for these generated videos, which may be beneficial in the autonomous driving domain. These narrations can enhance traffic scene understanding, aid in navigation, and improve planning capabilities. The integration of video generation with VLMs in the DriveGenVLM framework represents a significant step forward in leveraging advanced AI models to address complex challenges in autonomous driving.

Related papers

AppleVLM: End-to-end Autonomous Driving with Advanced Perception and Planning-Enhanced Vision-Language Models [11.748457186467727]
We propose AppleVLM, an advanced perception and planning-enhanced VLM model for robust end-to-end driving.<n>AppleVLM introduces a novel vision encoder and a planning strategy encoder to improve perception and decision-making.<n>We evaluate AppleVLM in closed-loop experiments on two CARLA benchmarks, achieving state-of-the-art driving performance.
arXiv Detail & Related papers (2026-02-04T06:37:14Z)
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds [56.749972238005604]
This paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos.<n> TRANSPORTER learns an optimal transport coupling to VLM's high-semantic embedding spaces.<n>In turn, logit scores define embedding directions for conditional video generation.
arXiv Detail & Related papers (2025-11-23T09:12:48Z)
LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving [58.535516533697425]
Large vision-language models (VLMs) have shown promising capabilities in scene understanding.<n>We propose a novel vision-language framework tailored for autonomous driving, called LMAD.<n>Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs.
arXiv Detail & Related papers (2025-08-17T15:42:54Z)
From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models [65.0487600936788]
Video Diffusion Models (VDMs) have emerged as powerful generative tools capable of synthesizing high-quality content.<n>We argue that VDMs naturally push to probe structured representations and an implicit understanding of the visual world.<n>Our method transforms each task into a visual transition, enabling the training of LoRA weights on short input- sequences.
arXiv Detail & Related papers (2025-06-08T20:52:34Z)
VaViM and VaVAM: Autonomous Driving through Video Generative Modeling [88.33638585518226]
We introduce an open-source auto-regressive video model (VaM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving. We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving.
arXiv Detail & Related papers (2025-02-21T18:56:02Z)
VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision [20.43366384946928]
Vision-language models (VLMs) as teachers to enhance training. VLM-AD achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.
arXiv Detail & Related papers (2024-12-19T01:53:36Z)
MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control [68.74166535159311]
We introduce MagicDriveDiT, a novel approach based on the DiT architecture. By incorporating spatial-temporal conditional encoding, MagicDriveDiT achieves precise control over spatial-temporal latents. Experiments show its superior performance in generating realistic street scene videos with higher resolution and more frames.
arXiv Detail & Related papers (2024-11-21T03:13:30Z)
Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey [61.39993881402787]
World models and video generation are pivotal technologies in the domain of autonomous driving. This paper investigates the relationship between these two technologies. By analyzing the interplay between video generation and world models, this survey identifies critical challenges and future research directions.
arXiv Detail & Related papers (2024-11-05T08:58:35Z)
DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model [65.43473733967038]
We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics. Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge.
arXiv Detail & Related papers (2024-10-14T17:19:23Z)
GenDDS: Generating Diverse Driving Video Scenarios with Prompt-to-Video Generative Model [6.144680854063938]
GenDDS is a novel approach for generating driving scenarios for autonomous driving systems. We employ the KITTI dataset, which includes real-world driving videos, to train the model. We demonstrate that our model can generate high-quality driving videos that closely replicate the complexity and variability of real-world driving scenarios.
arXiv Detail & Related papers (2024-08-28T15:37:44Z)
CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving [1.727597257312416]
CoVLA (Comprehensive Vision-Language-Action) dataset comprises real-world driving videos spanning more than 80 hours. This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems.
arXiv Detail & Related papers (2024-08-19T09:53:49Z)
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
Probing Multimodal LLMs as World Models for Driving [72.18727651074563]
We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving. Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
arXiv Detail & Related papers (2024-05-09T17:52:42Z)
Video as the New Language for Real-World Decision Making [100.68643056416394]
Video data captures important information about the physical world that is difficult to express in language. Video can serve as a unified interface that can absorb internet knowledge and represent diverse tasks. We identify major impact opportunities in domains such as robotics, self-driving, and science.
arXiv Detail & Related papers (2024-02-27T02:05:29Z)
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models [31.552397390480525]
We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. We propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline.
arXiv Detail & Related papers (2024-02-19T17:04:04Z)
Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving [56.381918362410175]
Drive-WM is the first driving world model compatible with existing end-to-end planning models. Our model generates high-fidelity multiview videos in driving scenes.
arXiv Detail & Related papers (2023-11-29T18:59:47Z)
Geo-Context Aware Study of Vision-Based Autonomous Driving Models and Spatial Video Data [9.883009014227815]
Vision-based deep learning (DL) methods have made great progress in learning autonomous driving models from large-scale crowd-sourced video datasets. We develop a geo-context aware visualization system for the study of Autonomous Driving Model (ADM) predictions together with large-scale ADM video data.
arXiv Detail & Related papers (2021-08-20T17:33:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.