Probing Multimodal LLMs as World Models for Driving
- URL: http://arxiv.org/abs/2405.05956v2
- Date: Fri, 25 Oct 2024 19:31:37 GMT
- Title: Probing Multimodal LLMs as World Models for Driving
- Authors: Shiva Sreeram, Tsun-Hsuan Wang, Alaa Maalouf, Guy Rosman, Sertac Karaman, Daniela Rus,
- Abstract summary: We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving.
Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
- Score: 72.18727651074563
- License:
- Abstract: We provide a sober look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving, challenging common assumptions about their ability to interpret dynamic driving scenarios. Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored. Our experimental study assesses various MLLMs as world models using in-car camera perspectives and reveals that while these models excel at interpreting individual images, they struggle to synthesize coherent narratives across frames, leading to considerable inaccuracies in understanding (i) ego vehicle dynamics, (ii) interactions with other road actors, (iii) trajectory planning, and (iv) open-set scene reasoning. We introduce the Eval-LLM-Drive dataset and DriveSim simulator to enhance our evaluation, highlighting gaps in current MLLM capabilities and the need for improved models in dynamic real-world environments.
Related papers
- Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators.
To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module.
Experiments demonstrate that DWS can be versatilely applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z) - A Survey of World Models for Autonomous Driving [63.33363128964687]
Recent breakthroughs in autonomous driving have been propelled by advances in robust world modeling.
This paper systematically reviews recent advances in world models for autonomous driving.
arXiv Detail & Related papers (2025-01-20T04:00:02Z) - DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers [61.92571851411509]
We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning.
Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.
arXiv Detail & Related papers (2024-12-24T18:59:37Z) - Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey [61.39993881402787]
World models and video generation are pivotal technologies in the domain of autonomous driving.
This paper investigates the relationship between these two technologies.
By analyzing the interplay between video generation and world models, this survey identifies critical challenges and future research directions.
arXiv Detail & Related papers (2024-11-05T08:58:35Z) - DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model [65.43473733967038]
We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics.
Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge.
arXiv Detail & Related papers (2024-10-14T17:19:23Z) - OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving [12.004183122121042]
OccLLaMA is an occupancy-language-action generative world model.
We build a unified multi-modal vocabulary for vision, language and action.
OccLLaMA achieves competitive performance across multiple tasks.
arXiv Detail & Related papers (2024-09-05T06:30:01Z) - Cycle-Consistent World Models for Domain Independent Latent Imagination [0.0]
High costs and risks make it hard to train autonomous cars in the real world.
We propose a novel model-based reinforcement learning approach called Cycleconsistent World Models.
arXiv Detail & Related papers (2021-10-02T13:55:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.