Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
- URL: http://arxiv.org/abs/2501.04003v1
- Date: Tue, 07 Jan 2025 18:59:55 GMT
- Title: Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
- Authors: Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, Liang Pan
- Abstract summary: We introduce DriveBench, a benchmark dataset designed to evaluate the reliability of Vision-Language Models (VLMs) in autonomous driving.
Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding.
We propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding.
- Score: 56.528835143531694
- Abstract: Recent advancements in Vision-Language Models (VLMs) have sparked interest in their use for autonomous driving, particularly in generating interpretable driving decisions through natural language. However, the assumption that VLMs inherently provide visually grounded, reliable, and interpretable explanations for driving remains largely unexamined. To address this gap, we introduce DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames, 20,498 question-answer pairs, three question types, four mainstream driving tasks, and a total of 12 popular VLMs. Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. We further observe that VLMs struggle with multi-modal reasoning and display heightened sensitivity to input corruptions, leading to inconsistencies in performance. To address these challenges, we propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding. Additionally, we highlight the potential of leveraging VLMs' awareness of corruptions to enhance their reliability, offering a roadmap for developing more trustworthy and interpretable decision-making systems in real-world autonomous driving contexts. The benchmark toolkit is publicly accessible.
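The core probe behind these findings is simple: ask the same driving question under clean, corrupted, and text-only inputs, and check whether the answer actually depends on the image. Below is a minimal sketch of that idea; the `query_vlm` wrapper, the `Frame` record, and the corruption hook are hypothetical stand-ins for illustration, not DriveBench's actual API.

```python
# Sketch of a DriveBench-style visual-grounding probe.
# All names here (query_vlm, Frame, corrupt) are illustrative
# assumptions, not the benchmark's real interface.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Frame:
    image: Optional[bytes]  # None models the text-only setting
    question: str
    answer: str             # reference answer for scoring

def grounding_probe(
    query_vlm: Callable[[Optional[bytes], str], str],
    frame: Frame,
    corrupt: Callable[[bytes], bytes],
) -> dict:
    """Query the same question under clean, corrupted, and text-only
    inputs. A model whose answer is invariant to the visual input is
    likely responding from priors rather than visual grounding."""
    assert frame.image is not None, "probe expects a clean frame"
    clean = query_vlm(frame.image, frame.question)
    corrupted = query_vlm(corrupt(frame.image), frame.question)
    blind = query_vlm(None, frame.question)  # text-only control
    return {
        "clean": clean,
        "corrupted": corrupted,
        "text_only": blind,
        # Red flag: identical answers with and without the image.
        "suspect_ungrounded": clean == blind,
    }
```

An answer that is unchanged when the image is corrupted or removed suggests the model is leaning on general knowledge or textual cues rather than the scene, which is precisely the failure mode the paper reports and the behavior its refined metrics are designed to penalize.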
Related papers
- Black-Box Adversarial Attack on Vision Language Models for Autonomous Driving [65.61999354218628]
We take the first step toward designing black-box adversarial attacks specifically targeting vision-language models (VLMs) in autonomous driving systems.
We propose Cascading Adversarial Disruption (CAD), which targets low-level reasoning breakdown by generating and injecting deceptive semantics.
We present Risky Scene Induction, which addresses dynamic adaptation by leveraging a surrogate VLM to understand and construct high-level risky scenarios.
arXiv Detail & Related papers (2025-01-23T11:10:02Z) - AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving [106.0319745724181]
We introduce AutoTrust, a comprehensive trustworthiness benchmark for large vision-language models in autonomous driving (DriveVLMs).
We constructed the largest visual question-answering dataset for investigating trustworthiness issues in driving scenarios.
Our evaluations have unveiled previously undiscovered vulnerabilities of DriveVLMs to trustworthiness threats.
arXiv Detail & Related papers (2024-12-19T18:59:33Z) - Can LVLMs Obtain a Driver's License? A Benchmark Towards Reliable AGI for Autonomous Driving [24.485164073626674]
We propose IDKB, a large-scale dataset containing over one million data items collected from various countries.
Much like the process of obtaining a driver's license, IDKB encompasses nearly all the explicit knowledge needed for driving from theory to practice.
arXiv Detail & Related papers (2024-09-04T17:52:43Z) - A Superalignment Framework in Autonomous Driving with Large Language Models [2.650382010271]
Large language models (LLMs) and multi-modal large language models (MLLMs) are extensively used in autonomous driving.
Despite their importance, the security aspect of LLMs in autonomous driving remains underexplored.
This research introduces a novel security framework for autonomous vehicles, utilizing a multi-agent LLM approach.
arXiv Detail & Related papers (2024-06-09T05:26:38Z) - Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving [38.28159034562901]
Reason2Drive is a benchmark dataset with over 600K video-text pairs.
We characterize the autonomous driving process as a sequential combination of perception, prediction, and reasoning steps.
We introduce a novel aggregated evaluation metric to assess chain-based reasoning performance in autonomous systems.
arXiv Detail & Related papers (2023-12-06T18:32:33Z) - On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving [37.617793990547625]
This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V.
We explore the model's abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver.
Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems.
arXiv Detail & Related papers (2023-11-09T12:58:37Z) - LLM4Drive: A Survey of Large Language Models for Autonomous Driving [62.10344445241105]
Large language models (LLMs) have demonstrated abilities including understanding context, logical reasoning, and generating answers.
In this paper, we systematically review the research line of Large Language Models for Autonomous Driving (LLM4AD).
arXiv Detail & Related papers (2023-11-02T07:23:33Z) - DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [84.29836263441136]
This study introduces DriveGPT4, a novel interpretable end-to-end autonomous driving system based on multimodal large language models (MLLMs).
DriveGPT4 facilitates the interpretation of vehicle actions, offers pertinent reasoning, and effectively addresses a diverse range of questions posed by users.
arXiv Detail & Related papers (2023-10-02T17:59:52Z) - Unsupervised Self-Driving Attention Prediction via Uncertainty Mining and Knowledge Embedding [51.8579160500354]
We propose an unsupervised way to predict self-driving attention by uncertainty modeling and driving knowledge integration.
Results show performance comparable to, or even surpassing, fully supervised state-of-the-art approaches.
arXiv Detail & Related papers (2023-03-17T00:28:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.