AutoDrive-QA: A Multiple-Choice Benchmark for Vision-Language Evaluation in Urban Autonomous Driving
- URL: http://arxiv.org/abs/2503.15778v2
- Date: Sat, 04 Oct 2025 21:23:23 GMT
- Title: AutoDrive-QA: A Multiple-Choice Benchmark for Vision-Language Evaluation in Urban Autonomous Driving
- Authors: Boshra Khalili, Andrew W. Smyth
- Abstract summary: We introduce AutoDrive-QA, the first benchmark that systematically converts open-ended driving QA into structured multiple-choice questions. We show that fine-tuning LLaVA-1.5-7B improves accuracy by about six percentage points across tasks, GPT-4V achieves the strongest zero-shot performance with up to 69.8% accuracy, and Qwen2-VL models also perform competitively.
- Score: 0.7734726150561086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating vision-language models (VLMs) in urban driving contexts remains challenging, as existing benchmarks rely on open-ended responses that are ambiguous, annotation-intensive, and inconsistent to score. This lack of standardized evaluation slows progress toward safe and reliable AI for urban mobility. We introduce AutoDrive-QA, the first benchmark that systematically converts open-ended driving QA datasets (DriveLM, NuScenes-QA, LingoQA) into structured multiple-choice questions (MCQs) with distractors grounded in five realistic error categories: Driving Domain Misconceptions, Logical Inconsistencies, Misinterpreted Sensor Inputs, Computational Oversights, and Question Ambiguity. This framework enables reproducible and interpretable evaluation of VLMs across perception, prediction, and planning tasks in complex urban scenes. Experiments show that fine-tuning LLaVA-1.5-7B improves accuracy by about six percentage points across tasks, GPT-4V achieves the strongest zero-shot performance with up to 69.8% accuracy, and Qwen2-VL models also perform competitively, particularly in multi-view settings. Moreover, traditional metrics such as BLEU and CIDEr fail to distinguish strong from weak models. By providing an objective, domain-grounded evaluation protocol, AutoDrive-QA contributes to more transparent benchmarking of urban AI systems, supporting the development of safer and more trustworthy autonomous driving technologies for smart cities.
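To make the evaluation protocol concrete, below is a minimal sketch in Python of how an open-ended driving QA pair could be repackaged as a multiple-choice item with distractors tagged by the five error categories, and then scored by simple option matching rather than free-text metrics such as BLEU or CIDEr. The item schema, field names, and scoring helper are illustrative assumptions, not the benchmark's released code.

```python
from dataclasses import dataclass
from typing import List

# The five distractor error categories named in the abstract.
ERROR_CATEGORIES = [
    "Driving Domain Misconceptions",
    "Logical Inconsistencies",
    "Misinterpreted Sensor Inputs",
    "Computational Oversights",
    "Question Ambiguity",
]

@dataclass
class MCQItem:
    # Hypothetical schema for one multiple-choice item.
    question: str                     # e.g. "What should the ego vehicle do next?"
    options: List[str]                # shuffled answer choices shown to the model
    correct_index: int                # index of the ground-truth answer in `options`
    distractor_categories: List[str]  # one error category per distractor

def accuracy(predictions: List[int], items: List[MCQItem]) -> float:
    """Objective scoring: fraction of items where the model's chosen option
    index matches the ground-truth index; no free-text similarity needed."""
    correct = sum(p == item.correct_index for p, item in zip(predictions, items))
    return correct / len(items)
```

Because each item has a single ground-truth index, scoring reduces to exact option matching, which is what makes the evaluation reproducible compared with n-gram metrics over open-ended answers.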
Related papers
- dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning [69.36145467833498]
We introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems.
arXiv Detail & Related papers (2025-12-04T05:05:41Z) - DriveQA: Passing the Driving Knowledge Test [13.569275971952154]
We present DriveQA, an extensive open-source text and vision-based benchmark that exhaustively covers traffic regulations and scenarios. We show that state-of-the-art LLMs and Multimodal LLMs (MLLMs) perform well on basic traffic rules but exhibit significant weaknesses in numerical reasoning and complex right-of-way scenarios. We also demonstrate that models can internalize text and synthetic traffic knowledge to generalize effectively across downstream QA tasks.
arXiv Detail & Related papers (2025-08-29T17:59:53Z) - STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes [5.685235562999083]
STRIDE-QA is the largest visual question answering dataset for spatiotemporal reasoning in urban driving. It supports both object-centric and ego-centric reasoning through spatial localization and temporal prediction. Our benchmarks demonstrate that existing Vision-Language Models (VLMs) struggle, achieving near-zero scores on prediction consistency.
arXiv Detail & Related papers (2025-08-14T07:57:06Z) - AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving [28.378854340190973]
Vision-Language Models (VLMs) show promise for autonomous driving, yet they struggle with hallucinations, inefficient reasoning, and limited real-world validation, which hinders accurate perception and robust step-by-step reasoning. We introduce AgentThink, a pioneering unified framework that integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks.
arXiv Detail & Related papers (2025-05-21T09:27:43Z) - Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving [55.96227460521096]
Vision-Language Models (VLMs) have been integrated into autonomous driving systems to enhance reasoning capabilities. We propose a natural reflection-based backdoor attack targeting VLM systems in autonomous driving scenarios. Our findings uncover a new class of attacks that exploit the stringent real-time requirements of autonomous driving.
arXiv Detail & Related papers (2025-05-09T20:28:17Z) - Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding [10.242043337117005]
Vision Large Language Models (VLLMs) have demonstrated impressive capabilities in general visual tasks such as image captioning and visual question answering. However, their effectiveness in specialized, safety-critical domains like autonomous driving remains largely unexplored. We introduce DVBench, a pioneering benchmark designed to evaluate the performance of VLLMs in understanding safety-critical driving videos.
arXiv Detail & Related papers (2025-04-20T07:50:44Z) - NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving [10.41584658117874]
We propose NuScenes-SpatialQA, the first large-scale ground-truth-based Question-Answer (QA) benchmark designed to evaluate the spatial understanding and reasoning capabilities of Vision-Language Models (VLMs) in autonomous driving. Built upon the NuScenes dataset, the benchmark is constructed through an automated 3D scene graph generation pipeline and a QA generation pipeline. Using this benchmark, we conduct extensive experiments on diverse VLMs, including both general and spatial-enhanced models, providing the first comprehensive evaluation of their spatial capabilities in autonomous driving.
arXiv Detail & Related papers (2025-04-04T04:43:10Z) - DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding [76.3876070043663]
We propose DriveLMM-o1, a dataset and benchmark designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model.
arXiv Detail & Related papers (2025-03-13T17:59:01Z) - SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models [63.71984266104757]
We propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge. To explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic. Our Multimodal Retrieval-Augmented Generation model leverages video, control signals, and environmental attributes to learn from past driving experiences.
arXiv Detail & Related papers (2025-02-28T21:53:47Z) - Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives [56.528835143531694]
We introduce DriveBench, a benchmark dataset designed to evaluate Vision-Language Models (VLMs). Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding. We propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding.
arXiv Detail & Related papers (2025-01-07T18:59:55Z) - Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation [69.81654421834989]
We introduce AutoConverter, an agentic framework that automatically converts open-ended questions into multiple-choice format. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format. We evaluate 33 state-of-the-art vision language models on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
arXiv Detail & Related papers (2025-01-06T18:57:31Z) - DriveMM: All-in-One Large Multimodal Model for Autonomous Driving [63.882827922267666]
DriveMM is a large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of autonomous driving tasks.
We conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks.
arXiv Detail & Related papers (2024-12-10T17:27:32Z) - AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z) - AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning [54.47116888545878]
AutoAct is an automatic agent learning framework for QA.
It does not rely on large-scale annotated data and synthetic planning trajectories from closed-source models.
arXiv Detail & Related papers (2024-01-10T16:57:24Z) - DriveLM: Driving with Graph Visual Question Answering [57.51930417790141]
We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems. We propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving.
arXiv Detail & Related papers (2023-12-21T18:59:12Z) - NuScenes-MQA: Integrated Evaluation of Captions and QA for Autonomous Driving Datasets using Markup Annotations [0.6827423171182154]
Visual Question Answering (VQA) is one of the most important tasks in autonomous driving.
We introduce a novel dataset annotation technique in which QAs are enclosed within markups.
This dataset empowers the development of vision language models, especially for autonomous driving tasks.
arXiv Detail & Related papers (2023-12-11T12:58:54Z) - Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving [38.28159034562901]
Reason2Drive is a benchmark dataset with over 600K video-text pairs.
We characterize the autonomous driving process as a sequential combination of perception, prediction, and reasoning steps.
We introduce a novel aggregated evaluation metric to assess chain-based reasoning performance in autonomous systems.
arXiv Detail & Related papers (2023-12-06T18:32:33Z) - NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z) - The 5th AI City Challenge [51.83023045451549]
The fifth AI City Challenge attracted 305 participating teams across 38 countries.
The evaluation was conducted on both algorithmic effectiveness and computational efficiency.
Results show the promise of AI in Smarter Transportation.
arXiv Detail & Related papers (2021-04-25T19:15:27Z)