Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts
- URL: http://arxiv.org/abs/2510.19001v1
- Date: Tue, 21 Oct 2025 18:24:59 GMT
- Title: Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts
- Authors: Seungjun Yu, Junsung Park, Youngsun Lim, Hyunjung Shim
- Abstract summary: We present a vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. Notably, the system maintains 96% accuracy under severe visual corruption.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In Phase-1, a large multimodal LLM (Qwen2.5-VL-32B) is conditioned on six-camera inputs, a short temporal window of history, and a chain-of-thought prompt with few-shot exemplars. A self-consistency ensemble (multiple sampled reasoning chains) further improves answer reliability. In Phase-2, we augment the prompt with nuScenes scene metadata (object annotations, ego-vehicle state, etc.) and category-specific question instructions (separate prompts for perception, prediction, and planning tasks). In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. For example, using 5 history frames and 10-shot prompting in Phase-1 yields 65.1% overall accuracy (vs. 62.61% with zero-shot); applying self-consistency raises this to 66.85%. Phase-2 achieves 67.37% overall. Notably, the system maintains 96% accuracy under severe visual corruption. These results demonstrate that carefully engineered prompts and contextual grounding can greatly enhance high-level driving QA with pretrained vision-language models.
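To make the two prompt-engineering ideas concrete, here is a minimal Python sketch of (a) assembling a Phase-2-style prompt from a category-specific instruction, scene metadata, and few-shot exemplars, and (b) the Phase-1 self-consistency vote over sampled reasoning chains. The instruction strings, metadata keys, and function names are illustrative assumptions, not the paper's released code; `fake_sampler` stands in for a temperature-sampled Qwen2.5-VL-32B decode.

```python
import random
from collections import Counter

# Category-specific instructions (the paper uses separate prompts for
# perception, prediction, and planning; the exact wording is not given
# in the abstract, so these strings are illustrative placeholders).
TASK_INSTRUCTIONS = {
    "perception": "Describe the objects and layout visible around the ego vehicle.",
    "prediction": "Anticipate how nearby agents will move over the next few seconds.",
    "planning": "Recommend a safe maneuver for the ego vehicle.",
}

def build_prompt(question, category, metadata, exemplars):
    """Phase-2-style prompt: task instruction + nuScenes-style scene
    metadata (object annotations, ego state, ...) + few-shot exemplars,
    ending with a chain-of-thought cue."""
    context = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    shots = "\n\n".join(exemplars)
    return (
        f"{TASK_INSTRUCTIONS[category]}\n\n"
        f"Scene metadata:\n{context}\n\n"
        f"Worked examples:\n{shots}\n\n"
        f"Question: {question}\nLet's think step by step."
    )

def self_consistent_answer(sample_chain, prompt, n=8):
    """Phase-1 self-consistency: sample n reasoning chains and return the
    majority final answer. `sample_chain` maps a prompt to one answer
    string (one temperature-sampled decode of the VLM)."""
    votes = Counter(sample_chain(prompt) for _ in range(n))
    return votes.most_common(1)[0][0]

# Toy demo with a stub standing in for the actual multimodal model.
def fake_sampler(prompt):
    return random.choice(["slow down", "slow down", "keep speed"])

prompt = build_prompt(
    question="What should the ego vehicle do at the crosswalk ahead?",
    category="planning",
    metadata={"ego_speed": "8.2 m/s", "annotations": "pedestrian, 12 m ahead"},
    exemplars=["Q: A cyclist merges from the right. A: Yield and slow down."],
)
print(self_consistent_answer(fake_sampler, prompt))
```

Per the reported numbers, majority voting over sampled chains lifted overall accuracy from 65.1% to 66.85%, so even this simple aggregation step carries measurable weight.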
Related papers
- Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning [0.47745223151611654]
This report presents our solution for the RoboSense Challenge at IROS 2025. It evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components.
arXiv Detail & Related papers (2025-10-28T07:43:30Z) - SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection [6.806105013817923]
SAVANT is a structured reasoning framework that achieves high accuracy and recall in detecting anomalous driving scenarios. By automatically labeling over 9,640 real-world images with high accuracy, SAVANT addresses the critical data scarcity problem in anomaly detection.
arXiv Detail & Related papers (2025-10-20T19:14:29Z) - Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis [7.392659193819963]
Traffic safety analysis requires complex video understanding to capture behavioral patterns and generate descriptions for accident prevention. In this work, we present a unique dual-model framework that strategically utilizes the complementary strengths of VideoLLaMA and Qwen2.5-VL through task-specific optimization.
arXiv Detail & Related papers (2025-10-13T20:18:23Z) - Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment [10.701522670464463]
Multimodal large language models (MLLMs) can proficiently evaluate visual quality through interpretable assessments. We propose a unified two-stage training framework comprising a cold-start stage and a reinforcement learning-based fine-tuning stage. We designate the models derived from these two stages as Q-Ponder-CI and Q-Ponder.
arXiv Detail & Related papers (2025-06-03T10:11:51Z) - LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models [59.0256377330646]
LENS is a benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios. The dataset intrinsically supports evaluating how MLLMs handle image-invariable prompts, from basic perception to compositional reasoning. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, and GPT-4o, as well as two reasoning models, QVQ-72B-preview and Kimi-VL.
arXiv Detail & Related papers (2025-05-21T15:06:59Z) - AutoDrive-QA: A Multiple-Choice Benchmark for Vision-Language Evaluation in Urban Autonomous Driving [0.7734726150561086]
We introduce AutoDrive-QA, the first benchmark that systematically converts open-ended driving QA into structured multiple-choice questions. We show that fine-tuning LLaVA-1.5-7B improves accuracy by about six percentage points across tasks, GPT-4V achieves the strongest zero-shot performance with up to 69.8% accuracy, and Qwen2-VL models also perform competitively.
arXiv Detail & Related papers (2025-03-20T01:32:00Z) - DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding [76.3876070043663]
We propose DriveLMM-o1, a dataset and benchmark designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model.
arXiv Detail & Related papers (2025-03-13T17:59:01Z) - Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation [64.64849950642619]
We develop an evaluation framework inspired by formal semantics for evaluating text-to-image models.
We show that Davidsonian Scene Graph (DSG) produces atomic and unique questions organized in dependency graphs.
We also present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts.
arXiv Detail & Related papers (2023-10-27T16:20:10Z) - NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z) - PRover: Proof Generation for Interpretable Reasoning over Rules [81.40404921232192]
We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs.
Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm.
We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation.
arXiv Detail & Related papers (2020-10-06T15:47:53Z) - Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions; a toy sketch of the question-masking step follows this entry.
We achieve a record-breaking 58.95% on VQA-CP v2, a gain of 6.5%.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
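As a postscript to the CSS entry above, here is a minimal, hypothetical sketch of the question-side masking step: given a question and a set of critical words (which the actual method identifies via model attributions; here they are simply passed in), it emits counterfactual variants with those words masked. The mask token and function name are illustrative, not the authors' code.

```python
from itertools import combinations

MASK = "[MASK]"

def counterfactual_questions(question, critical_words, max_masked=2):
    """Generate counterfactual variants of `question` by masking up to
    `max_masked` critical words at a time (simplified question-side CSS;
    the real method scores word criticality with model attributions)."""
    tokens = question.split()
    critical_idx = [i for i, t in enumerate(tokens)
                    if t.strip("?.,").lower() in critical_words]
    variants = []
    for k in range(1, max_masked + 1):
        for subset in combinations(critical_idx, k):
            masked = [MASK if i in subset else t for i, t in enumerate(tokens)]
            variants.append(" ".join(masked))
    return variants

print(counterfactual_questions("What color is the traffic light?",
                               {"color", "traffic", "light"}))
```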