Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts
- URL: http://arxiv.org/abs/2510.19001v1
- Date: Tue, 21 Oct 2025 18:24:59 GMT
- Title: Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts
- Authors: Seungjun Yu, Junsung Park, Youngsun Lim, Hyunjung Shim
- Abstract summary: We present a vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. Notably, the system maintains 96% accuracy under severe visual corruption.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In Phase-1, a large multimodal LLM (Qwen2.5-VL-32B) is conditioned on six-camera inputs, a short temporal window of history, and a chain-of-thought prompt with few-shot exemplars. A self-consistency ensemble (multiple sampled reasoning chains) further improves answer reliability. In Phase-2, we augment the prompt with nuScenes scene metadata (object annotations, ego-vehicle state, etc.) and category-specific question instructions (separate prompts for perception, prediction, and planning tasks). In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. For example, using 5 history frames and 10-shot prompting in Phase-1 yields 65.1% overall accuracy (vs. 62.61% with zero-shot); applying self-consistency raises this to 66.85%. Phase-2 achieves 67.37% overall. Notably, the system maintains 96% accuracy under severe visual corruption. These results demonstrate that carefully engineered prompts and contextual grounding can greatly enhance high-level driving QA with pretrained vision-language models.
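To make the two prompt-engineering ideas concrete, here is a minimal Python sketch of (a) assembling a Phase-2-style prompt from a category-specific instruction, scene metadata, and few-shot exemplars, and (b) the Phase-1 self-consistency vote over sampled reasoning chains. The instruction strings, metadata keys, and function names are illustrative assumptions, not the paper's released code; `fake_sampler` stands in for a temperature-sampled Qwen2.5-VL-32B decode.

```python
import random
from collections import Counter

# Category-specific instructions (the paper uses separate prompts for
# perception, prediction, and planning; the exact wording is not given
# in the abstract, so these strings are illustrative placeholders).
TASK_INSTRUCTIONS = {
    "perception": "Describe the objects and layout visible around the ego vehicle.",
    "prediction": "Anticipate how nearby agents will move over the next few seconds.",
    "planning": "Recommend a safe maneuver for the ego vehicle.",
}

def build_prompt(question, category, metadata, exemplars):
    """Phase-2-style prompt: task instruction + nuScenes-style scene
    metadata (object annotations, ego state, ...) + few-shot exemplars,
    ending with a chain-of-thought cue."""
    context = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    shots = "\n\n".join(exemplars)
    return (
        f"{TASK_INSTRUCTIONS[category]}\n\n"
        f"Scene metadata:\n{context}\n\n"
        f"Worked examples:\n{shots}\n\n"
        f"Question: {question}\nLet's think step by step."
    )

def self_consistent_answer(sample_chain, prompt, n=8):
    """Phase-1 self-consistency: sample n reasoning chains and return the
    majority final answer. `sample_chain` maps a prompt to one answer
    string (one temperature-sampled decode of the VLM)."""
    votes = Counter(sample_chain(prompt) for _ in range(n))
    return votes.most_common(1)[0][0]

# Toy demo with a stub standing in for the actual multimodal model.
def fake_sampler(prompt):
    return random.choice(["slow down", "slow down", "keep speed"])

prompt = build_prompt(
    question="What should the ego vehicle do at the crosswalk ahead?",
    category="planning",
    metadata={"ego_speed": "8.2 m/s", "annotations": "pedestrian, 12 m ahead"},
    exemplars=["Q: A cyclist merges from the right. A: Yield and slow down."],
)
print(self_consistent_answer(fake_sampler, prompt))
```

Per the reported numbers, majority voting over sampled chains lifted overall accuracy from 65.1% to 66.85%, so even this simple aggregation step carries measurable weight.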
Related papers
- Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning [0.47745223151611654]
This report presents our solution for the RoboSense Challenge at IROS 2025. It evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components.
arXiv Detail & Related papers (2025-10-28T07:43:30Z) - SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection [6.806105013817923]
SAVANT is a structured reasoning framework that achieves high accuracy and recall in detecting anomalous driving scenarios. By automatically labeling over 9,640 real-world images with high accuracy, SAVANT addresses the critical data scarcity problem in anomaly detection.
arXiv Detail & Related papers (2025-10-20T19:14:29Z) - Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis [7.392659193819963]
Traffic safety analysis requires complex video understanding to capture behavioral patterns and generate descriptions for accident prevention. In this work, we present a unique dual-model framework that strategically utilizes the complementary strengths of VideoLLaMA and Qwen2.5-VL through task-specific optimization.
arXiv Detail & Related papers (2025-10-13T20:18:23Z) - Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment [10.701522670464463]
Multimodal large language models (MLLMs) can proficiently evaluate visual quality through interpretable assessments. We propose a unified two-stage training framework comprising a cold-start stage and a reinforcement learning-based fine-tuning stage. We designate the models derived from these two stages as Q-Ponder-CI and Q-Ponder.
arXiv Detail & Related papers (2025-06-03T10:11:51Z) - LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models [59.0256377330646]
LENS is a benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios. The dataset intrinsically supports evaluating how MLLMs handle image-invariable prompts, from basic perception to compositional reasoning. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, and GPT-4o, as well as two reasoning models, QVQ-72B-preview and Kimi-VL.
arXiv Detail & Related papers (2025-05-21T15:06:59Z) - AutoDrive-QA: A Multiple-Choice Benchmark for Vision-Language Evaluation in Urban Autonomous Driving [0.7734726150561086]
We introduce AutoDrive-QA, the first benchmark that systematically converts open-ended driving QA into structured multiple-choice questions. We show that fine-tuning LLaVA-1.5-7B improves accuracy by about six percentage points across tasks, GPT-4V achieves the strongest zero-shot performance with up to 69.8% accuracy, and Qwen2-VL models also perform competitively.
arXiv Detail & Related papers (2025-03-20T01:32:00Z) - DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding [76.3876070043663]
We propose DriveLMM-o1, a dataset and benchmark designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model.
arXiv Detail & Related papers (2025-03-13T17:59:01Z) - Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation [64.64849950642619]
We develop an evaluation framework inspired by formal semantics for evaluating text-to-image models.
We show that Davidsonian Scene Graph (DSG) produces atomic and unique questions organized in dependency graphs.
We also present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts.
arXiv Detail & Related papers (2023-10-27T16:20:10Z) - NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z) - PRover: Proof Generation for Interpretable Reasoning over Rules [81.40404921232192]
We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs.
Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm.
We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation.
arXiv Detail & Related papers (2020-10-06T15:47:53Z) - Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions; a toy sketch of the question-masking step follows this entry.
We achieve a record-breaking 58.95% on VQA-CP v2, a gain of 6.5%.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
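As a postscript to the CSS entry above, here is a minimal, hypothetical sketch of the question-side masking step: given a question and a set of critical words (which the actual method identifies via model attributions; here they are simply passed in), it emits counterfactual variants with those words masked. The mask token and function name are illustrative, not the authors' code.

```python
from itertools import combinations

MASK = "[MASK]"

def counterfactual_questions(question, critical_words, max_masked=2):
    """Generate counterfactual variants of `question` by masking up to
    `max_masked` critical words at a time (simplified question-side CSS;
    the real method scores word criticality with model attributions)."""
    tokens = question.split()
    critical_idx = [i for i, t in enumerate(tokens)
                    if t.strip("?.,").lower() in critical_words]
    variants = []
    for k in range(1, max_masked + 1):
        for subset in combinations(critical_idx, k):
            masked = [MASK if i in subset else t for i, t in enumerate(tokens)]
            variants.append(" ".join(masked))
    return variants

print(counterfactual_questions("What color is the traffic light?",
                               {"color", "traffic", "light"}))
```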