Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning
- URL: http://arxiv.org/abs/2510.24152v1
- Date: Tue, 28 Oct 2025 07:43:30 GMT
- Title: Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning
- Authors: Aodi Wu, Xubo Luo
- Abstract summary: This report presents our solution for the RoboSense Challenge at IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This technical report presents our solution for the RoboSense Challenge at IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components. First, a Mixture-of-Prompts router classifies questions and dispatches them to task-specific expert prompts, eliminating interference across diverse question types. Second, task-specific prompts embed explicit coordinate systems, spatial reasoning rules, role-playing, Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples tailored to each task. Third, a visual assembly module composes multi-view images with object crops, magenta markers, and adaptive historical frames based on question requirements. Fourth, we configure model inference parameters (temperature, top-p, message roles) per task to optimize output quality. Implemented on Qwen2.5-VL-72B, our approach achieves 70.87% average accuracy on Phase-1 (clean data) and 72.85% on Phase-2 (corrupted data), demonstrating that structured prompting and spatial grounding substantially enhance VLM performance on safety-critical autonomous driving tasks. Code and prompts are available at https://github.com/wuaodi/UCAS-CSU-phase2.
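The released code lives at the GitHub link above; the sketch below is only a minimal illustration of how a Mixture-of-Prompts router with per-task inference parameters might be wired up. All prompt texts, keyword rules, and parameter values here are hypothetical placeholders, not the authors' configuration.

```python
# Illustrative Mixture-of-Prompts router with per-task inference parameters.
# Prompts, routing keywords, and sampling values are assumptions for this sketch.
from dataclasses import dataclass

@dataclass
class TaskConfig:
    system_prompt: str   # task-specific expert prompt (role, reasoning rules, few-shot examples)
    temperature: float   # per-task sampling temperature
    top_p: float         # per-task nucleus-sampling cutoff

TASK_CONFIGS = {
    "perception": TaskConfig(
        "You are a driving perception expert. Use the ego-vehicle coordinate frame "
        "and reason step by step about each marked object.", 0.0, 1.0),
    "prediction": TaskConfig(
        "You are a motion forecasting expert. Compare historical frames to infer "
        "each object's likely future motion before answering.", 0.2, 0.9),
    "planning": TaskConfig(
        "You are a planning expert. Enumerate candidate maneuvers, check each for "
        "safety, then choose one.", 0.2, 0.9),
    "corruption": TaskConfig(
        "You are a data-quality inspector. Decide whether the images are corrupted "
        "and how that limits what can be answered.", 0.0, 1.0),
}

def route(question: str) -> str:
    """Keyword-based question classifier (a stand-in for the paper's router)."""
    q = question.lower()
    if any(k in q for k in ("will", "future", "going to")):
        return "prediction"
    if any(k in q for k in ("should", "plan", "safe action")):
        return "planning"
    if any(k in q for k in ("corrupt", "blurred", "noisy")):
        return "corruption"
    return "perception"

def build_request(question: str, image_contents: list) -> dict:
    """Assemble a chat request for a VLM such as Qwen2.5-VL-72B."""
    cfg = TASK_CONFIGS[route(question)]
    return {
        "messages": [
            {"role": "system", "content": cfg.system_prompt},
            {"role": "user", "content": [{"type": "text", "text": question},
                                         *image_contents]},
        ],
        "temperature": cfg.temperature,
        "top_p": cfg.top_p,
    }
```

The router here is a simple keyword classifier; the dispatch pattern it illustrates is the relevant part: one expert prompt plus its own temperature and top-p per question type, so unrelated tasks never share instructions.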
Related papers
- STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning [65.36458157092207]
In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. We propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. We introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization.
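The abstract names three reward terms but not their form; the snippet below is one plausible composition of temporal accuracy, spatial consistency, and format regularization, with the IoU-based definitions and weights being assumptions rather than the paper's actual reward.

```python
# Hypothetical composition of a task-driven STVG reward; components and weights
# are illustrative assumptions, not taken from the STVG-R1 paper.
def temporal_iou(a, b):
    """IoU of two (start, end) frame intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def stvg_reward(pred_span, pred_boxes, gt_span, gt_boxes, well_formed,
                w_t=1.0, w_s=1.0, w_f=0.5):
    r_t = temporal_iou(pred_span, gt_span)                      # temporal accuracy
    r_s = sum(box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / max(len(gt_boxes), 1)
    r_f = 1.0 if well_formed else 0.0                           # structural format term
    return w_t * r_t + w_s * r_s + w_f * r_f
```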
arXiv Detail & Related papers (2026-02-12T08:53:32Z) - Model Specific Task Similarity for Vision Language Model Selection via Layer Conductance [92.72779885657373]
We propose a framework that grounds model selection in the internal functional dynamics of the visual encoder. Our approach represents each task via layer-wise conductance and derives a target-conditioned block importance distribution through entropy-regularized alignment. Building on this, we introduce Directional Conductance Divergence (DCD), an asymmetric metric that quantifies how effectively a source task covers the target's salient functional blocks.
arXiv Detail & Related papers (2026-02-01T17:29:43Z) - Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts [27.64955941993406]
We present a vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. Notably, the system maintains 96% accuracy under severe visual corruption.
arXiv Detail & Related papers (2025-10-21T18:24:59Z) - Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control [22.74768543283102]
Graph-Fused Vision-Language-Action (GF-VLA) is a framework that enables dual-arm robotic systems to perform task-level reasoning and execution. GF-VLA first extracts Shannon-information-based cues to identify hands and objects with the highest task relevance. A cross-hand selection policy then infers the optimal assignment without explicit geometric reasoning.
arXiv Detail & Related papers (2025-08-07T12:48:09Z) - Analyze-Prompt-Reason: A Collaborative Agent-Based Framework for Multi-Image Vision-Language Reasoning [3.588567067449924]
We present a Collaborative Agent-Based Framework for Multi-Image Reasoning. Our approach tackles the challenge of interleaved multimodal reasoning across diverse datasets and task formats. We evaluate our method on 18 diverse datasets from the 2025 MIRAGE Challenge.
arXiv Detail & Related papers (2025-08-01T06:39:15Z) - VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion module employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z) - REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation [57.628771707989166]
We propose an adaptive multi-agent planning framework, termed REMAC, that enables efficient, scene-agnostic multi-robot long-horizon task planning and execution. REMAC incorporates two key modules: a self-reflection module that performs pre-condition and post-condition checks in the loop to evaluate progress and refine plans, and a self-evolvement module that dynamically adapts plans based on scene-specific reasoning.
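The summary names the two modules but not their mechanics; the loop below is a minimal sketch of how pre-/post-condition reflection and plan adaptation could interact, with the `Step` type and the `replan` callback being illustrative assumptions rather than REMAC's implementation.

```python
# Sketch of a reflect-and-evolve planning loop; types and callbacks are assumed.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    precondition: Callable[[dict], bool]   # e.g. "object is reachable"
    action: Callable[[dict], None]         # mutates the shared world state
    postcondition: Callable[[dict], bool]  # e.g. "object is grasped"

def run_plan(plan: List[Step], world: dict,
             replan: Callable[[List[Step], dict], List[Step]],
             max_rounds: int = 20) -> bool:
    """Execute steps, reflecting on pre/post-conditions and re-planning on failure."""
    rounds = 0
    while plan and rounds < max_rounds:
        rounds += 1
        step = plan[0]
        if not step.precondition(world):   # self-reflection: pre-condition check
            plan = replan(plan, world)     # self-evolvement: scene-specific adaptation
            continue
        step.action(world)
        if step.postcondition(world):      # self-reflection: post-condition check
            plan.pop(0)                    # progress confirmed, move to next step
        else:
            plan = replan(plan, world)
    return not plan
```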
arXiv Detail & Related papers (2025-03-28T03:51:40Z) - DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding [76.3876070043663]
We propose DriveLMM-o1, a dataset and benchmark designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model.
arXiv Detail & Related papers (2025-03-13T17:59:01Z) - SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving [15.551625571158056]
We propose an e2eAD method called SimpleLLM4AD.
In our method, the e2eAD task is divided into four stages: perception, prediction, planning, and behavior.
Our experiments demonstrate that SimpleLLM4AD achieves competitive performance in complex driving scenarios.
arXiv Detail & Related papers (2024-07-31T02:35:33Z) - DeTra: A Unified Model for Object Detection and Trajectory Forecasting [68.85128937305697]
Our approach formulates the union of the two tasks as a trajectory refinement problem.
To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects.
In our experiments, we observe that our model outperforms the state-of-the-art on Argoverse 2 Sensor and Open dataset.
arXiv Detail & Related papers (2024-06-06T18:12:04Z) - TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
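As a rough illustration of the two-stage idea (general object detection followed by task-guided selection), the snippet below scores detector crops against a task description with an off-the-shelf CLIP model; the checkpoint, prompt template, and threshold are assumptions for the sketch, not TaskCLIP's actual components.

```python
# Two-stage sketch: any general detector proposes boxes, then a vision-language
# model selects the boxes suited to the task. Model and threshold are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def select_task_objects(image: Image.Image, boxes, task: str, threshold: float = 0.25):
    """boxes: list of (x1, y1, x2, y2) tuples from a general-purpose detector."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    crops = [image.crop(b) for b in boxes]
    inputs = processor(text=[f"an object suitable for {task}"], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between each crop embedding and the task-text embedding.
    scores = torch.nn.functional.cosine_similarity(out.image_embeds, out.text_embeds)
    return [b for b, s in zip(boxes, scores.tolist()) if s > threshold]
```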
arXiv Detail & Related papers (2024-03-12T22:33:02Z) - Planning-oriented Autonomous Driving [60.93767791255728]
We argue that a favorable framework should be devised and optimized in pursuit of the ultimate goal, i.e., planning of the self-driving car.
We introduce Unified Autonomous Driving (UniAD), a comprehensive framework that incorporates full-stack driving tasks in one network.
arXiv Detail & Related papers (2022-12-20T10:47:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.