VLMs Guided Interpretable Decision Making for Autonomous Driving
- URL: http://arxiv.org/abs/2511.13881v1
- Date: Mon, 17 Nov 2025 19:57:51 GMT
- Title: VLMs Guided Interpretable Decision Making for Autonomous Driving
- Authors: Xin Hu, Taotao Jing, Renran Tian, Zhengming Ding
- Abstract summary: We evaluate state-of-the-art open-source vision-language models (VLMs) on high-level decision-making tasks. We propose a new approach that shifts the role of VLMs from direct decision generators to semantic enhancers. Our approach achieves state-of-the-art performance, offering a promising direction for integrating VLMs into reliable and interpretable AD systems.
- Score: 39.29020915361483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in autonomous driving (AD) have explored the use of vision-language models (VLMs) within visual question answering (VQA) frameworks for direct driving decision-making. However, these approaches often depend on handcrafted prompts and suffer from inconsistent performance, limiting their robustness and generalization in real-world scenarios. In this work, we evaluate state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs and identify critical limitations in their ability to deliver reliable, context-aware decisions. Motivated by these observations, we propose a new approach that shifts the role of VLMs from direct decision generators to semantic enhancers. Specifically, we leverage their strong general scene understanding to enrich existing vision-based benchmarks with structured, linguistically rich scene descriptions. Building on this enriched representation, we introduce a multi-modal interactive architecture that fuses visual and linguistic features for more accurate decision-making and interpretable textual explanations. Furthermore, we design a post-hoc refinement module that utilizes VLMs to enhance prediction reliability. Extensive experiments on two autonomous driving benchmarks demonstrate that our approach achieves state-of-the-art performance, offering a promising direction for integrating VLMs into reliable and interpretable AD systems.
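To make the described pipeline concrete, below is a minimal, hypothetical sketch (not the authors' released code) of the multi-modal interactive idea: visual tokens from an ego-view image attend to token embeddings of a VLM-generated scene description, and a small head produces high-level decision logits. All module names, dimensions, and the four-way action space are illustrative assumptions; the post-hoc VLM refinement module is omitted.

```python
# Minimal sketch (assumptions, not the authors' code): fuse ego-view visual
# features with text features from a VLM-generated scene description.
import torch
import torch.nn as nn

class VisionLanguageDecisionHead(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=768, d_model=256, num_actions=4):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)   # project image tokens
        self.txt_proj = nn.Linear(txt_dim, d_model)   # project description tokens
        # Visual tokens attend to the linguistic scene description.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.decision_head = nn.Linear(d_model, num_actions)  # e.g. stop / go / turn left / turn right

    def forward(self, vis_tokens, txt_tokens):
        q = self.vis_proj(vis_tokens)                  # (B, Nv, d)
        kv = self.txt_proj(txt_tokens)                 # (B, Nt, d)
        fused, _ = self.cross_attn(q, kv, kv)          # language-enriched visual tokens
        return self.decision_head(fused.mean(dim=1))   # (B, num_actions) logits

# Toy usage with random features standing in for a vision backbone and a VLM text encoder.
model = VisionLanguageDecisionHead()
vis = torch.randn(2, 49, 512)   # e.g. 7x7 feature-map tokens from an ego-view image
txt = torch.randn(2, 32, 768)   # token embeddings of a VLM-generated scene description
print(model(vis, txt).shape)    # torch.Size([2, 4])
```

Cross-attention is one plausible way to realize the "multi-modal interactive architecture" named in the abstract; the paper may use a different fusion mechanism.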
Related papers
- AutoDriDM: An Explainable Benchmark for Decision-Making of Vision-Language Models in Autonomous Driving [26.866150191410032]
We present AutoDriDM, a decision-centric, progressive benchmark with 6,650 questions across three dimensions - Object, Scene, and Decision.
We evaluate mainstream vision-language models to delineate the perception-to-decision capability boundary in autonomous driving.
We conduct explainability analyses of models' reasoning processes, identifying key failure modes such as logical reasoning errors.
arXiv Detail & Related papers (2026-01-21T06:29:09Z)
- dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning [69.36145467833498]
We introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving.
Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems.
arXiv Detail & Related papers (2025-12-04T05:05:41Z)
- Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning [124.48672228625821]
We introduce Vlaser - a Vision-Language-Action Model with synergistic embodied reasoning capability.
Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks.
Our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
arXiv Detail & Related papers (2025-10-13T05:51:22Z)
- LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving [58.535516533697425]
Large vision-language models (VLMs) have shown promising capabilities in scene understanding.
We propose a novel vision-language framework tailored for autonomous driving, called LMAD.
Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs.
arXiv Detail & Related papers (2025-08-17T15:42:54Z)
- Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning [16.938301925105097]
This paper shows that Vision Language Models can achieve surprisingly strong decision-making performance when visual scenes are replaced by textual descriptions.
We propose Praxis-VLM, a reasoning VLM for vision-grounded decision-making.
arXiv Detail & Related papers (2025-03-21T09:25:23Z)
- VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making [17.313485392764353]
VIPER is a novel framework for multimodal instruction-based planning.
It integrates VLM-based perception with LLM-based reasoning.
We show that VIPER significantly outperforms state-of-the-art visual instruction-based planners.
arXiv Detail & Related papers (2025-03-19T11:05:42Z)
- Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving [5.456780031044544]
We propose a knowledge distillation method that transfers knowledge from large-scale vision-language foundation models to efficient vision networks.
We apply it to pedestrian behavior prediction and scene understanding tasks, achieving promising results in generating more diverse and comprehensive semantic attributes (a minimal distillation sketch follows this entry).
arXiv Detail & Related papers (2025-01-12T01:31:07Z)
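As a rough illustration of the distillation idea named in the entry above (and not that paper's implementation), the sketch below trains a small vision network to match soft attribute predictions from a frozen vision-language teacher. The attribute count, temperature, and network shapes are assumptions made for the example.

```python
# Hypothetical knowledge-distillation sketch: a compact student network matches
# the teacher VLM's soft predictions over semantic attributes.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ATTRIBUTES = 16   # e.g. pedestrian attributes such as "crossing", "looking at ego"
T = 2.0               # softmax temperature for distillation

student = nn.Sequential(          # stand-in for the efficient vision network
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 128), nn.ReLU(),
    nn.Linear(128, NUM_ATTRIBUTES),
)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def distillation_step(images, teacher_logits):
    """One KD step: pull the student's attribute distribution toward the teacher's."""
    student_logits = student(images)
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: random crops and random "teacher" outputs stand in for real VLM predictions.
images = torch.randn(8, 3, 32, 32)
teacher_logits = torch.randn(8, NUM_ATTRIBUTES)
print(distillation_step(images, teacher_logits))
```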
- Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives [56.528835143531694]
We introduce DriveBench, a benchmark dataset designed to evaluate Vision-Language Models (VLMs).
Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding.
We propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding.
arXiv Detail & Related papers (2025-01-07T18:59:55Z)
- Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL).
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks (a minimal prompt-and-parse sketch follows this entry).
arXiv Detail & Related papers (2024-05-16T17:50:19Z)
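For illustration only, the following hedged sketch mirrors the loop that the entry above describes: a task description goes in, the VLM returns chain-of-thought reasoning plus a final action, and the parsed action would feed an RL update. The `query_vlm` stub, the prompt wording, and the action vocabulary are invented placeholders, and the actual policy-gradient update is omitted.

```python
# Hedged sketch of a prompt -> CoT -> action loop for a VLM decision agent.
import re

ACTIONS = ["forward", "stop", "turn_left", "turn_right"]

def build_prompt(task_description: str, observation_caption: str) -> str:
    return (
        f"Task: {task_description}\n"
        f"Observation: {observation_caption}\n"
        "Think step by step, then answer with 'Action: <one of "
        f"{', '.join(ACTIONS)}>'."
    )

def parse_action(vlm_output: str) -> str:
    match = re.search(r"Action:\s*(\w+)", vlm_output)
    action = match.group(1).lower() if match else ""
    return action if action in ACTIONS else "stop"   # fall back to a safe default

def query_vlm(prompt: str) -> str:
    # Placeholder for a real VLM call (e.g. a fine-tuned open-source model).
    return "The crosswalk ahead is occupied by pedestrians. Action: stop"

if __name__ == "__main__":
    prompt = build_prompt("Reach the intersection safely.",
                          "Pedestrians are crossing ahead of the ego vehicle.")
    output = query_vlm(prompt)
    action = parse_action(output)
    # In an RL framework, (prompt, output, reward) tuples would drive a
    # policy-gradient update of the VLM; that training step is omitted here.
    print(action)   # -> "stop"
```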
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.