LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving
- URL: http://arxiv.org/abs/2508.12404v1
- Date: Sun, 17 Aug 2025 15:42:54 GMT
- Title: LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving
- Authors: Nan Song, Bozhou Zhang, Xiatian Zhu, Jiankang Deng, Li Zhang
- Abstract summary: Large vision-language models (VLMs) have shown promising capabilities in scene understanding. We propose a novel vision-language framework tailored for autonomous driving, called LMAD. Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs.
- Score: 58.535516533697425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision-language models (VLMs) have shown promising capabilities in scene understanding, enhancing the explainability of driving behaviors and interactivity with users. Existing methods primarily fine-tune VLMs on on-board multi-view images and scene reasoning text, but this approach often lacks the holistic and nuanced scene recognition and powerful spatial awareness required for autonomous driving, especially in complex situations. To address this gap, we propose a novel vision-language framework tailored for autonomous driving, called LMAD. Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs. In particular, we introduce preliminary scene interaction and specialized expert adapters within the same driving task structure, which better align VLMs with autonomous driving scenarios. Furthermore, our approach is designed to be fully compatible with existing VLMs while seamlessly integrating with planning-oriented driving systems. Extensive experiments on the DriveLM and nuScenes-QA datasets demonstrate that LMAD significantly boosts the performance of existing VLMs on driving reasoning tasks, setting a new standard in explainable autonomous driving.
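The abstract does not spell out how the task-specialized expert adapters are built, but a common way to realize them over a frozen VLM is low-rank residual adapters routed by driving task. Below is a minimal PyTorch sketch under that assumption; the task names, adapter rank, and routing scheme are all illustrative, not LMAD's actual design.

```python
import torch
import torch.nn as nn

class TaskExpertAdapter(nn.Module):
    """Low-rank residual adapter specialized for one driving sub-task."""
    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op residual

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.down(x))

class ExpertAdapterBlock(nn.Module):
    """Freezes a pretrained VLM layer and routes its output through the
    expert adapter matching the current driving task."""
    def __init__(self, vlm_layer: nn.Module, dim: int,
                 tasks=("perception", "prediction", "planning")):
        super().__init__()
        self.vlm_layer = vlm_layer
        for p in self.vlm_layer.parameters():
            p.requires_grad = False  # only the adapters are trained
        self.experts = nn.ModuleDict({t: TaskExpertAdapter(dim) for t in tasks})

    def forward(self, hidden: torch.Tensor, task: str) -> torch.Tensor:
        return self.experts[task](self.vlm_layer(hidden))

block = ExpertAdapterBlock(nn.Linear(768, 768), dim=768)
tokens = torch.randn(2, 32, 768)             # (batch, tokens, hidden)
print(block(tokens, task="planning").shape)  # torch.Size([2, 32, 768])
```

Keeping the VLM frozen and training only per-task adapters is what makes this kind of design "fully compatible with existing VLMs": the backbone weights are untouched and the adapters can be attached to any model with matching hidden size.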
Related papers
- SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving [52.02379432801349]
We propose SGDrive, a novel framework that structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition.
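One way to picture the scene-agent-goal decomposition is as a nested data structure; the field names below are illustrative assumptions, not SGDrive's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Goal:
    """Leaf level: the ego vehicle's short-horizon intent."""
    maneuver: str   # e.g. "yield", "lane_change_left"
    rationale: str  # natural-language justification

@dataclass
class Agent:
    """Mid level: a traffic participant and its bearing on the ego goal."""
    category: str   # e.g. "pedestrian", "vehicle"
    motion: str     # e.g. "crossing", "stationary"
    goals: List[Goal] = field(default_factory=list)

@dataclass
class Scene:
    """Top level: global context conditioning agent- and goal-level reasoning."""
    description: str
    agents: List[Agent] = field(default_factory=list)

scene = Scene(
    description="signalized intersection, light rain",
    agents=[Agent(category="pedestrian", motion="crossing",
                  goals=[Goal(maneuver="yield",
                              rationale="pedestrian has right of way")])],
)
print(scene.agents[0].goals[0].maneuver)  # yield
```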
arXiv Detail & Related papers (2026-01-09T08:55:42Z)
- ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving [64.12414815634847]
Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. We propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer.
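An imagination-and-planning loop of this kind can be sketched as: the world model rolls out each candidate action, and the VLM agent scores the imagined futures. The function names and toy components below are placeholders, not ImagiDrive's actual interfaces.

```python
from typing import Callable, List, Sequence

def imagine_and_plan(
    observation: str,
    candidate_actions: Sequence[str],
    dwm_rollout: Callable[[str, str, int], List[str]],
    vlm_score: Callable[[List[str]], float],
    horizon: int = 4,
) -> str:
    """Pick the action whose imagined future the VLM agent rates highest."""
    best_action, best_score = candidate_actions[0], float("-inf")
    for action in candidate_actions:
        imagined = dwm_rollout(observation, action, horizon)  # DWM imagines
        score = vlm_score(imagined)                           # VLM evaluates
        if score > best_score:
            best_action, best_score = action, score
    return best_action

def toy_rollout(obs: str, act: str, horizon: int) -> List[str]:
    return [f"{obs}|{act}|t{t}" for t in range(horizon)]

def toy_score(frames: List[str]) -> float:
    return -sum("hard_brake" in f for f in frames)  # penalize harsh futures

print(imagine_and_plan("front_cam_t0", ["keep_lane", "hard_brake"],
                       toy_rollout, toy_score))  # keep_lane
```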
arXiv Detail & Related papers (2025-08-15T12:06:55Z)
- LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving [13.136505955569875]
Vision-Language Models (VLMs) have demonstrated significant potential for end-to-end autonomous driving. We introduce LightEMMA, a Lightweight End-to-End Multimodal Model for Autonomous Driving.
arXiv Detail & Related papers (2025-05-01T04:12:41Z)
- VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision [20.43366384946928]
VLM-AD leverages vision-language models (VLMs) as teachers to enhance training, achieving significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.
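A common way to use a VLM as a teacher is to add an auxiliary alignment term to the usual imitation loss. The sketch below assumes cosine feature alignment against frozen teacher features plus an L1 trajectory loss; this is an illustrative formulation, not necessarily VLM-AD's exact objective.

```python
import torch
import torch.nn.functional as F

def vlm_teacher_loss(student_feat: torch.Tensor,
                     teacher_feat: torch.Tensor,
                     planned_traj: torch.Tensor,
                     gt_traj: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Imitation loss on trajectories plus an auxiliary term pulling the
    planner's features toward (detached) VLM teacher features."""
    imitation = F.l1_loss(planned_traj, gt_traj)
    alignment = 1.0 - F.cosine_similarity(
        student_feat, teacher_feat.detach(), dim=-1).mean()
    return imitation + alpha * alignment

# Toy shapes: batch of 2, 6 future waypoints (x, y), 512-d features.
loss = vlm_teacher_loss(torch.randn(2, 512), torch.randn(2, 512),
                        torch.randn(2, 6, 2), torch.randn(2, 6, 2))
print(loss.item())
```

Because the teacher only supplies training signal, the VLM can be dropped at inference time, which is what keeps the deployed planner fast.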
arXiv Detail & Related papers (2024-12-19T01:53:36Z)
- Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving [65.04643267731122]
General MLLMs combined with CLIP often struggle to represent driving-specific scenarios accurately.
We propose the Hints of Prompt (HoP) framework, which introduces three key enhancements.
These hints are fused through a Hint Fusion module, enriching visual representations and enhancing multimodal reasoning.
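A minimal way to sketch such a fusion step is cross-attention from visual tokens to stacked hint tokens; the attention design and shapes below are assumptions, since the abstract does not specify the module.

```python
import torch
import torch.nn as nn

class HintFusion(nn.Module):
    """Visual tokens attend over a set of hint tokens and are enriched
    through a residual connection (illustrative sketch, not HoP's module)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, hints: torch.Tensor) -> torch.Tensor:
        # visual: (B, N, D) image tokens; hints: (B, M, D) stacked hint tokens
        fused, _ = self.attn(query=visual, key=hints, value=hints)
        return self.norm(visual + fused)  # residual enrichment

fusion = HintFusion(dim=256)
visual = torch.randn(1, 196, 256)
hints = torch.cat([torch.randn(1, 8, 256) for _ in range(3)], dim=1)  # three hint types
print(fusion(visual, hints).shape)  # torch.Size([1, 196, 256])
```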
arXiv Detail & Related papers (2024-11-20T06:58:33Z)
- Probing Multimodal LLMs as World Models for Driving [72.18727651074563]
We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving.
Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
arXiv Detail & Related papers (2024-05-09T17:52:42Z)
- Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases [102.05741859030951]
We propose CODA-LM, the first benchmark for the automatic evaluation of LVLMs on self-driving corner cases. We show that using text-only large language models as judges reveals even better alignment with human preferences than the LVLM judges. Our CODA-VLM performs comparably with GPT-4V, even surpassing GPT-4V by +21.42% on the regional perception task.
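A text-only LLM judge of this kind reduces to a format-and-parse step: the judge never sees the image, only the question, a reference answer, and the LVLM's answer. The prompt wording and 1-10 scale below are illustrative assumptions, not CODA-LM's protocol.

```python
def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Format a text-only judging request."""
    return (
        "You are grading an autonomous-driving perception answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Score the candidate from 1 (wrong) to 10 (matches the reference), "
        "and reply with the number only."
    )

def judge(llm_complete, question: str, reference: str, candidate: str) -> int:
    # `llm_complete` is any text-in/text-out LLM call supplied by the caller.
    reply = llm_complete(build_judge_prompt(question, reference, candidate))
    return int(reply.strip().split()[0])

# Stub LLM standing in for a real completion endpoint:
print(judge(lambda p: "8", "What should the ego car do?",
            "Yield to the cyclist.", "Slow down and yield."))  # 8
```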
arXiv Detail & Related papers (2024-04-16T14:20:55Z)
- DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models [31.552397390480525]
We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs).
DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning.
We propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline.
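The dual design can be pictured as a slow VLM path invoked for long-tail scenes, feeding a fast classical refiner. Everything below is a toy stand-in for the real components, not DriveVLM's implementation.

```python
from typing import List, Tuple

Waypoint = Tuple[float, float]

def vlm_coarse_plan(scene_tokens: List[str]) -> List[Waypoint]:
    """Slow path stand-in for the scene description -> scene analysis ->
    hierarchical planning chain, reduced here to fixed coarse waypoints."""
    return [(0.0, 0.0), (1.0, 5.0), (2.0, 10.0)]

def classical_refine(coarse: List[Waypoint]) -> List[Waypoint]:
    """Fast path stand-in: a conventional planner densifies the coarse plan
    (here, simple midpoint interpolation)."""
    refined: List[Waypoint] = []
    for (x0, y0), (x1, y1) in zip(coarse, coarse[1:]):
        refined += [(x0, y0), ((x0 + x1) / 2, (y0 + y1) / 2)]
    return refined + [coarse[-1]]

def drive_dual(scene_tokens: List[str], is_long_tail: bool) -> List[Waypoint]:
    # Route rare, hard scenes through the VLM; otherwise keep the classical plan.
    coarse = (vlm_coarse_plan(scene_tokens) if is_long_tail
              else [(0.0, 0.0), (2.0, 10.0)])
    return classical_refine(coarse)

print(drive_dual(["construction_zone", "occluded_pedestrian"], is_long_tail=True))
```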
arXiv Detail & Related papers (2024-02-19T17:04:04Z)
- Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an approach to apply end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
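Representations queryable by image and text can be sketched as a cosine-similarity lookup of a query embedding against spatial patch features; this is the generic open-vocabulary querying pattern, not necessarily the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

def query_feature_map(feature_map: torch.Tensor,
                      query_emb: torch.Tensor) -> torch.Tensor:
    """Score every spatial location against a query embedding (from either
    a text or an image encoder) via cosine similarity."""
    h, w, d = feature_map.shape
    flat = F.normalize(feature_map.reshape(-1, d), dim=-1)  # (H*W, D)
    q = F.normalize(query_emb, dim=-1)                      # (D,)
    return (flat @ q).reshape(h, w)                         # (H, W) relevance map

# Toy query: random 16x16 map of 512-d features, random "pedestrian" embedding.
heatmap = query_feature_map(torch.randn(16, 16, 512), torch.randn(512))
print(heatmap.shape)  # torch.Size([16, 16])
```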
arXiv Detail & Related papers (2023-10-26T17:56:35Z)