VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving
- URL: http://arxiv.org/abs/2511.06256v1
- Date: Sun, 09 Nov 2025 07:14:53 GMT
- Title: VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving
- Authors: Ruifei Zhang, Wei Zhang, Xiao Tan, Sibei Yang, Xiang Wan, Xiaonan Luo, Guanbin Li
- Abstract summary: We introduce a novel approach featuring a lightweight MLLM architecture with enhanced vision components. VLDrive achieves state-of-the-art driving performance while reducing parameters by 81%.
- Score: 90.21844353859454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in language-grounded autonomous driving have been significantly promoted by the sophisticated cognition and reasoning capabilities of large language models (LLMs). However, current LLM-based approaches encounter critical challenges: (1) Failure analysis reveals that frequent collisions and obstructions, stemming from limitations in visual representations, remain primary obstacles to robust driving performance. (2) The substantial parameters of LLMs pose considerable deployment hurdles. To address these limitations, we introduce VLDrive, a novel approach featuring a lightweight MLLM architecture with enhanced vision components. VLDrive achieves compact visual tokens through innovative strategies, including cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation. Furthermore, we propose a distance-decoupled instruction attention mechanism to improve joint visual-linguistic feature learning, particularly for long-range visual tokens. Extensive experiments conducted in the CARLA simulator demonstrate VLDrive's effectiveness. Notably, VLDrive achieves state-of-the-art driving performance while reducing parameters by 81% (from 7B to 1.3B), yielding substantial driving score improvements of 15.4%, 16.8%, and 7.6% at tiny, short, and long distances, respectively, in closed-loop evaluations. Code is available at https://github.com/ReaFly/VLDrive.
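The abstract names cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation; below is a minimal PyTorch sketch of the generic idea of score-based visual token pruning with a memory token. The norm-based scoring and mean aggregation are illustrative assumptions, not VLDrive's actual method.

```python
# Illustrative sketch: score-based visual token pruning with a memory token.
# The scoring rule and aggregation are assumptions for exposition only.
import torch

def prune_visual_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25):
    """tokens: (B, N, D) visual tokens; returns (B, K+1, D) kept + memory token."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    scores = tokens.norm(dim=-1)                      # stand-in importance score
    keep_idx = scores.topk(k, dim=1).indices          # (B, K)
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    # Aggregate pruned tokens into one "memory" token so their information
    # is summarized rather than discarded outright.
    mask = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, keep_idx, False)
    memory = (tokens * mask.unsqueeze(-1)).sum(1, keepdim=True) / \
             mask.sum(1, keepdim=True).clamp(min=1).unsqueeze(-1)
    return torch.cat([kept, memory], dim=1)           # (B, K+1, D)

out = prune_visual_tokens(torch.randn(2, 256, 768))
print(out.shape)  # torch.Size([2, 65, 768])
```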
Related papers
- MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning [51.20229133553804]
Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL). Online Reinforcement Learning offers a promising pathway to address the limitations of IL through trial-and-error learning. We propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions.
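As a rough illustration of trial-and-error learning over a finite set of discrete linguistic decisions with a trajectory-level reward, here is a toy REINFORCE-style update; the action vocabulary, linear policy head, and reward are placeholders, not MindDrive's LoRA-based architecture.

```python
# Toy policy-gradient step over discrete "linguistic" driving decisions.
import torch
import torch.nn as nn

ACTIONS = ["keep lane", "slow down", "turn left", "turn right", "stop"]
policy = nn.Linear(128, len(ACTIONS))              # stand-in for an LLM decision head
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(128)                           # stand-in scene feature
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()
reward = 1.0 if ACTIONS[action.item()] == "keep lane" else -0.1  # toy trajectory reward
loss = -dist.log_prob(action) * reward             # REINFORCE objective
opt.zero_grad(); loss.backward(); opt.step()
print("sampled:", ACTIONS[action.item()], "reward:", reward)
```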
arXiv Detail & Related papers (2025-12-15T18:31:32Z)
- dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning [69.36145467833498]
We introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems.
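For intuition on the diffusion-based planning component, here is a toy DDPM-style training step that denoises a waypoint trajectory; the noise schedule, MLP denoiser, and shapes are assumptions for exposition, not dVLM-AD's actual formulation.

```python
# Toy diffusion training step over future waypoints: add noise, predict it.
import torch
import torch.nn as nn

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)
denoiser = nn.Sequential(nn.Linear(2 * 8 + 1, 64), nn.ReLU(), nn.Linear(64, 2 * 8))

def diffusion_loss(traj: torch.Tensor) -> torch.Tensor:
    """traj: (B, 8, 2) future waypoints, flattened for the toy denoiser."""
    B = traj.shape[0]
    x0 = traj.reshape(B, -1)
    t = torch.randint(0, T, (B,))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].unsqueeze(-1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise        # forward noising
    pred = denoiser(torch.cat([xt, t.float().unsqueeze(-1) / T], dim=-1))
    return ((pred - noise) ** 2).mean()                # predict the added noise

loss = diffusion_loss(torch.randn(4, 8, 2))
loss.backward()
print(float(loss))
```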
arXiv Detail & Related papers (2025-12-04T05:05:41Z)
- Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation [7.171333807979583]
We introduce Viper-F1, a Hybrid State-Space Vision-Language Model that replaces attention with efficient Liquid State-Space Dynamics. We show that Viper-F1 achieves accurate, fine-grained understanding with significantly improved efficiency.
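To make "replacing attention with state-space dynamics" concrete, here is a minimal input-conditioned recurrence that processes a token sequence in linear time; the gating parameterization is a generic stand-in, not Viper-F1's liquid dynamics.

```python
# Minimal input-conditioned state-space-style scan (no attention).
import torch
import torch.nn as nn

class TinySSM(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.to_decay = nn.Linear(d_model, d_state)   # input-dependent decay
        self.to_input = nn.Linear(d_model, d_state)
        self.to_out = nn.Linear(d_state, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, L, D) token sequence; linear-time scan over L."""
        B, L, _ = x.shape
        h = x.new_zeros(B, self.to_input.out_features)
        outs = []
        for t in range(L):
            a = torch.sigmoid(self.to_decay(x[:, t]))  # per-step forget gate
            h = a * h + self.to_input(x[:, t])
            outs.append(self.to_out(h))
        return torch.stack(outs, dim=1)

y = TinySSM(64)(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```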
arXiv Detail & Related papers (2025-11-14T11:21:48Z)
- AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving [71.55254573283793]
Existing approaches either activate Large Language Models too frequently, causing excessive computational overhead, or use fixed schedules. We propose AdaDrive, an adaptively collaborative slow-fast framework that optimally determines when and how LLMs contribute to decision-making. AdaDrive provides a flexible, context-aware framework that maximizes decision accuracy without compromising real-time performance.
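The slow-fast idea can be sketched as a gate that runs a cheap controller every step and consults a heavy model only when the cheap one is uncertain; the entropy threshold and toy fusion below are assumptions, not AdaDrive's learned mechanism.

```python
# Illustrative slow-fast gate: invoke the "slow" model only under uncertainty.
import torch
import torch.nn as nn

fast_policy = nn.Linear(64, 5)               # cheap per-frame controller

def slow_llm(feat: torch.Tensor) -> torch.Tensor:
    return torch.zeros(feat.shape[0], 5)     # placeholder for an expensive LLM call

def act(feat: torch.Tensor, entropy_thresh: float = 1.5) -> torch.Tensor:
    logits = fast_policy(feat)
    probs = logits.softmax(-1)
    entropy = -(probs * probs.clamp(min=1e-9).log()).sum(-1)
    if entropy.mean() > entropy_thresh:      # uncertain: pay for the slow path
        logits = logits + slow_llm(feat)     # toy fusion of slow guidance
    return logits.argmax(-1)

print(act(torch.randn(1, 64)))
```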
arXiv Detail & Related papers (2025-11-09T07:05:03Z)
- The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning [27.75632811770582]
LightVLA is a differentiable token pruning framework for vision-language-action (VLA) models. It generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. We show that LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.6% improvement in task success rate.
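A minimal sketch of query-scored, Gumbel-softmax token selection follows; the dot-product scoring and fixed query count are simplifying assumptions around the mechanism the abstract describes.

```python
# Differentiable token selection: hard one-hot picks in the forward pass,
# soft gradients in the backward pass via Gumbel softmax.
import torch
import torch.nn.functional as F

def select_tokens(tokens: torch.Tensor, queries: torch.Tensor, tau: float = 1.0):
    """tokens: (B, N, D); queries: (B, K, D). Returns (B, K, D) selected tokens."""
    scores = queries @ tokens.transpose(1, 2)          # (B, K, N) importance
    onehot = F.gumbel_softmax(scores, tau=tau, hard=True, dim=-1)
    return onehot @ tokens                             # (B, K, D)

tokens = torch.randn(2, 196, 64, requires_grad=True)
queries = torch.randn(2, 16, 64)
picked = select_tokens(tokens, queries)
picked.sum().backward()                                # gradients reach tokens
print(picked.shape, tokens.grad is not None)
```

The `hard=True` flag is what makes the selection discrete at inference time while keeping the pruning decision trainable end-to-end.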
arXiv Detail & Related papers (2025-09-16T02:43:46Z)
- Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving [24.2108745917843]
Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving. VLMs offer a unified framework for perception, reasoning, and decision-making by jointly modeling visual inputs and natural language instructions. We propose Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in autonomous driving.
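The multi-view setting adds a budget-allocation question on top of per-view pruning; here is a toy split of a global token budget across camera views, where the norm-based saliency heuristic is an assumption, not Prune2Drive's tuned ratios.

```python
# Toy multi-view token budgeting: split an approximate global budget across
# views by saliency, then keep top-scoring tokens per view.
import torch

def allocate_and_prune(views: list[torch.Tensor], budget: int) -> list[torch.Tensor]:
    """views: list of (N_i, D) token sets, one per camera."""
    saliency = torch.tensor([v.norm(dim=-1).mean() for v in views])
    quota = (saliency / saliency.sum() * budget).long().clamp(min=1)
    pruned = []
    for v, k in zip(views, quota):
        k = min(int(k), v.shape[0])
        idx = v.norm(dim=-1).topk(k).indices
        pruned.append(v[idx])
    return pruned

views = [torch.randn(196, 64) for _ in range(6)]   # six surround cameras
kept = allocate_and_prune(views, budget=256)
print([v.shape[0] for v in kept])
```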
arXiv Detail & Related papers (2025-08-18T18:47:26Z)
- Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models [30.7855782696894]
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. We propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models.
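Training-free action reuse can be illustrated with a small cache: if the current visual features are close enough to the previous step's, skip the expensive forward pass and replay the last action. The cosine-similarity threshold is an assumption, not FlashVLA's actual criterion.

```python
# Illustrative action-reuse cache keyed on feature similarity.
import torch

class ActionCache:
    def __init__(self, sim_thresh: float = 0.98):
        self.sim_thresh = sim_thresh
        self.prev_feat = None
        self.prev_action = None

    def step(self, feat: torch.Tensor, vla_forward):
        if self.prev_feat is not None:
            sim = torch.cosine_similarity(feat, self.prev_feat, dim=-1).mean()
            if sim > self.sim_thresh:          # scene barely changed: reuse
                return self.prev_action
        self.prev_feat, self.prev_action = feat, vla_forward(feat)
        return self.prev_action

cache = ActionCache()
feat = torch.randn(1, 512)
a1 = cache.step(feat, lambda f: f.mean())           # full forward pass
a2 = cache.step(feat + 1e-4, lambda f: f.mean())    # reused, forward skipped
print(a1, a2)
```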
arXiv Detail & Related papers (2025-05-27T13:47:18Z)
- Tracking Meets Large Multimodal Models for Driving Scenario Understanding [76.71815464110153]
Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details. We introduce a novel approach for embedding this tracking information into LMMs to enhance their understanding of driving scenarios.
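One simple way to feed track histories into an LMM is to serialize them into the prompt; the format below is a hypothetical illustration, not the paper's embedding scheme.

```python
# Toy serialization of agent track histories into an LMM prompt snippet.
def tracks_to_prompt(tracks: dict[int, list[tuple[float, float]]]) -> str:
    lines = ["Tracked agents (id: recent x,y positions in meters):"]
    for tid, history in tracks.items():
        pts = "; ".join(f"({x:.1f}, {y:.1f})" for x, y in history[-3:])
        lines.append(f"  agent {tid}: {pts}")
    return "\n".join(lines)

tracks = {7: [(4.0, 1.2), (3.5, 1.1), (3.0, 1.0)], 9: [(-2.0, 8.0), (-2.1, 7.4)]}
print(tracks_to_prompt(tracks))
```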
arXiv Detail & Related papers (2025-03-18T17:59:12Z)
- Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation [109.5893580175657]
In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data. We propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the LLM's hidden representations.
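Infusing expert visual knowledge into LLM hidden states can be sketched as a distillation loss that pulls a projection of the LLM's hidden states toward a frozen expert encoder's embeddings; the dimensions and cosine objective are assumptions about the general technique, not VisPer-LM's exact loss.

```python
# Minimal embedding-distillation loss against a frozen vision expert.
import torch
import torch.nn as nn
import torch.nn.functional as F

proj = nn.Linear(4096, 1024)                    # LLM hidden dim -> expert dim

def distill_loss(llm_hidden: torch.Tensor, expert_emb: torch.Tensor) -> torch.Tensor:
    """llm_hidden: (B, 4096); expert_emb: (B, 1024), treated as frozen targets."""
    pred = proj(llm_hidden)
    return 1.0 - F.cosine_similarity(pred, expert_emb.detach(), dim=-1).mean()

loss = distill_loss(torch.randn(8, 4096), torch.randn(8, 1024))
loss.backward()
print(float(loss))
```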
arXiv Detail & Related papers (2024-12-12T18:55:18Z)
- Probing Multimodal LLMs as World Models for Driving [72.18727651074563]
We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving.
Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
arXiv Detail & Related papers (2024-05-09T17:52:42Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.