AD-EE: Early Exiting for Fast and Reliable Vision-Language Models in Autonomous Driving
- URL: http://arxiv.org/abs/2506.05404v1
- Date: Wed, 04 Jun 2025 08:25:40 GMT
- Title: AD-EE: Early Exiting for Fast and Reliable Vision-Language Models in Autonomous Driving
- Authors: Lianming Huang, Haibo Hu, Yufei Cui, Jiacheng Zuo, Shangyu Wu, Nan Guan, Chun Jason Xue
- Abstract summary: Real-time application of Vision-Language Models (VLMs) is hindered by high latency and computational overhead. We propose AD-EE, an Early Exit framework that incorporates domain characteristics of autonomous driving. We show that our method significantly reduces latency, with maximum improvements reaching up to 57.58%, and enhances object detection accuracy, with maximum gains of up to 44%.
- Score: 14.250084730478797
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid advancement of autonomous driving, deploying Vision-Language Models (VLMs) to enhance perception and decision-making has become increasingly common. However, the real-time application of VLMs is hindered by high latency and computational overhead, limiting their effectiveness in time-critical driving scenarios. This challenge is particularly evident when VLMs exhibit over-inference, continuing to process unnecessary layers even after confident predictions have been reached. To address this inefficiency, we propose AD-EE, an Early Exit framework that incorporates domain characteristics of autonomous driving and leverages causal inference to identify optimal exit layers. We evaluate our method on large-scale real-world autonomous driving datasets, including Waymo and the corner-case-focused CODA, as well as on a real vehicle running the Autoware Universe platform. Extensive experiments across multiple VLMs show that our method significantly reduces latency, with maximum improvements reaching up to 57.58%, and enhances object detection accuracy, with maximum gains of up to 44%.
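The core idea the abstract describes, stopping inference at an intermediate layer once a prediction is already confident, can be illustrated with a minimal sketch. This is a generic confidence-threshold early exit for illustration only, not the AD-EE method itself (the paper identifies exit layers via causal inference rather than a runtime threshold); the class and parameter names (ExitingVLM, exit_heads, exit_threshold) are hypothetical.

```python
# Minimal sketch of confidence-based early exiting (illustrative; not AD-EE itself).
# Assumptions: `layers` are transformer blocks mapping (batch, tokens, dim) tensors
# to tensors of the same shape, and each `exit_heads[i]` is a small classifier
# (e.g. nn.Linear) over the pooled hidden state. All names here are hypothetical.
import torch
import torch.nn as nn


class ExitingVLM(nn.Module):
    def __init__(self, layers: nn.ModuleList, exit_heads: nn.ModuleList,
                 exit_threshold: float = 0.9):
        super().__init__()
        self.layers = layers                  # backbone blocks
        self.exit_heads = exit_heads          # one lightweight head per block
        self.exit_threshold = exit_threshold  # softmax confidence needed to stop

    @torch.no_grad()
    def forward(self, hidden: torch.Tensor):
        logits = None
        for depth, (layer, head) in enumerate(zip(self.layers, self.exit_heads)):
            hidden = layer(hidden)                      # run one more block
            logits = head(hidden.mean(dim=1))           # pool tokens, then predict
            confidence = logits.softmax(dim=-1).max(dim=-1).values
            # Exit as soon as every sample in the batch is confident enough,
            # skipping the remaining (deeper) layers entirely.
            if bool((confidence >= self.exit_threshold).all()):
                return logits, depth
        return logits, len(self.layers) - 1             # fell through to the last layer
```

Presumably, a deployment following the paper's approach would fix the exit layer offline from the causal analysis, so the network could simply be truncated at that depth instead of evaluating a per-layer confidence check at runtime.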
Related papers
- DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving [15.457670964093156]
We propose a novel hybrid sparse-dense diffusion policy empowered by a Vision-Language Model (VLM). Our method shows superior performance in the Autonomous Grand Challenge 2025, which contains challenging real and reactive synthetic scenarios.
arXiv Detail & Related papers (2025-05-26T00:49:35Z) - SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving [51.47621083057114]
SOLVE is an innovative framework that synergizes Vision-Language Models with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components.
arXiv Detail & Related papers (2025-05-22T15:44:30Z) - LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving [9.447298958886265]
Vision-Language Models (VLMs) have demonstrated significant potential for end-to-end autonomous driving. We introduce LightEMMA, a Lightweight End-to-End Multimodal Model for Autonomous driving. We construct twelve autonomous driving agents using various VLMs and evaluate their performance on the nuScenes prediction task.
arXiv Detail & Related papers (2025-05-01T04:12:41Z) - VLM-C4L: Continual Core Dataset Learning with Corner Case Optimization via Vision-Language Models for Autonomous Driving [20.136096264189156]
We propose VLM-C4L, a continual learning framework that introduces Vision-Language Models (VLMs) to dynamically optimize and enhance corner case datasets. VLM-C4L combines VLM-guided high-quality data extraction with a core data replay strategy, enabling the model to incrementally learn from diverse corner cases.
arXiv Detail & Related papers (2025-03-29T11:40:34Z) - RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving [10.984203470464687]
Vision-language models (VLMs) often suffer from limitations such as inadequate spatial perception and hallucination. We propose a retrieval-augmented decision-making (RAD) framework to enhance VLMs' capabilities to reliably generate meta-actions in autonomous driving scenes. We fine-tune VLMs on a dataset derived from the NuScenes dataset to enhance their spatial perception and bird's-eye view image comprehension capabilities.
arXiv Detail & Related papers (2025-03-18T03:25:57Z) - TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning [61.33599727106222]
TeLL-Drive is a hybrid framework that integrates a Teacher LLM to guide an attention-based Student DRL policy. A self-attention mechanism then fuses the teacher's strategies with the DRL agent's exploration, accelerating policy convergence and boosting robustness.
arXiv Detail & Related papers (2025-02-03T14:22:03Z) - Black-Box Adversarial Attack on Vision Language Models for Autonomous Driving [65.61999354218628]
We take the first step toward designing black-box adversarial attacks specifically targeting vision-language models (VLMs) in autonomous driving systems. We propose Cascading Adversarial Disruption (CAD), which targets low-level reasoning breakdown by generating and injecting semantics. We present Risky Scene Induction, which addresses dynamic adaptation by leveraging a surrogate VLM to understand and construct high-level risky scenarios.
arXiv Detail & Related papers (2025-01-23T11:10:02Z) - Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives [56.528835143531694]
We introduce DriveBench, a benchmark dataset designed to evaluate Vision-Language Models (VLMs). Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding. We propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding.
arXiv Detail & Related papers (2025-01-07T18:59:55Z) - VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision [20.43366384946928]
VLM-AD leverages vision-language models (VLMs) as teachers to enhance training. It achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.
arXiv Detail & Related papers (2024-12-19T01:53:36Z) - DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving [55.53171248839489]
We propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving. Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction and an iterative motion planner. Experiments conducted on the nuScenes and Bench2Drive datasets demonstrate the superior planning performance and great efficiency of DiFSD.
arXiv Detail & Related papers (2024-09-15T15:55:24Z) - Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases [102.05741859030951]
We propose CODA-LM, the first benchmark for the automatic evaluation of LVLMs for self-driving corner cases. We show that using text-only large language models as judges reveals even better alignment with human preferences than the LVLM judges. Our CODA-VLM performs comparably with GPT-4V, even surpassing GPT-4V by +21.42% on the regional perception task.
arXiv Detail & Related papers (2024-04-16T14:20:55Z) - Bayesian Optimization and Deep Learning for steering wheel angle prediction [58.720142291102135]
This work aims to obtain an accurate model for the prediction of the steering angle in an automated driving system.
BO was able to identify, within a limited number of trials, a model, namely BOST-LSTM, which proved the most accurate when compared to classical end-to-end driving models.
arXiv Detail & Related papers (2021-10-22T15:25:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.