DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving
- URL: http://arxiv.org/abs/2511.20720v1
- Date: Tue, 25 Nov 2025 07:00:26 GMT
- Title: DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving
- Authors: Haibo Hu, Lianming Huang, Nan Guan, Chun Jason Xue
- Abstract summary: Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.
- Score: 20.235153433297384
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Instead of relying on confidence scores, DeeAD terminates inference when predicted trajectories align with lightweight planning priors (e.g., Navigation or Low-precision Planning) within a tolerable deviation (<2m). To improve efficiency, we introduce a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. DeeAD integrates into existing VLA models, such as ORION, without requiring retraining. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.
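To make the exit policy concrete, below is a minimal Python sketch of the action-guided early-exit loop the abstract describes. The helper names (`layers`, `decode_traj`, `prior`) and the hop heuristic are illustrative assumptions, not the paper's actual API; only the <2m tolerance and the rate-based multi-hop idea come from the text.

```python
import numpy as np

# Hypothetical helpers: `layers` is the VLA transformer stack, `decode_traj`
# maps a hidden state to (T, 2) waypoints, `prior` is a lightweight planning
# prior (e.g., a navigation route). All names are illustrative assumptions.

def deviation(traj: np.ndarray, prior: np.ndarray) -> float:
    """Mean Euclidean waypoint deviation in meters."""
    return float(np.linalg.norm(traj - prior, axis=-1).mean())

def deead_forward(layers, hidden, decode_traj, prior, exit_tol=2.0, max_hop=4):
    """Early-exit forward pass: stop as soon as the intermediate trajectory
    is physically consistent with the prior (deviation < exit_tol meters);
    otherwise hop over more layers when the score is changing slowly."""
    prev_score, i = None, 0
    while i < len(layers):
        hidden = layers[i](hidden)
        score = deviation(decode_traj(hidden), prior)
        if score < exit_tol:                # feasible enough: exit early
            return hidden, i + 1            # (output, exit depth)
        if prev_score is not None:
            # Low change rate => diminishing returns => larger hop.
            rate = abs(prev_score - score) / max(prev_score, 1e-6)
            hop = 1 + int((1.0 - min(rate, 1.0)) * (max_hop - 1))
        else:
            hop = 1
        prev_score, i = score, i + hop
    return hidden, len(layers)
```

Skipped layers are never executed, which is where the reported transformer-layer sparsity would come from; the exit depth here is the index of the last layer evaluated.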
Related papers
- dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning [69.36145467833498]
We introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems.
arXiv Detail & Related papers (2025-12-04T05:05:41Z)
- DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action [62.70893433854428]
We propose DualVLA, which enhances action performance through carefully designed post-training while still preserving reasoning capability. Experiments show that DualVLA achieves an average success rate of 61.0 in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks.
arXiv Detail & Related papers (2025-11-27T06:03:53Z)
- DAP: A Discrete-token Autoregressive Planner for Autonomous Driving [34.32497598431514]
We introduce DAP, a discrete-token autoregressive planner that jointly forecasts BEV semantics and ego trajectories. We incorporate reinforcement-learning-based fine-tuning, which preserves supervised behavior-cloning priors while injecting reward-guided improvements. DAP achieves state-of-the-art performance on open-loop metrics and delivers competitive closed-loop results on the NAVSIM benchmark.
arXiv Detail & Related papers (2025-11-17T12:31:33Z)
- ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving [64.42138266293202]
ResAD is a Normalized Residual Trajectory Modeling framework. It reframes the learning task to predict the residual deviation from an inertial reference (see the sketch after this entry). On the NAVSIM benchmark, ResAD achieves a state-of-the-art PDMS of 88.6 using a vanilla diffusion policy.
arXiv Detail & Related papers (2025-10-09T17:59:36Z)
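As a rough illustration of ResAD's reframing (an assumption-laden sketch based only on the abstract, not the paper's code): instead of regressing absolute waypoints, the model predicts a normalized residual relative to a constant-velocity inertial rollout, and the trajectory is reconstructed at inference.

```python
import numpy as np

# Hypothetical sketch of residual trajectory modeling a la ResAD.
# The constant-velocity reference and the normalization scale are
# illustrative assumptions, not the paper's exact scheme.

def inertial_reference(pos, vel, horizon, dt=0.5):
    """Constant-velocity rollout from current 2D position/velocity."""
    t = np.arange(1, horizon + 1)[:, None] * dt   # (T, 1) timestamps
    return pos[None, :] + t * vel[None, :]        # (T, 2) waypoints

def to_residual(traj, ref, scale=1.0):
    """Learning target: normalized deviation from the inertial reference."""
    return (traj - ref) / scale

def from_residual(residual, ref, scale=1.0):
    """Inference: reconstruct absolute waypoints from a predicted residual."""
    return ref + scale * residual
```

A diffusion policy, as in the entry above, would then denoise the residual rather than the raw waypoints.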
- Edge-Based Multimodal Sensor Data Fusion with Vision Language Models (VLMs) for Real-time Autonomous Vehicle Accident Avoidance [12.513296074529727]
This paper proposes the Real-time Edge-based Autonomous Co-pilot Trajectory planner (REACT) for autonomous driving. REACT is a V2X-integrated trajectory optimization framework for AD based on a fine-tuned lightweight Vision-Language Model (VLM). Evaluated on the DeepAccident benchmark, REACT achieves state-of-the-art performance: a 77% collision rate reduction, a 48.2% Video Panoptic Quality (VPQ), and a 0.57-second inference latency on the Jetson AGX Orin.
arXiv Detail & Related papers (2025-08-01T20:16:04Z)
- SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [70.72227437717467]
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. However, their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. We propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens.
arXiv Detail & Related papers (2025-06-15T05:04:17Z)
- Variable Time-Step MPC for Agile Multi-Rotor UAV Interception of Dynamic Targets [6.0967385124149756]
Agile planning with existing nonlinear model predictive control methods is limited by the number of planning steps, which becomes increasingly demanding computationally. In this paper, we propose to address these limitations by introducing variable time steps and coupling them with the prediction horizon length (a sketch follows this entry). A simplified point-mass motion primitive is used to leverage the differential flatness of quadrotor dynamics and the generation of feasible trajectories in the flat output space.
arXiv Detail & Related papers (2025-03-18T11:59:24Z)
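To illustrate the variable time-step idea (a minimal sketch under assumed details: the geometric step-growth rule below is one common way to couple step size to horizon length, not necessarily this paper's formulation): with a fixed budget of N steps, letting step durations grow along the horizon covers a long prediction window while keeping early steps fine-grained.

```python
import numpy as np

# Hypothetical sketch: allocate N variable time steps that grow geometrically
# so a fixed step budget spans a desired prediction horizon. The geometric
# rule is an illustrative assumption, not the paper's exact coupling.

def variable_time_steps(n_steps: int, horizon: float, growth: float = 1.2):
    """Return step durations dt_k = dt0 * growth**k that sum to `horizon`."""
    weights = growth ** np.arange(n_steps)
    dt0 = horizon / weights.sum()
    return dt0 * weights

dts = variable_time_steps(n_steps=20, horizon=4.0)
print(dts[:3], dts[-3:], dts.sum())  # fine early steps, coarse late ones, total 4.0 s
```

Coupling `growth` or `n_steps` to the desired horizon keeps the NMPC problem size fixed while extending lookahead.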
- DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving [62.62464518137153]
DriveTransformer is a simplified E2E-AD framework for ease of scaling up. It is composed of three unified operations: task self-attention, sensor cross-attention, and temporal cross-attention (see the sketch after this entry). It achieves state-of-the-art performance in both the simulated closed-loop benchmark Bench2Drive and the real-world open-loop benchmark nuScenes with high FPS.
arXiv Detail & Related papers (2025-03-07T11:41:18Z)
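A minimal PyTorch sketch of one block built from the three unified operations named above; the dimensions, ordering, and residual wiring are assumptions for illustration, not DriveTransformer's actual architecture.

```python
import torch.nn as nn

class UnifiedBlock(nn.Module):
    """Illustrative block: task self-attention, sensor cross-attention,
    temporal cross-attention (hypothetical wiring, residual after each)."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.task_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sensor_cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal_cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, task_q, sensor_kv, temporal_kv):
        # Task queries attend to each other (task self-attention) ...
        x = self.norms[0](task_q + self.task_self(task_q, task_q, task_q)[0])
        # ... then to current sensor features (sensor cross-attention) ...
        x = self.norms[1](x + self.sensor_cross(x, sensor_kv, sensor_kv)[0])
        # ... then to cached past-frame features (temporal cross-attention).
        x = self.norms[2](x + self.temporal_cross(x, temporal_kv, temporal_kv)[0])
        return x
```

In a real system the task queries would carry separate detection, motion, and planning tokens; here they are a single tensor for brevity.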
- AdaShadow: Responsive Test-time Model Adaptation in Non-stationary Mobile Environments [24.606016498430407]
This paper presents AdaShadow, a responsive test-time adaptation framework for non-stationary mobile data distribution and resource dynamics.
AdaShadow addresses challenges in estimating layer importance and latency, as well as scheduling the optimal layer update plan.
Results show that AdaShadow achieves the best accuracy-latency balance under continual shifts.
arXiv Detail & Related papers (2024-10-10T16:41:39Z)
- DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving [55.53171248839489]
We propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving. Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction, and an iterative motion planner. Experiments conducted on the nuScenes and Bench2Drive datasets demonstrate the superior planning performance and great efficiency of DiFSD.
arXiv Detail & Related papers (2024-09-15T15:55:24Z)
- End-to-End Autonomous Driving without Costly Modularization and 3D Manual Annotation [34.070813293944944]
We propose UAD, a method for vision-based end-to-end autonomous driving (E2EAD).
Our motivation stems from the observation that current E2EAD models still mimic the modular architecture in typical driving stacks.
Our UAD achieves a 38.7% relative improvement over UniAD on the average collision rate in nuScenes and surpasses VAD by 41.32 points on the driving score in CARLA's Town05 Long benchmark.
arXiv Detail & Related papers (2024-06-25T16:12:52Z)