FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning
- URL: http://arxiv.org/abs/2507.23318v3
- Date: Tue, 16 Sep 2025 09:59:46 GMT
- Title: FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning
- Authors: Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Zhuo Li, Xiaobao Wei, Sixiang Chen, Liyun Li, Xianming Liu, Ming Lu, Yang Wang, Shanghang Zhang,
- Abstract summary: We propose FastDriveVLA, a reconstruction-based vision token pruning framework for autonomous driving. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Our approach achieves state-of-the-art results on the nuScenes open-loop planning benchmark across different pruning ratios.
- Score: 75.80110543049783
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual token sequences of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLMs) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. Our approach achieves state-of-the-art results on the nuScenes open-loop planning benchmark across different pruning ratios.
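The abstract describes ReconPruner as a plug-and-play module that scores visual tokens by foreground relevance and keeps only the most informative ones before they reach the language model. The sketch below illustrates the general top-k token-pruning pattern this implies, in PyTorch; the names (ScoreHead, prune_tokens, keep_ratio) are hypothetical, and the paper's actual MAE-style reconstruction training and adversarial foreground-background objective are not reproduced here.

```python
# Illustrative sketch only: a minimal top-k visual token pruner in the spirit of
# a plug-and-play ReconPruner. The scoring head and keep ratio are assumptions,
# not the paper's actual architecture or training procedure.
import torch
import torch.nn as nn


class ScoreHead(nn.Module):
    """Assigns a scalar foreground-relevance score to each visual token."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) -> scores: (batch, num_tokens)
        return self.mlp(tokens).squeeze(-1)


def prune_tokens(tokens: torch.Tensor, scorer: ScoreHead, keep_ratio: float = 0.25):
    """Keep the top-k highest-scoring tokens per image, preserving their order,
    so the pruned sequence can be fed to the downstream VLA without retraining."""
    scores = scorer(tokens)                                       # (B, N)
    k = max(1, int(tokens.shape[1] * keep_ratio))
    keep_idx = scores.topk(k, dim=1).indices.sort(dim=1).values   # keep spatial order
    batch_idx = torch.arange(tokens.shape[0]).unsqueeze(-1)
    return tokens[batch_idx, keep_idx], keep_idx


# Example: prune 576 ViT tokens down to 25% before the language model sees them.
if __name__ == "__main__":
    feats = torch.randn(2, 576, 1024)
    pruned, idx = prune_tokens(feats, ScoreHead(1024), keep_ratio=0.25)
    print(pruned.shape)  # torch.Size([2, 144, 1024])
```

In the paper's setup, such a scorer would be trained once against a frozen visual encoder (here, via the reconstruction objective) and then reused across VLA models sharing that encoder; the fraction of tokens kept trades planning accuracy against inference cost.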
Related papers
- BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model [44.72361174037017]
Vision-Language-Action (VLA) models have achieved significant breakthroughs by leveraging large Vision-Language Models (VLMs) to jointly interpret instructions and visual inputs. The substantial increase in visual tokens, particularly from multi-view inputs, poses serious challenges to real-time robotic manipulation. We propose BFA++, a dynamic token pruning framework designed specifically for VLA models.
arXiv Detail & Related papers (2026-02-24T05:31:52Z) - Visual Generation Tuning [84.50113837230333]
We propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying capabilities of visual generation within vision language models. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs. Our proposed VGT showcases significant scaling promise and is versatile for endowing any VLMs trained for multimodal understanding with the capabilities of visual generation.
arXiv Detail & Related papers (2025-11-28T18:57:13Z) - Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving [24.2108745917843]
Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving. VLMs offer a unified framework for perception, reasoning, and decision-making by jointly modeling visual inputs and natural language instructions. We propose Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in autonomous driving.
arXiv Detail & Related papers (2025-08-18T18:47:26Z) - TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving [10.439455144126617]
TinyDrive is a lightweight VLM for multi-view VQA in driving scenarios. Our model comprises two key components: a multiscale vision encoder and a dual-level prioritization mechanism for tokens and sequences. TinyDrive is first evaluated on our custom-curated VQA dataset, and it is subsequently tested on the public DriveLM benchmark.
arXiv Detail & Related papers (2025-05-21T14:19:24Z) - CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z) - CASPFormer: Trajectory Prediction from BEV Images with Deformable Attention [4.9349065371630045]
We propose Context Aware Scene Prediction Transformer (CASPFormer), which can perform multi-modal motion prediction from spatialized Bird-Eye-View (BEV) images.
Our system can be integrated with any upstream perception module that is capable of generating BEV images.
We evaluate our model on the nuScenes dataset and show that it reaches state-of-the-art across multiple metrics.
arXiv Detail & Related papers (2024-09-26T12:37:22Z) - Enhancing End-to-End Autonomous Driving with Latent World Model [78.22157677787239]
We propose a novel self-supervised learning approach using the LAtent World model (LAW) for end-to-end driving. LAW predicts future scene features based on current features and ego trajectories. This self-supervised task can be seamlessly integrated into perception-free and perception-based frameworks.
arXiv Detail & Related papers (2024-06-12T17:59:21Z) - Street-View Image Generation from a Bird's-Eye View Layout [95.36869800896335]
Bird's-Eye View (BEV) Perception has received increasing attention in recent years.
Data-driven simulation for autonomous driving has been a focal point of recent research.
We propose BEVGen, a conditional generative model that synthesizes realistic and spatially consistent surrounding images.
arXiv Detail & Related papers (2023-01-11T18:39:34Z) - Beyond the Field-of-View: Enhancing Scene Visibility and Perception with Clip-Recurrent Transformer [28.326852785609788]
FlowLens architecture explicitly employs optical flow and implicitly incorporates a novel clip-recurrent transformer for feature propagation.
In this paper, we propose the concept of online video inpainting for autonomous vehicles to expand the field of view.
Experiments and user studies involving offline and online video inpainting, as well as beyond-FoV perception tasks, demonstrate that FlowLens achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-11-21T09:34:07Z) - Monocular BEV Perception of Road Scenes via Front-to-Top View Projection [57.19891435386843]
We present a novel framework that reconstructs a local map formed by road layout and vehicle occupancy in the bird's-eye view.
Our model runs at 25 FPS on a single GPU, which is efficient and applicable for real-time panorama HD map reconstruction.
arXiv Detail & Related papers (2022-11-15T13:52:41Z) - Structured Bird's-Eye-View Traffic Scene Understanding from Onboard Images [128.881857704338]
We study the problem of extracting a directed graph representing the local road network in BEV coordinates, from a single onboard camera image.
We show that the method can be extended to detect dynamic objects on the BEV plane.
We validate our approach against powerful baselines and show that our network achieves superior performance.
arXiv Detail & Related papers (2021-10-05T12:40:33Z)