VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
- URL: http://arxiv.org/abs/2511.16449v2
- Date: Fri, 21 Nov 2025 11:57:47 GMT
- Title: VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
- Authors: Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, Bo Zhao
- Abstract summary: Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost limits their real-time deployment. We propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models. VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
- Score: 17.901428758295307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a path toward efficient VLA inference. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, they bias token retention toward semantic cues, discard information critical for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models and exploits the temporal continuity of robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance, and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner introduces a dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under a given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
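The abstract describes two ingredients: a dual-level importance score (prefill attention for semantics, temporally smoothed action-decode attention for control) and a budgeted selection step over visual tokens. The Python sketch below illustrates one plausible reading of that selection step under stated assumptions; the function names, the EMA-based smoothing, and the 50/50 budget split are illustrative choices, not the authors' released implementation.

```python
# Minimal, illustrative sketch of dual-level visual token selection in the spirit of
# VLA-Pruner. All names and the exact smoothing/budget scheme are assumptions.
from typing import Optional, Tuple
import torch


def select_visual_tokens(
    prefill_attn: torch.Tensor,                  # (num_text_tokens, num_visual_tokens) prefill attention onto visual tokens
    prev_action_attn: Optional[torch.Tensor],    # (num_visual_tokens,) action-decode attention from the previous step, or None
    ema_action_attn: Optional[torch.Tensor],     # running EMA of action-decode attention (temporal smoothing state)
    keep_ratio: float = 0.25,                    # compute budget: fraction of visual tokens to retain
    ema_momentum: float = 0.8,                   # weight on the historical estimate
    semantic_share: float = 0.5,                 # fraction of the budget reserved for semantic-level picks
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
    num_visual = prefill_attn.shape[-1]
    budget = max(1, int(keep_ratio * num_visual))

    # Semantic-level importance: how strongly prefill (language) tokens attend to each visual token.
    semantic_score = prefill_attn.mean(dim=0)

    # Action-level importance: decode-time attention is unavailable before decoding,
    # so estimate it from previous control steps via exponential moving average.
    if prev_action_attn is not None:
        if ema_action_attn is None:
            ema_action_attn = prev_action_attn
        else:
            ema_action_attn = ema_momentum * ema_action_attn + (1.0 - ema_momentum) * prev_action_attn
        action_score = ema_action_attn
    else:
        action_score = semantic_score  # no action history yet: fall back to semantics

    # Dual-level selection: fill part of the budget with semantically salient tokens,
    # then top up with the highest-scoring action-relevant tokens not already kept.
    k_sem = max(1, int(semantic_share * budget))
    keep = set(torch.topk(semantic_score, k_sem).indices.tolist())
    for idx in torch.argsort(action_score, descending=True).tolist():
        if len(keep) >= budget:
            break
        keep.add(idx)

    keep_idx = torch.tensor(sorted(keep), dtype=torch.long)
    return keep_idx, ema_action_attn
```

In use, the attention that the action decoder placed on visual tokens while producing the previous action would be fed back as prev_action_attn at the next control step, so the retained set tracks both the instruction semantics and the recent manipulation focus while staying within the token budget.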
Related papers
- BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model [44.72361174037017]
Vision-Language-Action (VLA) models have achieved significant breakthroughs by leveraging Large Vision-Language Models (VLMs) to jointly interpret instructions and visual inputs. The substantial increase in visual tokens, particularly from multi-view inputs, poses serious challenges to real-time robotic manipulation. We propose BFA++, a dynamic token pruning framework designed specifically for VLA models.
arXiv Detail & Related papers (2026-02-24T05:31:52Z) - ActionCodec: What Makes for Good Action Tokenizers [106.78093973045526]
Vision-Language-Action (VLA) models have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity. We introduce ActionCodec, a high-performance action tokenizer that significantly enhances both training efficiency and VLA performance.
arXiv Detail & Related papers (2026-02-17T07:07:15Z) - Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement [27.517125673741486]
Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic control. We propose SD-VLA, a framework that disentangles visual inputs into multi-level static and dynamic tokens. We introduce a new benchmark that more effectively evaluates the long-horizon temporal dependency modeling ability of VLAs.
arXiv Detail & Related papers (2026-02-03T20:17:47Z) - ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context [54.58057019521198]
Leveraging temporal context is crucial for success in partially observable robotic tasks. Prior work in behavior cloning has demonstrated inconsistent performance gains when using multi-frame observations. We introduce ContextVLA, a policy model that robustly improves robotic task performance by effectively leveraging multi-frame observations.
arXiv Detail & Related papers (2025-10-05T15:29:57Z) - dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z) - CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification [48.81250395291505]
Recent Vision-Language-Action models require extensive post-training, resulting in high computational overhead. We propose CogVLA, a framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA.
arXiv Detail & Related papers (2025-08-28T17:50:58Z) - SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [70.72227437717467]
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. Their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. We propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens.
arXiv Detail & Related papers (2025-06-15T05:04:17Z) - Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models [30.7855782696894]
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. We propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models.
arXiv Detail & Related papers (2025-05-27T13:47:18Z) - CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z)