BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model
- URL: http://arxiv.org/abs/2602.20566v1
- Date: Tue, 24 Feb 2026 05:31:52 GMT
- Title: BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model
- Authors: Haosheng Li, Weixin Mao, Zihan Lan, Hongwei Xiong, Hongan Wang, Chenyang Si, Ziwei Liu, Xiaoming Deng, Hua Chen
- Abstract summary: Vision-Language-Action (VLA) models have achieved significant breakthroughs by leveraging Large Vision Language Models (VLMs) to jointly interpret instructions and visual inputs. The substantial increase in visual tokens, particularly from multi-view inputs, poses serious challenges to real-time robotic manipulation. We propose BFA++, a dynamic token pruning framework designed specifically for VLA models.
- Score: 44.72361174037017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models have achieved significant breakthroughs by leveraging Large Vision Language Models (VLMs) to jointly interpret instructions and visual inputs. However, the substantial increase in visual tokens, particularly from multi-view inputs, poses serious challenges to real-time robotic manipulation. Existing acceleration techniques for VLMs, such as token pruning, often result in degraded performance when directly applied to VLA models, as they overlook the relationships between different views and fail to account for the dynamic and task-specific characteristics of robotic operation. To address this, we propose BFA++, a dynamic token pruning framework designed specifically for VLA models. BFA++ introduces a hierarchical pruning strategy guided by two-level importance predictors: an intra-view predictor highlights task-relevant regions within each image to suppress spatial noise, while an inter-view predictor identifies critical camera views throughout different manipulation phases to reduce cross-view redundancy. This design enables efficient token selection while preserving essential visual cues, resulting in improved computational efficiency and higher manipulation success rates. Evaluations on the RoboTwin benchmark and real-world robotic tasks demonstrate that BFA++ consistently outperforms existing methods. BFA++ improves the success rate by about 10% on both the π0 and RDT models, achieving speedups of 1.8X and 1.5X, respectively. Our results highlight that context-sensitive and task-aware token pruning serves as a more effective strategy than full visual processing, enabling faster inference and improved manipulation accuracy in real-world robotic systems.
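As an illustration of the two-level selection described in the abstract, here is a minimal sketch of hierarchical token pruning, assuming per-token intra-view scores and per-view inter-view scores are already produced by small predictor heads. The function and parameter names (`hierarchical_prune`, `keep_ratio`, etc.) are assumptions for this sketch, not the authors' released implementation.

```python
# Illustrative sketch (not the authors' code): hierarchical token pruning in the
# spirit of BFA++. Per-token scores within each view (intra-view) and per-view
# scores (inter-view) are combined, and only the highest-scoring visual tokens
# are kept. All names and the combination rule are assumptions for illustration.
import torch

def hierarchical_prune(view_tokens, intra_scores, inter_scores, keep_ratio=0.5):
    """
    view_tokens:  (V, N, D) visual tokens from V camera views, N tokens each
    intra_scores: (V, N)    task-relevance score of each token within its view
    inter_scores: (V,)      importance of each view for the current phase
    keep_ratio:   fraction of the V*N tokens to retain
    Returns the kept tokens, flattened to (K, D), and their flat indices.
    """
    V, N, D = view_tokens.shape
    # Combine the two levels: a token is kept if it is salient inside its view
    # AND its view matters for the current manipulation phase.
    combined = intra_scores * inter_scores.unsqueeze(-1)   # (V, N)
    flat_scores = combined.reshape(-1)                     # (V*N,)
    k = max(1, int(keep_ratio * V * N))
    keep_idx = torch.topk(flat_scores, k).indices          # top-k token ids
    kept = view_tokens.reshape(V * N, D)[keep_idx]         # (K, D)
    return kept, keep_idx

# Tiny usage example with random data: 3 views, 196 tokens per view.
tokens = torch.randn(3, 196, 768)
intra = torch.rand(3, 196)                    # e.g. from an intra-view predictor
inter = torch.softmax(torch.rand(3), dim=0)   # e.g. from an inter-view predictor
kept, idx = hierarchical_prune(tokens, intra, inter, keep_ratio=0.25)
print(kept.shape)                             # -> torch.Size([147, 768])
```

Multiplying the two score levels is one simple way to realize "prune within views, then across views" in a single top-k step; the paper's actual predictors and selection schedule may differ.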
Related papers
- DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models [25.91822750707556]
Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation. However, VLA models may overly attend to image tokens in task-irrelevant regions, which we describe as 'distracting tokens'. This behavior can divert the model from generating the desired action tokens at each step, affecting task success rates.
arXiv Detail & Related papers (2026-01-22T16:02:56Z) - FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization [61.10456021136654]
We introduce FASTer, a unified framework for efficient and general robot learning. FASTerVQ encodes action chunks as single-channel images, capturing global-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance.
arXiv Detail & Related papers (2025-12-04T16:21:38Z) - VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference [17.901428758295307]
Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet their heavy computational cost limits real-time deployment. We propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models. VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
arXiv Detail & Related papers (2025-11-20T15:16:09Z) - Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity. We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z) - Action-aware Dynamic Pruning for Efficient Vision-Language-Action Manipulation [27.007611140797852]
Existing methods optimize inference speed by reducing visual redundancy within VLA models. We propose Action-aware Dynamic Pruning (ADP), a multi-modal pruning framework that integrates text-driven token selection with action-aware trajectory gating.
arXiv Detail & Related papers (2025-09-26T09:13:02Z) - Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models [30.7855782696894]
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. We propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models.
arXiv Detail & Related papers (2025-05-27T13:47:18Z) - DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs). Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity (see the merging sketch after this list). Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
arXiv Detail & Related papers (2025-04-23T18:38:18Z) - CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in-context, zero-shot and few-shot, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z)
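The DyMU entry above describes merging similar visual tokens (DToMe). Below is a minimal, hedged sketch of similarity-based token merging; the cosine-similarity threshold and greedy one-pass averaging rule are illustrative assumptions, not the paper's actual algorithm.

```python
# Minimal sketch (assumptions throughout): merging similar visual tokens, in the
# spirit of DToMe from the DyMU entry above. Tokens whose cosine similarity to an
# already-kept token exceeds a threshold are averaged into it; the threshold and
# the greedy one-pass rule are illustrative choices, not the paper's algorithm.
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens, sim_threshold=0.9):
    """tokens: (N, D) visual token embeddings -> (M, D) merged tokens, M <= N."""
    normed = F.normalize(tokens, dim=-1)
    kept, counts = [], []
    for i in range(tokens.shape[0]):
        merged = False
        for j, anchor in enumerate(kept):
            if torch.dot(normed[i], F.normalize(anchor, dim=-1)) > sim_threshold:
                # Running average keeps the merged token representative of its group.
                kept[j] = (anchor * counts[j] + tokens[i]) / (counts[j] + 1)
                counts[j] += 1
                merged = True
                break
        if not merged:
            kept.append(tokens[i].clone())
            counts.append(1)
    return torch.stack(kept)

merged = merge_similar_tokens(torch.randn(196, 768))
print(merged.shape)  # near (196, 768) for random inputs; smaller for redundant images
```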