Related papers: The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning

URL: http://arxiv.org/abs/2509.12594v2
Date: Sun, 21 Sep 2025 13:51:09 GMT
Title: The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning
Authors: Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Yahui Liu, Sheng Sun, Xianpeng Lang,
Abstract summary: LightVLA is a differentiable token pruning framework for vision-language-action (VLA) models. It generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. We show that LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.6% improvement in task success rate.
Score: 27.75632811770582
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: It generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens which do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.6% improvement in task success rate. Meanwhile, we also investigate the learnable query-based token pruning method LightVLA* with additional trainable parameters, which also achieves satisfactory performance. Our work reveals that as VLA pursues optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the collateral goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real-time robotic systems.
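
The selection step described in the abstract (queries score visual tokens, then Gumbel softmax picks tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the scaled dot-product scoring, the random stand-in queries, and the rule "keep every token chosen by at least one query" are all assumptions made for illustration. NumPy has no autograd, so only the forward pass is shown. In a training framework, the hard one-hot choice is typically paired with the soft probabilities through a straight-through estimator so that gradients can flow.

```python
import numpy as np

def gumbel_softmax_select(scores, rng, tau=1.0):
    """Return (soft, hard) selection matrices for scores of shape (Q, N)."""
    # Gumbel(0, 1) noise via inverse transform sampling: g = -log(-log(u)).
    u = rng.uniform(low=1e-9, high=1.0, size=scores.shape)
    gumbel = -np.log(-np.log(u))
    logits = (scores + gumbel) / tau
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    soft = np.exp(logits)
    soft = soft / soft.sum(axis=-1, keepdims=True)
    # Hard one-hot choice per query (forward pass only in this sketch).
    hard = np.zeros_like(soft)
    hard[np.arange(soft.shape[0]), soft.argmax(axis=-1)] = 1.0
    return soft, hard

def prune_tokens(tokens, queries, rng, tau=1.0):
    """Keep tokens picked by at least one query (hypothetical rule)."""
    # Assumed scoring: scaled dot product between queries and tokens.
    scores = queries @ tokens.T / np.sqrt(tokens.shape[-1])
    soft, hard = gumbel_softmax_select(scores, rng, tau)
    keep_idx = np.flatnonzero(hard.sum(axis=0) > 0)
    return tokens[keep_idx], keep_idx, soft

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))   # 16 visual tokens, dimension 8
queries = rng.normal(size=(4, 8))   # 4 stand-in queries (random here)
kept, keep_idx, soft = prune_tokens(tokens, queries, rng)
print(f"kept {len(keep_idx)} of {len(tokens)} tokens: {keep_idx.tolist()}")
```

Because several queries may select the same token, the number of kept tokens in this sketch varies with the input rather than being fixed by a preset ratio.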

Related papers

BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model [44.72361174037017]
Vision-Language-Action (VLA) models have achieved significant breakthroughs by leveraging Large Vision Language Models (VLMs) to jointly interpret instructions and visual inputs. The substantial increase in visual tokens, particularly from multi-view inputs, poses serious challenges to real-time robotic manipulation. We propose BFA++, a dynamic token pruning framework designed specifically for VLA models.
arXiv Detail & Related papers (2026-02-24T05:31:52Z)
ActionCodec: What Makes for Good Action Tokenizers [106.78093973045526]
Vision-Language-Action (VLA) models have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity. We introduce ActionCodec, a high-performance action tokenizer that significantly enhances both training efficiency and VLA performance.
arXiv Detail & Related papers (2026-02-17T07:07:15Z)
Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement [27.517125673741486]
Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic control. We propose SD-VLA, a framework that disentangles visual inputs into multi-level static and dynamic tokens. We introduce a new benchmark that more effectively evaluates the long-horizon temporal dependency modeling ability of VLAs.
arXiv Detail & Related papers (2026-02-03T20:17:47Z)
VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference [17.901428758295307]
Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet their heavy computational cost limits real-time deployment. We propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models. VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
arXiv Detail & Related papers (2025-11-20T15:16:09Z)
VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving [90.21844353859454]
We introduce a novel approach featuring a lightweight MLLM architecture with enhanced vision components. VLDrive achieves state-of-the-art driving performance while reducing parameters by 81%.
arXiv Detail & Related papers (2025-11-09T07:14:53Z)
CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification [48.81250395291505]
Recent Vision-Language-Action models require extensive post-training, resulting in high computational overhead. We propose CogVLA, a framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA.
arXiv Detail & Related papers (2025-08-28T17:50:58Z)
Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models [30.7855782696894]
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. We propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models.
arXiv Detail & Related papers (2025-05-27T13:47:18Z)
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z)
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success [100.226572152954]
We present an optimized fine-tuning recipe for vision-language-action models (VLAs). Our recipe boosts OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26×. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot.
arXiv Detail & Related papers (2025-02-27T00:30:29Z)
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs. We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
