Related papers: DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models

URL: http://arxiv.org/abs/2601.16065v1
Date: Thu, 22 Jan 2026 16:02:56 GMT
Title: DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models
Authors: Chenyang Li, Jieyuan Liu, Bin Li, Bo Gao, Yilin Yuan, Yangfan He, Yuchen Li, Jingqun Tang,
Abstract summary: Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation. VLA models may overly attend to image tokens in the task-irrelevant region, which we describe as 'distracting tokens'. This behavior can disturb the model from the generation of the desired action tokens in each step, affecting the success rate of tasks.
Score: 25.91822750707556
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in the task-irrelevant region, which we describe as 'distracting tokens'. This behavior can divert the model from generating the desired action tokens at each step, lowering the task success rate. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve the task success rate and to explore the performance upper bound of the model without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieves relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attention in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: https://anonymous.4open.science/r/CBD3.
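
The idea of detecting and pruning distracting image tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the task-relevant region or the detection rule is obtained, so the `relevant` mask, the `ratio` threshold, and all function names below are hypothetical. The sketch assumes one attention weight per image token (for example, averaged over heads for the current action token) and flags tokens outside the relevant region that receive more attention than the average relevant token.

```python
from typing import List, Sequence


def find_distracting_tokens(
    attention: Sequence[float],
    relevant: Sequence[bool],
    ratio: float = 1.0,
) -> List[int]:
    """Return indices of image tokens flagged as distracting.

    Hypothetical rule (an assumption, not taken from the paper): a token is
    distracting when it lies outside the task-relevant region and its
    attention weight exceeds `ratio` times the mean attention of the
    task-relevant tokens.
    """
    if len(attention) != len(relevant):
        raise ValueError("attention and relevant must have the same length")
    relevant_weights = [a for a, r in zip(attention, relevant) if r]
    if not relevant_weights:
        return []  # no reference region, so nothing is flagged
    threshold = ratio * sum(relevant_weights) / len(relevant_weights)
    return [
        i
        for i, (a, r) in enumerate(zip(attention, relevant))
        if not r and a > threshold
    ]


def prune_tokens(tokens: Sequence, drop: Sequence[int]) -> list:
    """Remove the flagged image tokens while keeping the original order."""
    dropped = set(drop)
    return [t for i, t in enumerate(tokens) if i not in dropped]


def irrelevant_attention_share(
    attention: Sequence[float], relevant: Sequence[bool]
) -> float:
    """Fraction of total attention that falls outside the relevant region."""
    total = sum(attention)
    if total == 0:
        return 0.0
    outside = sum(a for a, r in zip(attention, relevant) if not r)
    return outside / total


# Toy values chosen for illustration only; they are not results from the paper.
attention = [0.05, 0.40, 0.10, 0.30, 0.15]
relevant = [False, False, True, True, False]
tokens = ["t0", "t1", "t2", "t3", "t4"]

flagged = find_distracting_tokens(attention, relevant)
kept = prune_tokens(tokens, flagged)
```

With these toy inputs, only the token at index 1 is flagged, because it sits outside the relevant region and its weight is above the mean weight of the relevant tokens. `irrelevant_attention_share` is the kind of quantity one could compare against task success rate when examining the negative correlation the abstract reports; how the paper actually measures it is not stated here.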

Related papers

BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model [44.72361174037017]
Vision-Language-Action (VLA) models have achieved significant breakthroughs by leveraging Large Vision Language Models (VLMs) to jointly interpret instructions and visual inputs. The substantial increase in visual tokens, particularly from multi-view inputs, poses serious challenges to real-time robotic manipulation. We propose BFA++, a dynamic token pruning framework designed specifically for VLA models.
arXiv Detail & Related papers (2026-02-24T05:31:52Z)
VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models [26.542479606920423]
Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks. Despite the success, extending large pretrained VLA models to the action space can induce vision-action misalignment. We propose a training framework that explicitly strengthens visual conditioning in VLA models.
arXiv Detail & Related papers (2026-02-04T20:59:29Z)
Vision-Language Models Unlock Task-Centric Latent Actions [75.53481518882275]
We propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations. We show that simply asking VLMs to ignore distractors can substantially improve latent action quality, yielding up to a six-fold increase in downstream success rates on Distracting MetaWorld.
arXiv Detail & Related papers (2026-01-30T08:38:59Z)
Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models [30.7855782696894]
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. We propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models.
arXiv Detail & Related papers (2025-05-27T13:47:18Z)
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z)
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success [100.226572152954]
We present an optimized fine-tuning recipe for vision-language-action models (VLAs). Our recipe boosts OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26$\times$. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot.
arXiv Detail & Related papers (2025-02-27T00:30:29Z)
VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching [23.52474883720957]
Vision-Language-Action (VLA) models have demonstrated strong multi-modal reasoning capabilities, enabling direct action generation from visual perception and language instructions. This paper introduces VLA-Cache, a training-free inference acceleration method that reduces computational overhead by adaptively caching and reusing static visual tokens across frames.
arXiv Detail & Related papers (2025-02-04T09:48:14Z)
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation [100.25567121604382]
Vision-Language-Action (VLA) models have improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. We present a new advanced VLA architecture derived from Vision-Language-Models (VLM). We show that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds.
arXiv Detail & Related papers (2024-11-29T12:06:03Z)
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens [53.99177152562075]
Scaling up autoregressive models in vision has not proven as beneficial as in large language models. We focus on two critical factors: whether models use discrete or continuous tokens, and whether tokens are generated in a random or fixed order using BERT- or GPT-like transformer architectures. Our results show that while all models scale effectively in terms of validation loss, their evaluation performance -- measured by FID, GenEval score, and visual quality -- follows different trends.
arXiv Detail & Related papers (2024-10-17T17:59:59Z)
Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust [9.647148940880381]
Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies. We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that dynamically identifies regions of the input image that the model is sensitive to. We show that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of distractor objects and backgrounds.
arXiv Detail & Related papers (2024-10-02T19:29:24Z)
Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset. We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding. Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.