Fugu-MT Paper Translation (Abstract): The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning

Paper overview: The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning

arxiv url: http://arxiv.org/abs/2509.12594v1
Date: Tue, 16 Sep 2025 02:43:46 GMT
Status: Translation complete
System update date: 2025-09-17 17:50:52.847146
Title: The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning
Authors: Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Yahui Liu, Sheng Sun, Xianpeng Lang
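Abstract summary: LightVLA is a differentiable token pruning framework for vision-language-action (VLA) models. It generates dynamic queries to evaluate the importance of visual tokens and adopts Gumbel softmax to make token selection differentiable. LightVLA reduces FLOPs and latency by 59.1% and 38.2%, respectively, while improving task success rate by 2.9%.
Reference score (attention score computed independently by this site): 27.75632811770582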
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: It generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens which do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.9% improvement in task success rate. Meanwhile, we also investigate the learnable query-based token pruning method LightVLA* with additional trainable parameters, which also achieves satisfactory performance. Our work reveals that as VLA pursues optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the collateral goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real-time robotic systems.
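The abstract names two ingredients: queries that score visual tokens, and Gumbel softmax for differentiable selection. The following is a minimal, forward-pass-only sketch of how these could fit together, not the authors' implementation. The dot-product scoring, the rule "each query keeps its top-scoring token", and all function names (`gumbel_noise`, `softmax`, `score_tokens`, `select_tokens`) are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = max(rng.random(), 1e-12)  # clamp away from 0 so log(u) is finite
    return -math.log(-math.log(u))


def softmax(values, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def score_tokens(queries, tokens):
    """Assumed scoring rule: dot product between each query and each token."""
    return [[sum(a * b for a, b in zip(q, t)) for t in tokens] for q in queries]


def select_tokens(queries, tokens, temperature=1.0, rng=None):
    """Return (sorted indices of kept tokens, soft selection weights per query).

    With rng=None the choice is a plain argmax (inference-style).
    With an rng, Gumbel noise perturbs the scores first (training-style).
    In an autodiff framework the hard choice would be used in the forward
    pass and the soft weights would carry the gradient (a straight-through
    estimator); plain Python cannot show that part.
    """
    kept = set()
    soft_rows = []
    for row in score_tokens(queries, tokens):
        if rng is not None:
            row = [s + gumbel_noise(rng) for s in row]
        probs = softmax(row, temperature)
        soft_rows.append(probs)
        kept.add(max(range(len(probs)), key=probs.__getitem__))
    return sorted(kept), soft_rows


# Toy data (hypothetical): three 2-d visual tokens and three queries.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]]
queries = [[2.0, 0.0], [0.0, 3.0], [1.0, 0.0]]

kept, weights = select_tokens(queries, tokens)
# kept -> [0, 1]; token 2 is chosen by no query, so it would be pruned.
```

In this sketch the number of kept tokens is not fixed in advance: it falls out of how many distinct tokens the queries pick, which is one way a method could avoid a hand-set pruning ratio.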

Related papers