Fugu-MT Paper Translation (Abstract): VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models

Paper overview: VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2603.22991v1
Date: Tue, 24 Mar 2026 09:33:05 GMT
Status: Translation complete
Last updated in system: 2026-03-25 19:53:37.407302
Title: VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models
Authors: Jintao Cheng, Haozhe Wang, Weibin Li, Gang Wang, Yipu Zhang, Xiaoyu Tang, Jin Wu, Xieyuanli Chen, Yunhui Liu, Wei Zhang,
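Abstract summary: Vision-Language-Action (VLA) models have rapidly advanced embodied intelligence, enabling robots to execute complex, instruction-driven tasks. Current approaches often prune visually sparse yet structurally critical regions that support manipulation, leading to unstable behavior during early task phases. The proposed method, VLA-IAP (Interaction-Aligned Pruning), introduces a geometric prior mechanism to preserve structural anchors and a dynamic scheduling strategy.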
Reference score (independently computed attention score): 27.12266806191131
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have rapidly advanced embodied intelligence, enabling robots to execute complex, instruction-driven tasks. However, as model capacity and visual context length grow, the inference cost of VLA systems becomes a major bottleneck for real-world deployment on resource-constrained platforms. Existing visual token pruning methods mainly rely on semantic saliency or simple temporal cues, overlooking the continuous physical interaction, a fundamental property of VLA tasks. Consequently, current approaches often prune visually sparse yet structurally critical regions that support manipulation, leading to unstable behavior during early task phases. To overcome this, we propose a shift toward an explicit Interaction-First paradigm. Our proposed training-free method, VLA-IAP (Interaction-Aligned Pruning), introduces a geometric prior mechanism to preserve structural anchors and a dynamic scheduling strategy that adapts pruning intensity based on semantic-motion alignment. This enables a conservative-to-aggressive transition, ensuring robustness during early uncertainty and efficiency once interaction is locked. Extensive experiments show that VLA-IAP achieves a 97.8% success rate with a 1.25× speedup on the LIBERO benchmark, and up to 1.54× speedup while maintaining performance comparable to the unpruned backbone. Moreover, the method demonstrates superior and consistent performance across multiple model architectures and three different simulation environments, as well as a real robot platform, validating its strong generalization capability and practical applicability. Our project website is: https://chengjt1999.github.io/VLA-IAP.github.io/
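The abstract gives no formulas, so the following is only an illustrative sketch of the conservative-to-aggressive idea, not the authors' implementation. It assumes a per-token saliency score, a set of always-kept "anchor" tokens standing in for the geometric prior, and an alignment score in [0, 1]. The names `keep_ratio` and `prune_tokens` and the 0.9 and 0.5 ratios are hypothetical.

```python
def keep_ratio(alignment, conservative=0.9, aggressive=0.5):
    """Map an alignment score in [0, 1] to the fraction of tokens to keep.

    Low alignment (early, uncertain phase) keeps most tokens; high alignment
    keeps fewer. The linear schedule and default ratios are assumptions.
    """
    a = min(1.0, max(0.0, alignment))  # clamp to [0, 1]
    return conservative + (aggressive - conservative) * a


def prune_tokens(saliency, anchors, alignment):
    """Return the sorted indices of the visual tokens that survive pruning.

    saliency:  list of per-token scores (hypothetical stand-in for semantic saliency)
    anchors:   set of token indices that are never pruned (stand-in for the
               geometric prior that preserves structural anchors)
    alignment: scalar in [0, 1] (stand-in for semantic-motion alignment)
    """
    n = len(saliency)
    # The budget never drops below the number of anchors.
    budget = max(len(anchors), round(keep_ratio(alignment) * n))
    kept = set(anchors)
    # Fill the remaining budget with the highest-saliency tokens.
    ranked = sorted(range(n), key=lambda i: saliency[i], reverse=True)
    for i in ranked:
        if len(kept) >= budget:
            break
        kept.add(i)
    return sorted(kept)


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.0]
    print(prune_tokens(scores, {9}, alignment=0.0))  # conservative phase
    print(prune_tokens(scores, {9}, alignment=1.0))  # aggressive phase
```

In this sketch the low-saliency anchor token (index 9) is kept in both phases, while the number of non-anchor tokens shrinks as the alignment score rises.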


Related papers