Fugu-MT Paper Translation (Abstract): Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models

Paper abstract: Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2512.09927v1
Date: Wed, 10 Dec 2025 18:59:24 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:40.267196
Title: Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models
Title (reference translation): Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models
Authors: Yifan Ye, Jiaqi Ma, Jun Cen, Zhihe Lu
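Abstract summary: Vision-Language-Action (VLA) models pretrained on large-scale multimodal datasets have emerged as powerful foundations for robotic perception and control. The paper proposes Token Expand-and-Merge-VLA (TEAM-VLA), a training-free token compression framework that accelerates VLA inference while preserving task performance.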
Reference score (independently computed attention metric): 16.321608201919474
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models pretrained on large-scale multimodal datasets have emerged as powerful foundations for robotic perception and control. However, their massive scale, often billions of parameters, poses significant challenges for real-time deployment, as inference becomes computationally expensive and latency-sensitive in dynamic environments. To address this, we propose Token Expand-and-Merge-VLA (TEAM-VLA), a training-free token compression framework that accelerates VLA inference while preserving task performance. TEAM-VLA introduces a dynamic token expansion mechanism that identifies and samples additional informative tokens in the spatial vicinity of attention-highlighted regions, enhancing contextual completeness. These expanded tokens are then selectively merged in deeper layers under action-aware guidance, effectively reducing redundancy while maintaining semantic coherence. By coupling expansion and merging within a single feed-forward pass, TEAM-VLA achieves a balanced trade-off between efficiency and effectiveness, without any retraining or parameter updates. Extensive experiments on the LIBERO benchmark demonstrate that TEAM-VLA consistently improves inference speed while maintaining or even surpassing the task success rate of full VLA models. The code is publicly available at https://github.com/Jasper-aaa/TEAM-VLA
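Abstract (reference translation): Vision-Language-Action (VLA) models pretrained on large-scale multimodal datasets have emerged as a powerful foundation for robotic perception and control. However, their enormous scale, often billions of parameters, makes inference computationally expensive and latency-sensitive in dynamic environments, which poses a serious challenge for real-time deployment. The paper therefore proposes Token Expand-and-Merge-VLA (TEAM-VLA), a training-free token compression framework that accelerates VLA inference while preserving task performance. TEAM-VLA introduces a dynamic token expansion mechanism that identifies and samples additional informative tokens in the spatial neighborhood of attention-highlighted regions, improving contextual completeness. These expanded tokens are then selectively merged in deeper layers under action-aware guidance, which reduces redundancy while maintaining semantic coherence. By coupling expansion and merging within a single feed-forward pass, TEAM-VLA achieves a balanced trade-off between efficiency and effectiveness without any retraining or parameter updates. Extensive experiments on the LIBERO benchmark show that TEAM-VLA consistently improves inference speed while maintaining or even exceeding the task success rate of full VLA models. The code is published at https://github.com/Jasper-aaa/TEAM-VLA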
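
The expand-then-merge idea described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names (`expand_tokens`, `merge_tokens`), the square neighborhood of radius 1, and the greedy cosine-similarity merge are all assumptions chosen for illustration. In particular, the abstract states that merging is guided by action-aware signals, which this sketch replaces with plain feature similarity.

```python
from math import sqrt

def expand_tokens(scores, grid_w, top_k, radius=1):
    """Keep the top_k highest-scoring patch tokens plus their spatial neighbours.

    `scores` is a flat list of per-patch attention scores laid out row by row
    on a grid that is `grid_w` patches wide (hypothetical layout).
    """
    grid_h = len(scores) // grid_w
    seeds = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    kept = set()
    for idx in seeds:
        row, col = divmod(idx, grid_w)
        # Expansion step: add every patch within `radius` of the seed patch.
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                r, c = row + dr, col + dc
                if 0 <= r < grid_h and 0 <= c < grid_w:
                    kept.add(r * grid_w + c)
    return sorted(kept)

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_tokens(features, threshold=0.9):
    """Greedily average each token into the first group it is similar to.

    Assumption: similarity to the running group mean stands in for the
    action-aware guidance mentioned in the abstract.
    """
    groups = []  # each entry is [running_sum_vector, member_count]
    for feat in features:
        for group in groups:
            mean = [s / group[1] for s in group[0]]
            if cosine(mean, feat) >= threshold:
                group[0] = [s + x for s, x in zip(group[0], feat)]
                group[1] += 1
                break
        else:
            groups.append([list(feat), 1])
    return [[s / count for s in total] for total, count in groups]
```

A caller would first run `expand_tokens` on the attention scores of an early layer to choose which visual tokens to keep, then pass the features of those tokens to `merge_tokens` at a deeper layer. The sketch leaves out everything that depends on the actual model, such as where the scores come from and which layers are used.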

Related papers