Fugu-MT paper translation (abstract): AC^2-VLA: Action-Context-Aware Adaptive Computation in Vision-Language-Action Models for Efficient Robotic Manipulation

Paper overview: AC^2-VLA: Action-Context-Aware Adaptive Computation in Vision-Language-Action Models for Efficient Robotic Manipulation

arxiv url: http://arxiv.org/abs/2601.19634v1
Date: Tue, 27 Jan 2026 14:10:39 GMT
Status: translation complete
Last updated in system: 2026-03-30 05:29:50.518389
Title: AC^2-VLA: Action-Context-Aware Adaptive Computation in Vision-Language-Action Models for Efficient Robotic Manipulation
Title (reference translation): AC^2-VLA: Action-Context-Aware Adaptive Computation in Vision-Language-Action Models for Efficient Robotic Manipulation
Authors: Wenda Yu, Tianshi Wang, Fengling Li, Jingjing Li, Lei Zhu
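Abstract summary: We propose Action-Context-aware Adaptive Computation for VLA models (AC^2-VLA). AC^2-VLA adaptively performs cognition reuse across timesteps, token pruning, and selective execution of model components within a unified mechanism. In experiments on robotic manipulation benchmarks, AC^2-VLA reduces FLOPs to 29.4% of the dense baseline and achieves up to a 1.79× speedup.
Reference score (attention score computed independently by this site): 21.23747444669735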
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have demonstrated strong performance in robotic manipulation, yet their closed-loop deployment is hindered by the high latency and compute cost of repeatedly running large vision-language backbones at every timestep. We observe that VLA inference exhibits structured redundancies across temporal, spatial, and depth dimensions, and that most existing efficiency methods ignore action context, despite its central role in embodied tasks. To address this gap, we propose Action-Context-aware Adaptive Computation for VLA models (AC^2-VLA), a unified framework that conditions computation on current visual observations, language instructions, and previous action states. Based on this action-centric context, AC^2-VLA adaptively performs cognition reuse across timesteps, token pruning, and selective execution of model components within a unified mechanism. To train the adaptive policy, we introduce an action-guided self-distillation scheme that preserves the behavior of the dense VLA policy while enabling structured sparsification that transfers across tasks and settings. Extensive experiments on robotic manipulation benchmarks show that AC^2-VLA achieves up to a 1.79× speedup while reducing FLOPs to 29.4% of the dense baseline, with comparable task success.
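Abstract (reference translation): Vision-Language-Action (VLA) models show strong performance in robotic manipulation, but their closed-loop deployment is held back by the latency and compute cost of running a large vision-language backbone again at every timestep. We observe that VLA inference exhibits structured redundancy along the temporal, spatial, and depth dimensions, and that most existing efficiency methods ignore action context even though it plays a central role in embodied tasks. To address this gap, we propose Action-Context-aware Adaptive Computation for VLA models (AC^2-VLA), a unified framework that conditions computation on the current visual observation, the language instruction, and previous action states. Based on this action-centric context, AC^2-VLA adaptively performs cognition reuse across timesteps, token pruning, and selective execution of model components within a unified mechanism. To train the adaptive policy, we introduce an action-guided self-distillation scheme that preserves the behavior of the dense VLA policy while enabling structured sparsification that transfers across tasks and settings. Extensive experiments on robotic manipulation benchmarks show that AC^2-VLA achieves up to a 1.79× speedup while reducing FLOPs to 29.4% of the dense baseline, with comparable task success.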
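Note on the headline figures: the abstract reports a FLOP count and a speedup, which measure different things. The short Python sketch below compares the two under one loud assumption that is not made in the paper: that latency would scale linearly with FLOPs. The function names are illustrative only, and the two figures may not come from the same configuration, since the speedup is stated as an upper bound ("up to").

```python
# Figures quoted in the abstract.
FLOPS_FRACTION = 0.294    # FLOPs relative to the dense baseline
MEASURED_SPEEDUP = 1.79   # reported as "up to" 1.79x


def ideal_speedup(flops_fraction: float) -> float:
    """Speedup implied by a FLOP fraction IF latency scaled linearly
    with FLOPs. This linear model is an assumption for illustration."""
    if not 0.0 < flops_fraction <= 1.0:
        raise ValueError("flops_fraction must be in (0, 1]")
    return 1.0 / flops_fraction


def realized_share(measured: float, ideal: float) -> float:
    """Fraction of the FLOP-implied speedup that shows up as wall-clock gain."""
    return measured / ideal


ideal = ideal_speedup(FLOPS_FRACTION)
share = realized_share(MEASURED_SPEEDUP, ideal)

print(f"FLOPs removed:          {1 - FLOPS_FRACTION:.1%}")
print(f"FLOP-implied speedup:   {ideal:.2f}x")
print(f"Reported speedup:       {MEASURED_SPEEDUP:.2f}x")
print(f"Share of ideal reached: {share:.1%}")
```

Under that assumption the FLOP reduction alone would imply roughly a 3.4x speedup, so the reported 1.79x is about half of it. A gap of this kind is common for sparsified models, where memory traffic and fixed per-step overheads do not shrink with the FLOP count; the abstract itself does not break the gap down.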


Related papers