Fugu-MT Paper Translation (Abstract): Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

Paper abstract: Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

arxiv url: http://arxiv.org/abs/2606.08242v1
Date: Sat, 06 Jun 2026 15:58:12 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:05.989655
Title: Light-WAM: Efficient World Action Models with State-Fusion Action Decoding
Authors: Ziang Li, Dongzhou Cheng, Yibin Wang, Shiyue Wang, Xiaoyang Xu, Lingxuan Weng, Juan Wang, Jiaqi Wang
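Abstract summary: Light-WAM is a lightweight World Action Model for efficient robot manipulation. It is built on a compact video backbone and performs future-video supervision in a downsampled latent space. Experiments show that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0.
Reference score (independently computed attention score): 15.384126562001027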
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it is built with a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03ms inference latency with 4.1GiB peak GPU memory and improved training throughput.
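The abstract describes the action decoder only at a high level: adapted states are read from several backbone layers, fused by learned-query pooling, and mapped to an action chunk in one forward pass. The following is a minimal NumPy sketch of that data flow, not the authors' implementation. The class name `StateFusionSketch`, the linear per-layer adapters, the single-head attention pooling, the linear output head, and all dimensions are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class StateFusionSketch:
    """Hypothetical stand-in for the paper's StateFusionActionExpert."""

    def __init__(self, num_layers, hidden_dim, fused_dim,
                 num_queries, chunk_len, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: one linear adapter per tapped backbone layer.
        self.adapters = [
            rng.normal(0.0, hidden_dim ** -0.5, (hidden_dim, fused_dim))
            for _ in range(num_layers)
        ]
        # Query vectors (learned in a real model, random here) that pool the tokens.
        self.queries = rng.normal(0.0, 1.0, (num_queries, fused_dim))
        # Assumption: a single linear head maps pooled features to the chunk.
        in_dim = num_queries * fused_dim
        self.head = rng.normal(0.0, in_dim ** -0.5,
                               (in_dim, chunk_len * action_dim))
        self.chunk_len = chunk_len
        self.action_dim = action_dim

    def __call__(self, layer_states):
        # layer_states: list of (num_tokens, hidden_dim) arrays, one per layer.
        adapted = [h @ w for h, w in zip(layer_states, self.adapters)]
        tokens = np.concatenate(adapted, axis=0)        # (L * T, fused_dim)
        # Learned-query pooling: each query attends over all adapted tokens.
        scores = self.queries @ tokens.T / np.sqrt(tokens.shape[1])
        pooled = softmax(scores, axis=-1) @ tokens      # (Q, fused_dim)
        # One forward pass yields the whole action chunk, no iterative sampling.
        flat = pooled.reshape(-1) @ self.head
        return flat.reshape(self.chunk_len, self.action_dim)


# Usage with made-up sizes: 3 tapped layers, 16 tokens each, 7-DoF actions.
rng = np.random.default_rng(1)
states = [rng.normal(size=(16, 64)) for _ in range(3)]
expert = StateFusionSketch(num_layers=3, hidden_dim=64, fused_dim=32,
                           num_queries=4, chunk_len=8, action_dim=7)
chunk = expert(states)
print(chunk.shape)  # (8, 7)
```

The point of the sketch is the cost profile the abstract claims: the decoder is a handful of matrix products over already-computed backbone states, in contrast to a generative action expert that would need multiple denoising or autoregressive steps per chunk.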

