Fugu-MT Paper Translation (Abstract): Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination

Paper Summary: Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination

arxiv url: http://arxiv.org/abs/2606.10040v2
Date: Wed, 10 Jun 2026 06:52:36 GMT
Status: Translation complete
Last updated in system: 2026-06-11 16:42:37.955359
Title: Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination
Authors: Jiajun Li, Tiecheng Guo, Yifan Ye, Rongyu Zhang, Xiaowei Chi, Qianpu Sun, Ying Li, Yunfan Lou, Yan Huang, Zhihe Lu, Meng Guo, Shanghang Zhang
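Abstract summary: World-Action Models (WAMs) couple future visual prediction with action generation. Most existing WAMs rely on photorealistic future prediction, which incurs high inference latency and makes real-time robot deployment difficult. This paper introduces Efficient-WAM, a World-Action Model that reduces the cost of future imagination while preserving its control benefit.
Reference score (attention metric computed independently by the site): 45.6948544726412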
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: World-Action Models (WAMs) have emerged as a promising paradigm for embodied control by coupling future visual prediction with action generation. However, most existing WAMs rely on photorealistic future prediction, which incurs high inference latency and makes real-time robot deployment difficult. This motivates a more efficient WAM design that preserves the control benefits of future visual prediction while reducing its inference cost. We introduce Efficient-WAM, a World-Action Model that reduces the cost of future imagination while preserving its control benefit. Efficient-WAM improves inference efficiency via a compact video expert transferred from WAN-2.2-5B, token-sparse video latents, and asymmetric video-action denoising that allocates fewer sampling steps to video than to actions. Instead of optimizing the future branch for visual fidelity, Efficient-WAM treats future video prediction as a compact guidance signal for action generation. Comprehensive experiments on RoboTwin 2.0 and real-world manipulation tasks show that Efficient-WAM maintains strong action performance despite visibly coarse future predictions. While maintaining competitive control capabilities, our 1B-parameter model can reduce per-chunk latency to around 100 ms during physical deployment, achieving a 30x speedup over existing WAMs.
