Fugu-MT Paper Translation (Abstract): AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

Paper overview: AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

arxiv url: http://arxiv.org/abs/2604.11135v1
Date: Mon, 13 Apr 2026 07:48:58 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.408622
Title: AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps
Title (reference translation): AIM: Unified World Action Modeling with Spatial Value Maps
Authors: Liaoyuan Fan, Zetian Xu, Chen Cao, Wenyao Zhang, Mingqi Yuan, Jiayu Chen,
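Abstract summary: AIM is an intent-aware unified world action model that bridges the gap between video world modeling and action generation through an explicit spatial interface. Built on a pretrained video generation model, AIM jointly models future observations and value maps within a shared mixture-of-transformers architecture. In experiments on the RoboTwin 2.0 benchmark, AIM achieves a 94.0% average success rate, significantly outperforming prior unified world action baselines.
Reference score (site-computed attention score): 7.710034405765985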
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Pretrained video generation models provide strong priors for robot control, but existing unified world action models still struggle to decode reliable actions without substantial robot-specific training. We attribute this limitation to a structural mismatch: while video models capture how scenes evolve, action generation requires explicit reasoning about where to interact and the underlying manipulation intent. We introduce AIM, an intent-aware unified world action model that bridges this gap via an explicit spatial interface. Instead of decoding actions directly from future visual representations, AIM predicts an aligned spatial value map that encodes task-relevant interaction structure, enabling a control-oriented abstraction of future dynamics. Built on a pretrained video generation model, AIM jointly models future observations and value maps within a shared mixture-of-transformers architecture. It employs intent-causal attention to route future information to the action branch exclusively through the value representation. We further propose a self-distillation reinforcement learning stage that freezes the video and value branches and optimizes only the action head using dense rewards derived from projected value-map responses together with sparse task-level signals. To support training and evaluation, we construct a simulation dataset of 30K manipulation trajectories with synchronized multi-view observations, actions, and value-map annotations. Experiments on RoboTwin 2.0 benchmark show that AIM achieves a 94.0% average success rate, significantly outperforming prior unified world action baselines. Notably, the improvement is more pronounced in long-horizon and contact-sensitive manipulation tasks, demonstrating the effectiveness of explicit spatial-intent modeling as a bridge between visual world modeling and robot control.
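Abstract (reference translation): Pretrained video generation models provide strong priors for robot control, but existing unified world action models still struggle to decode reliable actions without substantial robot-specific training. Video models capture how scenes evolve, whereas action generation requires explicit reasoning about where to interact and the underlying manipulation intent. We introduce AIM, an intent-aware unified world action model that bridges this gap through an explicit spatial interface. Instead of decoding actions directly from future visual representations, AIM predicts an aligned spatial value map that encodes task-relevant interaction structure, enabling a control-oriented abstraction of future dynamics. Built on a pretrained video generation model, AIM jointly models future observations and value maps within a shared mixture-of-transformers architecture. It uses intent-causal attention to route future information to the action branch exclusively through the value representation. We further propose a self-distillation reinforcement learning stage that freezes the video and value branches and optimizes only the action head, using dense rewards derived from projected value-map responses together with sparse task-level signals. To support training and evaluation, we construct a simulation dataset of 30K manipulation trajectories with synchronized multi-view observations, actions, and value-map annotations. Experiments on the RoboTwin 2.0 benchmark show that AIM achieves a 94.0% average success rate, significantly outperforming prior unified world action baselines. In particular, the results demonstrate the effectiveness of explicit spatial-intent modeling as a bridge between visual world modeling and robot control.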
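The abstract states that future information reaches the action branch exclusively through the value representation. The sketch below illustrates one way such a routing constraint could be written as a block attention mask. It is not the paper's implementation: the token ordering (video, value, action), the function names `intent_causal_mask` and `masked_attention`, and the choice of which other blocks are allowed to attend to each other are all assumptions made for illustration.

```python
import numpy as np


def intent_causal_mask(n_video: int, n_value: int, n_action: int) -> np.ndarray:
    """Return a boolean mask M where M[q, k] is True if query q may attend to key k.

    Assumed token order for this sketch: [video | value | action].
    """
    n = n_video + n_value + n_action
    video = slice(0, n_video)
    value = slice(n_video, n_video + n_value)
    action = slice(n_video + n_value, n)

    mask = np.zeros((n, n), dtype=bool)
    mask[video, video] = True    # future-observation tokens attend among themselves
    mask[value, video] = True    # value tokens read the predicted future
    mask[value, value] = True
    mask[action, value] = True   # action tokens see the future only via value tokens
    mask[action, action] = True
    # mask[action, video] stays False: no direct video-to-action path.
    return mask


def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)       # blocked pairs get zero weight
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Within a single attention layer under this mask, perturbing the video tokens leaves the action-token outputs unchanged. Across stacked layers, video information can still reach the action tokens, but only after passing through the value tokens, which is the routing the abstract describes.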

Related papers