Fugu-MT paper translation (abstract): Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

Paper overview: Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

arxiv url: http://arxiv.org/abs/2606.12217v1
Date: Wed, 10 Jun 2026 15:31:25 GMT
Status: translation complete
Last updated in system: 2026-06-11 16:42:38.532824
Title: Making Foresight Actionable: Repurposing Representation Alignment in World Action Models
Title (reference translation): Making Foresight Actionable: Repurposing Representation Alignment in World Action Models
Authors: Lu Qiu, Yizhuo Li, Yi Chen, Yuying Ge, Yixiao Ge, Xihui Liu
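Abstract summary: World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution. However, generating plausible visual futures does not always guarantee the extraction of accurate actions. This paper proposes AGRA, an Action-Grounded Representation Alignment objective.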
Reference score (independently computed attention score): 57.23863557252883
License: http://creativecommons.org/licenses/by/4.0/
Abstract: World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal a phenomenon: generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. In this paper, we propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that AGRA makes world model representations more action-grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.
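Abstract (reference translation): World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal that generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant regions. Hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. In this paper, we propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that, by focusing the action decoder on the correct interaction regions, AGRA improves object localization accuracy and affordance understanding and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.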
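
The abstract says AGRA aligns intermediate video diffusion features with semantic representations from a foundation visual encoder, but it does not give the form of the objective. The sketch below shows one common way such an alignment term can be written, assuming a patch-wise cosine similarity between linearly projected hidden states and frozen encoder features. The function names (`project`, `cosine_similarity`, `alignment_loss`), the projection matrix `W`, and the choice of cosine similarity are all illustrative assumptions, not the paper's stated method.

```python
import math


def project(hidden, W):
    """Apply a linear projection (hypothetical) mapping a hidden state to the encoder feature space.

    W is a list of rows; each row has the same length as `hidden`.
    """
    return [sum(w * x for w, x in zip(row, hidden)) for row in W]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_loss(hidden_tokens, target_tokens, W):
    """Assumed alignment term: 1 minus the mean patch-wise cosine similarity.

    hidden_tokens: per-patch intermediate features from the video diffusion model.
    target_tokens: per-patch features from a frozen visual encoder (same patch order).
    The value is 0.0 for perfectly aligned patches and 2.0 for exactly opposite ones.
    """
    if not hidden_tokens or len(hidden_tokens) != len(target_tokens):
        raise ValueError("need the same non-zero number of hidden and target tokens")
    sims = [
        cosine_similarity(project(h, W), t)
        for h, t in zip(hidden_tokens, target_tokens)
    ]
    return 1.0 - sum(sims) / len(sims)


# Toy usage with two 2-dimensional patches and an identity projection.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.0, 2.0]]
aligned_loss = alignment_loss(hidden, hidden, identity)
opposite_loss = alignment_loss(hidden, [[-1.0, 0.0], [0.0, -2.0]], identity)
```

In a training setup of this kind, such a term would typically be added to the video and action losses with a weighting coefficient; the abstract does not specify that weighting or which diffusion layer is aligned.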

Related papers