Fugu-MT Paper Translation (Abstract): Action Images: End-to-End Policy Learning via Multiview Video Generation

Paper abstract: Action Images: End-to-End Policy Learning via Multiview Video Generation

arxiv url: http://arxiv.org/abs/2604.06168v1
Date: Tue, 07 Apr 2026 17:59:30 GMT
Status: Translation complete
Last updated in system: 2026-04-08 17:42:09.990455
Title: Action Images: End-to-End Policy Learning via Multiview Video Generation
Title (reference translation): Action Images: End-to-End Policy Learning via Multi-View Video Generation
Authors: Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan
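Abstract summary: We propose Action Images, a unified world action model that formulates policy learning as multiview video generation. The model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models.
Reference score (independently computed attention metric): 71.67070674321043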
License: http://creativecommons.org/licenses/by/4.0/
Abstract: World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.
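Abstract (reference translation): World action models (WAMs) have emerged as a promising direction for robot policy learning because they use powerful video backbones to model future states. However, existing approaches often depend on separate action modules or use action representations that are not grounded in pixels, which makes it hard to fully exploit the pretrained knowledge of video models and limits transfer across viewpoints and environments. This work proposes Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, it converts 7-DoF robot actions into interpretable action images. This pixel-grounded action representation lets the video backbone itself act as a zero-shot policy without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, the model achieves the strongest zero-shot success rates and better video-action joint generation quality than prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.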
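The abstract describes turning robot actions into multi-view images grounded in 2D pixels. The following is a minimal sketch of that general idea, not the authors' rendering procedure, which the abstract does not specify. It assumes a pinhole camera per view, draws a Gaussian blob at the projected end-effector position, and encodes the gripper state as blob brightness. The function names, camera parameters, blob shape, and brightness coding are all hypothetical, and the three orientation components of a 7-DoF action are omitted for brevity.

```python
import math

def mat_vec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a length-3 vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def project_point(point_world, K, R, t):
    """Project a 3D world point to pixel coordinates (u, v) with a pinhole camera.

    K is the intrinsic matrix; R and t map world coordinates to the camera frame.
    """
    p_cam = [a + b for a, b in zip(mat_vec(R, point_world), t)]
    u, v, w = mat_vec(K, p_cam)
    return u / w, v / w

def render_blob(uv, height, width, sigma=2.0, intensity=1.0):
    """Return a height x width image with a Gaussian blob centred on pixel uv."""
    u, v = uv
    return [
        [intensity * math.exp(-((x - u) ** 2 + (y - v) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def action_to_images(position, gripper_open, cameras, height=64, width=64):
    """Render one image per camera for a single action.

    Assumption: gripper state is coded as blob brightness (open = 1.0, closed = 0.5).
    """
    intensity = 1.0 if gripper_open else 0.5
    return [
        render_blob(project_point(position, K, R, t), height, width, intensity=intensity)
        for K, R, t in cameras
    ]

def peak(image):
    """Return the (row, col) of the brightest pixel."""
    cells = ((r, c) for r in range(len(image)) for c in range(len(image[0])))
    return max(cells, key=lambda rc: image[rc[0]][rc[1]])

# Hypothetical two-camera rig: a front view and a side view rotated 90 degrees about y.
K = [[50.0, 0.0, 32.0], [0.0, 50.0, 32.0], [0.0, 0.0, 1.0]]
FRONT = (K, [[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0.0, 0.0, 1.0])
SIDE = (K, [[0, 0, -1], [0, 1, 0], [1, 0, 0]], [0.0, 0.0, 1.0])

images = action_to_images([0.1, -0.2, 0.0], gripper_open=True, cameras=[FRONT, SIDE])
```

Stacking such images over time for each camera would give a multi-view action video in the sense the abstract describes. Recovering an action from generated frames would require the inverse step, for example locating the blob in two or more views and triangulating, which this sketch does not cover.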

Related papers