Fugu-MT Paper Translation (Summary): EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

Paper summary: EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

arxiv url: http://arxiv.org/abs/2605.06192v1
Date: Thu, 07 May 2026 13:06:19 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.810193
Title: EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
Title (reference translation, rendered back from Japanese): EA-WM: Event-Aware Generative World Model Using Structured Kinematic-to-Visual Action Fields
Authors: Zhaoyang Yang, Yurun Jin, Lizhe Qi, Cong Huang, Kai Chen
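Abstract summary: This paper proposes EA-WM, an event-aware generative world model that closes the loop between kinematic control and visual perception. It introduces event-aware bidirectional fusion blocks that modulate cross-branch attention to capture object state changes and interaction dynamics. EA-WM achieves state-of-the-art performance and significantly outperforms existing baselines.
Reference score (attention score computed by this site): 15.319293934673915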
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pretrained video diffusion models provide powerful spatiotemporal generative priors, making them a natural foundation for robotic world models. While recent world-action models jointly optimize future videos and actions, they predominantly treat video generation as an auxiliary representation for policy learning. Consequently, they insufficiently explore the inverse problem: leveraging action signals to guide video synthesis, thereby often failing to preserve precise robot spatial geometry and fine-grained robot-object interaction dynamics in the generated rollouts. To bridge this gap, we present EA-WM, an Event-Aware Generative World Model that effectively closes the loop between kinematic control and visual perception. Rather than injecting joint or end-effector actions as abstract, low-dimensional tokens, EA-WM projects actions and kinematic states directly into the target camera view as Structured Kinematic-to-Visual Action Fields. To fully exploit this geometrically grounded representation, we introduce event-aware bidirectional fusion blocks that modulate cross-branch attention, capturing object state changes and interaction dynamics. Evaluated on the comprehensive WorldArena benchmark, EA-WM achieves state-of-the-art performance, outperforming existing baselines by a significant margin.
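Abstract (reference translation): Pretrained video diffusion models provide strong spatiotemporal generative priors, which makes them a natural foundation for robotic world models. Recent world-action models jointly optimize future videos and actions, but they mainly treat video generation as an auxiliary representation for policy learning. As a result, they leave the inverse problem of using action signals to guide video synthesis underexplored, and they often fail to preserve precise robot spatial geometry and fine-grained robot-object interaction dynamics in the generated rollouts. To bridge this gap, the paper proposes EA-WM, an event-aware generative world model that effectively closes the loop between kinematic control and visual perception. Instead of injecting joint or end-effector actions as abstract, low-dimensional tokens, EA-WM projects actions and kinematic states directly into the target camera view as Structured Kinematic-to-Visual Action Fields. To make full use of this geometrically grounded representation, it introduces event-aware bidirectional fusion blocks that modulate cross-branch attention and capture object state changes and interaction dynamics. Evaluated on the comprehensive WorldArena benchmark, EA-WM achieves state-of-the-art performance and outperforms existing baselines by a significant margin.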
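
The abstract's central idea, projecting kinematic states into the target camera view rather than passing them as low-dimensional tokens, can be illustrated with a small sketch. This is not EA-WM's implementation, which the abstract does not specify. It assumes a standard pinhole camera model with 3D keypoints already expressed in the camera frame, and it renders them as a Gaussian heatmap. The function names `project_points` and `rasterize_field`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the `sigma` parameter are all hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project_points(points_cam: Sequence[Point3D],
                   fx: float, fy: float, cx: float, cy: float) -> List[Pixel]:
    """Project 3D points in the camera frame (z pointing forward) to pixel
    coordinates with a pinhole model. Points at or behind the image plane
    (z <= 0) are skipped."""
    pixels: List[Pixel] = []
    for x, y, z in points_cam:
        if z <= 0:
            continue
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


def rasterize_field(pixels: Sequence[Pixel], width: int, height: int,
                    sigma: float = 1.5) -> List[List[float]]:
    """Render projected keypoints as a heatmap: each grid cell holds the
    maximum Gaussian response over all keypoints (assumed design choice)."""
    field = [[0.0] * width for _ in range(height)]
    for u, v in pixels:
        for row in range(height):
            for col in range(width):
                d2 = (col - u) ** 2 + (row - v) ** 2
                response = math.exp(-d2 / (2.0 * sigma ** 2))
                field[row][col] = max(field[row][col], response)
    return field


if __name__ == "__main__":
    # Hypothetical end-effector keypoint 2 m in front of the camera.
    keypoints = [(0.5, -0.25, 2.0)]
    px = project_points(keypoints, fx=8.0, fy=8.0, cx=4.0, cy=4.0)
    heatmap = rasterize_field(px, width=8, height=8)
    print(px)
    print(round(heatmap[3][6], 3))
```

In a sketch like this, the resulting map is spatially aligned with the video frames, which is what lets an image-space conditioning signal carry geometric information that an abstract action token would not.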

Related papers