Fugu-MT Paper Translation (Abstract): MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

Paper overview: MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

arxiv url: http://arxiv.org/abs/2606.02753v1
Date: Mon, 01 Jun 2026 18:20:20 GMT
Status: Translation complete
Last updated in system: 2026-06-03 22:00:04.533611
Title: MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data
Title (reference translation): MetaWorld: Scaling a Multi-Agent Video World Model with Single-View Video Data
Authors: Teng Hu, Mingchun Lu, Yating Wang, Jiangning Zhang, Jinkun Hao, Ye Pan, Ran Yi, Lizhuang Ma, Dacheng Tao
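Abstract summary: MetaWorld is a new framework that scales multi-agent video world models to open-domain environments directly from single-view videos. It improves cross-view consistency and identity fidelity, and establishes a highly scalable, physics-driven paradigm for multi-agent video world modeling.
Reference score (independently computed attention score): 125.43597497646444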
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limited to a single agent observing from a single perspective. Extending these models to multi-agent settings introduces two critical challenges: data scarcity (coordinated multi-view recordings are prohibitively expensive to collect for general open-domain scenarios) and world state alignment (independently generated video streams cannot ensure that shared physical environments and events evolve consistently across views). To address these challenges, we propose MetaWorld, a novel framework that scales multi-agent video world models to open-domain environments directly from single-view videos. First, we introduce Monocular World-State Unrolling (MWSU) to explicitly decompose monocular footage into the camera operator's ego-motion and the visible subject's spatial trajectory. This camera-trajectory decomposition naturally extracts synchronized multi-agent motion data within a shared 3D space, completely bypassing the need for multi-camera setups. Second, for precise visual control, we develop the Subject-Aware World Generator to enable appearance-driven simulation conditioned on per-agent identity images. Finally, to ensure both views are grounded in the identical physical reality, we propose World-State Alignment, a per-frame inter-branch cross-attention mechanism inserted at every transformer layer of the video DiT. By jointly synchronizing the denoising process, WSA enforces both static geometric consistency and dynamic motion consistency, encouraging that the shared 3D environment and physical events remain well-aligned across both egocentric views. Extensive experiments demonstrate that MetaWorld achieves superior cross-view consistency and identity fidelity, establishing a highly scalable, physics-driven paradigm for multi-agent video world modeling.
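Abstract (reference translation): Video world models are a foundational generative technology for embodied AI and the Metaverse, but existing approaches are inherently limited to a single agent observing from a single viewpoint. Extending these models to multi-agent settings raises two key challenges: data scarcity (coordinated multi-view recordings are prohibitively expensive to collect for general open-domain scenarios) and world state alignment (independently generated video streams cannot guarantee that the shared physical environment and events evolve consistently across views). To address these challenges, we propose MetaWorld, a new framework that scales multi-agent video world models to open-domain environments directly from single-view videos. First, we introduce Monocular World-State Unrolling (MWSU), which explicitly decomposes monocular footage into the camera operator's ego-motion and the visible subject's spatial trajectory. This camera-trajectory decomposition naturally extracts synchronized multi-agent motion data within a shared 3D space and completely removes the need for multi-camera setups. Second, for precise visual control, we develop the Subject-Aware World Generator, which enables appearance-driven simulation conditioned on per-agent identity images. Finally, to ensure that both views are grounded in the same physical reality, we propose World-State Alignment (WSA), a per-frame inter-branch cross-attention mechanism inserted at every transformer layer of the video DiT. By jointly synchronizing the denoising process, WSA enforces both static geometric consistency and dynamic motion consistency, encouraging the shared 3D environment and physical events to stay well aligned across both egocentric views. Extensive experiments show that MetaWorld achieves superior cross-view consistency and identity fidelity, establishing a highly scalable, physics-driven paradigm for multi-agent video world modeling.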
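
Illustrative sketch (not from the paper): the abstract describes World-State Alignment as per-frame cross-attention between the two view branches. The pure-Python example below shows one such exchange under strong simplifying assumptions: a single attention head, no learned query/key/value projections, and a plain residual update. The names `cross_attend` and `align_branches` are hypothetical, and the paper's actual layer design may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Each query token attends over the other branch's tokens.

    Assumption: keys and values are the raw tokens themselves
    (no learned projections), single head, scaled dot-product scores.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in keys_values]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, keys_values))
                 for d in range(dim)]
        out.append(mixed)
    return out


def align_branches(branch_a, branch_b):
    """One inter-branch exchange.

    Each branch is a list of frames; each frame is a list of token vectors.
    Tokens of frame t in one branch attend only to frame t of the other
    branch (the "per-frame" restriction), and the result is added back
    as a residual.
    """
    new_a, new_b = [], []
    for frame_a, frame_b in zip(branch_a, branch_b):
        upd_a = cross_attend(frame_a, frame_b)
        upd_b = cross_attend(frame_b, frame_a)
        new_a.append([[x + u for x, u in zip(tok, up)]
                      for tok, up in zip(frame_a, upd_a)])
        new_b.append([[x + u for x, u in zip(tok, up)]
                      for tok, up in zip(frame_b, upd_b)])
    return new_a, new_b


if __name__ == "__main__":
    # Two branches, one frame each, one 2-dimensional token per frame.
    view_a = [[[1.0, 0.0]]]
    view_b = [[[0.0, 2.0]]]
    print(align_branches(view_a, view_b))
```

In this sketch, changing a later frame of one branch leaves the updated tokens of earlier frames in the other branch untouched, which is what restricting the attention to same-time frames implies. A real video DiT would apply learned projections and, per the abstract, repeat the exchange at every transformer layer.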
