Fugu-MT Paper Translation (Abstract): Prisma-World: Camera-Controllable Multi-Agent Video World Model

Paper overview: Prisma-World: Camera-Controllable Multi-Agent Video World Model

arxiv url: http://arxiv.org/abs/2606.09507v1
Date: Mon, 08 Jun 2026 13:59:50 GMT
Status: Translation complete
System update date: 2026-06-09 14:42:07.168218
Title: Prisma-World: Camera-Controllable Multi-Agent Video World Model
Title (reference translation): Prisma-World: A Camera-Controllable Multi-Agent Video World Model
Authors: Huiqiang Sun, Zhan Peng, Size Wu, Kun Wang, Kang Liao, Dianyi Wang, Xingyu Zeng, Sheng Jin, Yangguang Li, Zhiguo Cao, Ziwei Liu, Wei Li
Abstract summary: We introduce Prisma-World, a camera-controllable multi-agent world model. It formulates multi-agent generation as a joint geometry-aware denoising process for cross-view consistency. Experiments confirm that a single Prisma-World model can generate high-fidelity multi-agent videos with flexible agent counts.
Reference score (independently computed attention metric): 67.72842238020192
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video world models have made rapid progress in generating controllable visual experiences, but most of them still simulate the world from a single observer. Extending such models to multiple agents raises a central challenge: if each agent's future state is generated independently, overlapping views may instantiate different versions of the same scene, leading to inconsistent objects, layouts, and appearances across agents. Conventional camera conditioning controls individual trajectories, but it does not explicitly couple the generation of views that should agree under shared scene geometry. We introduce Prisma-World, a camera-controllable multi-agent world model that formulates multi-agent generation as a joint geometry-aware denoising process for cross-view consistency. Prisma-World processes all agent videos within one full-attention sequence, uses a multi-agent RoPE design to distinguish agent identities while preserving synchronized temporal coordinates, and injects relative camera geometry into attention to bias overlapping viewpoints toward shared scene evidence. To further strengthen multi-view consistency and enhance global spatial perception, we augment our framework with an overlap-decaying curriculum training paradigm alongside minimap-conditioned structural guidance. To facilitate the training and evaluation of multi-agent models, we introduce PrismaDataset, a large-scale UE5 dataset with panoramic acquisition across diverse scenes, composable multi-agent view groups with flexible agent counts and complex camera trajectories, and precise camera/action annotations for consistency training and evaluation. Experiments show that a single Prisma-World model can generate high-fidelity multi-agent videos with flexible agent numbers, camera controllability, improved cross-view consistency, and spatial grounding under minimap guidance.
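Abstract (reference translation): Video world models have progressed rapidly at generating controllable visual experiences, yet most still simulate the world from a single observer. Extending them to multiple agents raises a central challenge: when each agent's future state is generated independently, overlapping views can instantiate different versions of the same scene, producing inconsistent objects, layouts, and appearances across agents. Conventional camera conditioning controls individual trajectories, but it does not explicitly couple the generation of views that should agree under shared scene geometry. This paper introduces Prisma-World, a camera-controllable multi-agent world model that formulates multi-agent generation as a joint geometry-aware denoising process for cross-view consistency. Prisma-World processes all agent videos within one full-attention sequence, uses a multi-agent RoPE design to distinguish agent identities while preserving synchronized temporal coordinates, and injects relative camera geometry into attention so that overlapping viewpoints are biased toward shared scene evidence. To further strengthen multi-view consistency and enhance global spatial perception, the framework is augmented with an overlap-decaying curriculum training paradigm together with minimap-conditioned structural guidance. To facilitate training and evaluation of multi-agent models, the authors introduce PrismaDataset, a large-scale UE5 dataset with panoramic acquisition across diverse scenes, composable multi-agent view groups with flexible agent counts and complex camera trajectories, and precise camera/action annotations for consistency training and evaluation. Experiments show that a single Prisma-World model can generate high-fidelity multi-agent videos with flexible agent counts, camera controllability, improved cross-view consistency, and spatial grounding under minimap guidance.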
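The abstract says that all agent videos share one attention sequence, with positions that distinguish agents while keeping time synchronized, but it does not give the coordinate layout. The following is a minimal sketch of one possible layout, not the authors' implementation: it assumes each token carries an (agent, time, row, column) index, and the function name `multi_agent_positions` and its parameters are hypothetical.

```python
from typing import List, Tuple

# Hypothetical helper: the (agent, t, row, col) layout is an assumption,
# not a detail stated in the abstract.
Position = Tuple[int, int, int, int]


def multi_agent_positions(
    num_agents: int, num_frames: int, height: int, width: int
) -> List[Position]:
    """Build one (agent, t, row, col) coordinate per token of a joint sequence.

    Tokens of all agents are concatenated into a single sequence. The agent
    index differs between agents, while the time index t is shared, so frame t
    of every agent sits at the same temporal coordinate.
    """
    positions: List[Position] = []
    for agent in range(num_agents):
        for t in range(num_frames):
            for row in range(height):
                for col in range(width):
                    positions.append((agent, t, row, col))
    return positions


if __name__ == "__main__":
    # Two agents, three frames, a 1x2 token grid per frame.
    for pos in multi_agent_positions(2, 3, 1, 2):
        print(pos)
```

In a rotary scheme, each of the four indices would typically drive its own slice of the rotation frequencies. Under this assumption, two agents looking at the same instant differ only in the agent component.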

Related papers