Fugu-MT Paper Translation (Abstract): WorldAgents: Can Foundation Image Models be Agents for 3D World Models?

Paper summary: WorldAgents: Can Foundation Image Models be Agents for 3D World Models?

arxiv url: http://arxiv.org/abs/2603.19708v1
Date: Fri, 20 Mar 2026 07:22:41 GMT
Status: Translation complete
Last updated in system: 2026-03-23 19:48:39.037954
Title: WorldAgents: Can Foundation Image Models be Agents for 3D World Models?
Title (reference translation): WorldAgents: Can Foundation Image Models be Agents for 3D World Models?
Authors: Ziya Erkoç, Angela Dai, Matthias Nießner
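Abstract summary: We demonstrate that 2D models do indeed encapsulate a grasp of 3D worlds. By exploiting this understanding, our method successfully synthesizes expansive, realistic, and 3D-consistent worlds.
Reference score (independently computed attention score): 82.83725150353915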
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, we investigate a fundamental question: do 2D foundation image models inherently possess 3D world model capabilities? To answer this, we systematically evaluate multiple state-of-the-art image generation models and Vision-Language Models (VLMs) on the task of 3D world synthesis. To harness and benchmark their potential implicit 3D capability, we propose an agentic framing to facilitate 3D world generation. Our approach employs a multi-agent architecture: a VLM-based director that formulates prompts to guide image synthesis, a generator that synthesizes new image views, and a VLM-backed two-step verifier that evaluates and selectively curates generated frames from both 2D image and 3D reconstruction space. Crucially, we demonstrate that our agentic approach provides coherent and robust 3D reconstruction, producing output scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we demonstrate that 2D models do indeed encapsulate a grasp of 3D worlds. By exploiting this understanding, our method successfully synthesizes expansive, realistic, and 3D-consistent worlds.
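Abstract (reference translation): Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, we ask whether 2D foundation image models inherently possess 3D world model capabilities. To answer this, we systematically evaluate multiple state-of-the-art image generation models and Vision-Language Models (VLMs) on the task of 3D world synthesis. To harness and benchmark their implicit 3D capability, we propose an agentic framing that facilitates 3D world generation. The approach uses a multi-agent architecture: a VLM-based director that formulates prompts to guide image synthesis, a generator that synthesizes new image views, and a VLM-backed two-step verifier that evaluates and selectively curates generated frames in both 2D image space and 3D reconstruction space. Importantly, we show that this agentic approach provides coherent and robust 3D reconstruction, producing output scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we show that 2D models do indeed encapsulate a grasp of 3D worlds. By exploiting this understanding, our method successfully synthesizes expansive, realistic, and 3D-consistent worlds.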
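
The director, generator, and two-step verifier described in the abstract can be pictured as a simple control loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `build_world`, the `Frame` type, the callable signatures, the retry limit, and the ordering of the 2D check before the 3D check are all hypothetical, since the abstract does not specify any interfaces.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Frame:
    """One generated view plus the prompt that produced it (hypothetical type)."""
    prompt: str
    image: Any  # placeholder for whatever image object a real generator returns


def build_world(
    director: Callable[[List[Frame]], str],
    generator: Callable[[str, List[Frame]], Any],
    verify_2d: Callable[[Frame, List[Frame]], bool],
    verify_3d: Callable[[Frame, List[Frame]], bool],
    num_steps: int,
    max_retries: int = 3,  # assumption: the abstract does not mention retries
) -> List[Frame]:
    """Grow a list of accepted frames with a director/generator/verifier loop."""
    kept: List[Frame] = []
    for _ in range(num_steps):
        for _ in range(max_retries):
            # Director: propose a prompt for the next view given accepted frames.
            prompt = director(kept)
            # Generator: synthesize a candidate view from that prompt.
            candidate = Frame(prompt, generator(prompt, kept))
            # Verifier, step 1 (assumed order): check in 2D image space.
            if not verify_2d(candidate, kept):
                continue
            # Verifier, step 2 (assumed order): check in 3D reconstruction space.
            if not verify_3d(candidate, kept):
                continue
            kept.append(candidate)
            break
    return kept
```

In a real system each callable would wrap a VLM or an image model; here they are left as plain functions so that the control flow can be exercised with toy stand-ins.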


Related papers