Fugu-MT Paper Translation (Abstract): SceneConductor: 3D Scene Generation from a Single Image with Multi-Agent Orchestration

Paper summary: SceneConductor: 3D Scene Generation from a Single Image with Multi-Agent Orchestration

arxiv url: http://arxiv.org/abs/2606.08402v3
Date: Tue, 16 Jun 2026 06:24:38 GMT
Status: Translation complete
Last updated in system: 2026-06-17 17:15:32.017184
Title: SceneConductor: 3D Scene Generation from a Single Image with Multi-Agent Orchestration
Title (reference translation): SceneConductor: 3D Scene Generation from a Single Image via Multi-Agent Orchestration
Authors: Jeonghwan Kim, Yushi Lan, Yongwei Chen, Hieu Trung Nguyen, Chuanyu Pan, Xingang Pan
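Abstract summary: The paper proposes a multi-agent orchestration framework that decomposes single-image 3D scene generation into three structured stages. It also proposes a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. The method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.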
Reference score (independently computed attention score): 32.39337008619354
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object relationships, and environmental context from inherently ambiguous visual evidence. Despite recent progress in joint layout-and-mesh generation, existing methods often rely on holistic or weakly decomposed pipelines that entangle many factors at once and demand extensive scene-level supervision, limiting their generalization to complex real-world environments. We propose a multi-agent orchestration framework that decomposes single-image 3D scene generation into three structured stages: scene initialization, environment construction, and multi-agent refinement. The initialization stage extracts image-derived object masks, builds object-level 3D representations, and predicts an initial spatial layout to form a coarse 3D scene. The environment-construction stage then leverages this initialization together with point-map geometry to build an environmental scaffold of supporting surfaces, room boundaries, materials, and illumination. Finally, in the refinement stage, a planner agent identifies structural and visual inconsistencies, applies simple corrections directly, and dispatches specialist agents for complex localized revisions that are reintegrated into the global scene. To provide reliable structural initialization while reducing reliance on scene-level annotations, we further introduce a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. Unlike fully supervised layout generators, the predictor can be trained from segmentation-level data and generalizes robustly to diverse real-world scenes. Extensive experiments on benchmark datasets show that our method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.
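Abstract (reference translation): Generating a complete 3D scene from a single image requires inferring globally consistent geometry, object relationships, and environmental context from inherently ambiguous visual evidence. Despite recent progress in joint layout-and-mesh generation, existing methods often rely on holistic or weakly decomposed pipelines that entangle many factors at once, which limits their generalization to complex real-world environments. This paper proposes a multi-agent orchestration framework divided into three stages: scene initialization, environment construction, and multi-agent refinement. The initialization stage extracts image-derived object masks, builds object-level 3D representations, and predicts an initial spatial layout to form a coarse 3D scene. The environment-construction stage uses this initialization together with point-map geometry to build an environmental scaffold of supporting surfaces, room boundaries, materials, and illumination. Finally, in the refinement stage, a planner agent identifies structural and visual inconsistencies, applies simple corrections directly, and dispatches specialist agents for complex localized revisions that are reintegrated into the global scene. To achieve reliable structural initialization while reducing reliance on scene-level annotations, the paper introduces a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. Unlike fully supervised layout generators, the predictor can be trained from segmentation-level data and generalizes robustly to diverse real-world scenes. Extensive experiments on benchmark datasets show that the method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.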
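
The three-stage flow described in the abstract (initialization, environment construction, planner-led refinement) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Scene`, `Issue`, `initialize`, `build_environment`, `find_issues`, `refine`, `lift_specialist`) is hypothetical, and the scene is reduced to a mapping from object names to heights above a floor so that the control flow stays visible.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Scene:
    """Toy stand-in for a 3D scene: object name -> height above the floor."""
    objects: Dict[str, float] = field(default_factory=dict)
    environment: Dict[str, float] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


@dataclass
class Issue:
    target: str
    kind: str  # "floating" (simple) or "penetrating" (treated as complex here)


def initialize(object_heights: Dict[str, float]) -> Scene:
    # Stage 1 (assumed): stands in for masks -> object 3D -> coarse layout.
    return Scene(objects=dict(object_heights))


def build_environment(scene: Scene) -> Scene:
    # Stage 2 (assumed): stands in for the environmental scaffold; only a floor.
    scene.environment["floor_height"] = 0.0
    return scene


def find_issues(scene: Scene) -> List[Issue]:
    # Planner-side check: flag objects that do not rest on the floor.
    floor = scene.environment["floor_height"]
    issues = []
    for name, height in scene.objects.items():
        if height > floor:
            issues.append(Issue(name, "floating"))
        elif height < floor:
            issues.append(Issue(name, "penetrating"))
    return issues


Specialist = Callable[[Scene, Issue], None]


def refine(scene: Scene, specialists: Dict[str, Specialist],
           max_rounds: int = 3) -> Scene:
    # Stage 3 (assumed): the planner fixes simple issues itself and
    # dispatches the rest to a specialist registered for that issue kind.
    for _ in range(max_rounds):
        issues = find_issues(scene)
        if not issues:
            break
        for issue in issues:
            if issue.kind == "floating":
                scene.objects[issue.target] = scene.environment["floor_height"]
                scene.log.append(f"planner fixed {issue.target}")
            else:
                specialists[issue.kind](scene, issue)
                scene.log.append(f"specialist fixed {issue.target}")
    return scene


def lift_specialist(scene: Scene, issue: Issue) -> None:
    # Hypothetical specialist: raise a penetrating object back onto the floor.
    scene.objects[issue.target] = scene.environment["floor_height"]
```

Calling `refine(build_environment(initialize({"chair": 0.3, "table": -0.1})), {"penetrating": lift_specialist})` runs all three stages in order. The split between planner-handled and specialist-handled issue kinds is an arbitrary choice for the sketch; the abstract only states that simple corrections are applied directly and complex localized revisions are dispatched.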


Related papers