Fugu-MT Paper Translation (Abstract): GeoSceneGraph: Geometric Scene Graph Diffusion Model for Text-guided 3D Indoor Scene Synthesis

Paper overview: GeoSceneGraph: Geometric Scene Graph Diffusion Model for Text-guided 3D Indoor Scene Synthesis

arxiv url: http://arxiv.org/abs/2511.14884v1
Date: Tue, 18 Nov 2025 20:06:49 GMT
Status: Translation complete
System update date: 2025-11-20 15:51:28.514296
Title: GeoSceneGraph: Geometric Scene Graph Diffusion Model for Text-guided 3D Indoor Scene Synthesis
Title (reference translation): GeoSceneGraph: A Geometric Scene Graph Diffusion Model for Text-Guided 3D Indoor Scene Synthesis
Authors: Antonio Ruiz, Tao Wu, Andrew Melnik, Qing Cheng, Xuqin Wang, Lu Liu, Yongliang Wang, Yanfeng Zhang, Helge Ritter
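Abstract summary: Methods that synthesize indoor 3D scenes from text prompts are widely applied in film production, interior design, video games, virtual reality, and synthetic data generation for training embodied agents. Existing approaches typically either train generative models from scratch or leverage vision-language models (VLMs). We introduce GeoSceneGraph, which synthesizes 3D scenes from text prompts.
Reference score (independently computed attention score): 14.137982018879049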
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Methods that synthesize indoor 3D scenes from text prompts have wide-ranging applications in film production, interior design, video games, virtual reality, and synthetic data generation for training embodied agents. Existing approaches typically either train generative models from scratch or leverage vision-language models (VLMs). While VLMs achieve strong performance, particularly for complex or open-ended prompts, smaller task-specific models remain necessary for deployment on resource-constrained devices such as extended reality (XR) glasses or mobile phones. However, many generative approaches that train from scratch overlook the inherent graph structure of indoor scenes, which can limit scene coherence and realism. Conversely, methods that incorporate scene graphs either demand a user-provided semantic graph, which is generally inconvenient and restrictive, or rely on ground-truth relationship annotations, limiting their capacity to capture more varied object interactions. To address these challenges, we introduce GeoSceneGraph, a method that synthesizes 3D scenes from text prompts by leveraging the graph structure and geometric symmetries of 3D scenes, without relying on predefined relationship classes. Despite not using ground-truth relationships, GeoSceneGraph achieves performance comparable to methods that do. Our model is built on equivariant graph neural networks (EGNNs), but existing EGNN approaches are typically limited to low-dimensional conditioning and are not designed to handle complex modalities such as text. We propose a simple and effective strategy for conditioning EGNNs on text features, and we validate our design through ablation studies.
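Abstract (reference translation): Methods for synthesizing indoor 3D scenes from text prompts are widely applied in film production, interior design, video games, virtual reality, and synthetic data generation for training embodied agents. Existing approaches generally either train generative models from scratch or use vision-language models (VLMs). VLMs achieve high performance, especially on complex or open-ended prompts, but smaller task-specific models are still needed for deployment on resource-constrained devices such as extended reality (XR) glasses and mobile phones. However, many generative approaches trained from scratch overlook the inherent graph structure of indoor scenes, which can limit scene coherence and realism. Conversely, methods that incorporate scene graphs either require a user-provided semantic graph, which is generally inconvenient and restrictive, or rely on ground-truth relationship annotations, which limits their ability to capture more diverse object interactions. To address these challenges, we introduce GeoSceneGraph, which synthesizes 3D scenes from text prompts. Without using ground-truth relationships, GeoSceneGraph achieves performance comparable to methods that do use them. Our model is built on equivariant graph neural networks (EGNNs), but existing EGNN approaches are usually limited to low-dimensional conditioning and are not designed to handle complex modalities such as text. We propose a simple and effective strategy for conditioning EGNNs on text features and validate the design through ablation studies.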

Related papers