Fugu-MT Paper Translation (Summary): Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion

Paper summary: Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion

arxiv url: http://arxiv.org/abs/2604.16552v1
Date: Fri, 17 Apr 2026 07:28:04 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.067775
Title: Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
Title (reference translation): Simultaneous generation of layout and shape from text via autoregressive 3D diffusion
Authors: Zhenggang Tang, Yuehao Wang, Yuchen Fan, Jun-Kun Chen, Yu-Ying Yeh, Kihyuk Sohn, Zhangyang Wang, Qixing Huang, Alexander Schwing, Rakesh Ranjan, Dilin Wang, Zhicheng Yan
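Abstract summary: This paper proposes a new generative model for interactive scene creation. At its core is the 3D autoregressive diffusion model 3D-ARD+, which unifies autoregressive generation over a multimodal token sequence with diffusion generation of the next object's 3D latents. The 7B 3D-ARD+ is evaluated on challenging scenes, showing that the model can generate and place objects following the non-trivial spatial layouts and semantics prescribed by the text instructions.
Reference score (independently computed attention score): 115.33888186717162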
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recent text-to-scene generation approaches have largely reduced the manual effort required to create 3D scenes. However, they focus on generating either a scene layout or objects, and few generate both. The generated scene layout is often simple even with the help of an LLM. Moreover, the generated scene is often inconsistent with text input that contains non-trivial descriptions of the shape, appearance, and spatial arrangement of the objects. We present a new paradigm of sequential text-to-scene generation and propose a novel generative model for interactive scene creation. At the core is a 3D Autoregressive Diffusion model, 3D-ARD+, which unifies autoregressive generation over a multimodal token sequence and diffusion generation of next-object 3D latents. To generate the next object, the model uses one autoregressive step to generate the coarse-grained 3D latents in the scene space, conditioned on both the text instructions seen so far and the already synthesized 3D scene. It then uses a second step to generate the 3D latents in the smaller object space, which can be decoded into fine-grained object geometry and appearance. We curate a large dataset of 230K indoor scenes with paired text instructions for training. We evaluate the 7B 3D-ARD+ on challenging scenes and show that the model can generate and place objects following the non-trivial spatial layouts and semantics prescribed by the text instructions.
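Abstract (reference translation): Recent approaches to text-to-scene generation have greatly reduced the manual work needed to create 3D scenes. However, they focus on generating either the scene layout or the objects, and few generate both. The generated scene layout is often simple, even with the help of an LLM. In addition, the generated scene is often inconsistent with text input that contains non-trivial descriptions of the objects' shape, appearance, and spatial arrangement. This paper presents a new paradigm of sequential text-to-scene generation and proposes a new generative model for interactive scene creation. At its core is the 3D autoregressive diffusion model 3D-ARD+, which unifies autoregressive generation over a multimodal token sequence with diffusion generation of the next object's 3D latents. To generate the next object, the model uses one autoregressive step to generate coarse-grained 3D latents in the scene space, conditioned on both the text instructions seen so far and the 3D scene already synthesized. It then uses a second step to generate 3D latents in the smaller object space, which can be decoded into fine-grained object geometry and appearance. A large dataset of 230K indoor scenes with paired text instructions is curated for training. The 7B 3D-ARD+ is evaluated on challenging scenes, showing that the model can generate and place objects following the non-trivial spatial layouts and semantics prescribed by the text instructions.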

Related papers