Fugu-MT Paper Translation (Summary): Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation

Paper summary: Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation

arxiv url: http://arxiv.org/abs/2606.06002v2
Date: Fri, 05 Jun 2026 01:59:58 GMT
Status: Translation complete
Last updated in system: 2026-06-08 12:21:17.574913
Title: Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation
Authors: Mengshi Qi, Wei Deng, Xianlin Zhang, Huadong Ma
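Abstract summary: This paper treats the task as a planning problem constrained by spatial and layout commonsense. To solve it, the authors model it as a tree search problem with global and local trees, which differs from existing sequential decision-making approaches. Experiments show that the method generates more realistic 3D scenes than state-of-the-art methods.
Reference score (attention score computed independently by this site): 48.70065830279983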
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Vision-Language Models have achieved significant reasoning performance in various tasks. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods employ chain-of-thought sequential decision mechanisms that cannot revise earlier decisions, causing error propagation. In this paper, we consider the task as a planning problem constrained by spatial and layout commonsense. To solve this problem, we model it as a tree search problem with global and local trees, which differs from existing sequential decision-making approaches. In the global tree, we place each object iteratively and explore multiple attempts like humans furnishing a room, where the problem space is represented as a tree. To effectively search the tree, we propose a hierarchical scene representation and a PRM-guided MCTS method. This representation abstracts a scene into room level, region level, floor object level, and supported object level. The PRM-guided MCTS method uses the PRM to prune unnecessary branches and the MCTS algorithm to balance exploration and exploitation to get an optimal solution with fewer attempts. In the local tree, it further decomposes the placement of each object into finer sub-steps, including the specific placement parameters. To make the whole appearance of the scene consistent, we leverage pre-trained diffusion image generative models to predict textures for all the objects in the scene. As existing benchmarks for text-to-3D indoor scene generation remain limited in scale and diversity, we collect a new large-scale diverse dataset that contains 65 scene types and 3250 instructions with diverse sizes, layouts, and styles, named 3DTindo-bench, to better assess the capability of the state-of-the-art models. Our experiments show that our method generates more realistic 3D scenes than state-of-the-art methods.
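The PRM-guided MCTS idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the grid, the object list, the `prm_score` heuristic (a hand-written stand-in for a learned process reward model), and the pruning threshold are all assumptions made for illustration. It covers only a single tree that places one object per level and does not model the local tree, the hierarchical scene representation, or any LVLM calls.

```python
import math
import random

# Hypothetical toy problem: place each object in one cell of a 3x3 grid.
OBJECTS = ["bed", "desk", "lamp"]
CELLS = [(x, y) for x in range(3) for y in range(3)]


def prm_score(layout):
    """Stand-in for a process reward model (assumption, not the paper's PRM).
    Returns 0.0 if two objects share a cell, else grows with the smallest
    pairwise Manhattan distance, capped at 1.0."""
    cells = list(layout.values())
    if len(set(cells)) < len(cells):
        return 0.0
    if len(cells) < 2:
        return 1.0
    dists = [abs(a[0] - b[0]) + abs(a[1] - b[1])
             for i, a in enumerate(cells) for b in cells[i + 1:]]
    return min(1.0, min(dists) / 2.0)


class Node:
    def __init__(self, layout, parent=None):
        self.layout = layout      # dict: object name -> grid cell
        self.parent = parent
        self.children = None      # None until the node is expanded
        self.visits = 0
        self.value = 0.0

    def is_terminal(self):
        return len(self.layout) == len(OBJECTS)

    def expand(self, prune_below):
        """Create one child per placement of the next object,
        dropping placements the PRM scores below the threshold."""
        obj = OBJECTS[len(self.layout)]
        self.children = []
        for cell in CELLS:
            layout = {**self.layout, obj: cell}
            if prm_score(layout) >= prune_below:
                self.children.append(Node(layout, self))


def uct(child, parent_visits, c=1.4):
    """UCT score: mean reward (exploitation) plus a visit-count bonus (exploration)."""
    if child.visits == 0:
        return math.inf
    return (child.value / child.visits
            + c * math.sqrt(math.log(parent_visits) / child.visits))


def rollout(layout, prune_below, rng):
    """Randomly complete a partial layout, using the PRM to skip bad placements."""
    layout = dict(layout)
    for obj in OBJECTS[len(layout):]:
        options = [cell for cell in CELLS
                   if prm_score({**layout, obj: cell}) >= prune_below]
        if not options:
            return layout, 0.0    # dead end: every placement was pruned
        layout[obj] = rng.choice(options)
    return layout, prm_score(layout)


def search(iterations=300, prune_below=0.5, seed=0):
    rng = random.Random(seed)
    root = Node({})
    best_layout, best_score = None, -1.0
    for _ in range(iterations):
        node = root
        # Selection and expansion: descend by UCT until reaching an unvisited node.
        while not node.is_terminal():
            if node.children is None:
                node.expand(prune_below)
            if not node.children:
                break
            parent = node
            node = max(parent.children, key=lambda ch: uct(ch, parent.visits))
            if node.visits == 0:
                break
        # Simulation: complete the layout and score it.
        layout, reward = rollout(node.layout, prune_below, rng)
        if len(layout) == len(OBJECTS) and reward > best_score:
            best_layout, best_score = layout, reward
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return best_layout, best_score


if __name__ == "__main__":
    print(search())
```

In this sketch the pruning threshold plays the role the abstract assigns to the PRM (discarding unnecessary branches), while `uct` handles the exploration and exploitation balance. A real system would replace `prm_score` with a learned model and the grid cells with continuous placement parameters.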


Related papers