Fugu-MT Paper Translation (Abstract): Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation

Paper overview: Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation

arxiv url: http://arxiv.org/abs/2606.06002v1
Date: Thu, 04 Jun 2026 10:56:21 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:44.732633
Title: Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation
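Title (reference translation): Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation
Authors: Mengshi Qi, Wei Deng, Xianlin Zhang, Huadong Ma
Abstract summary: This paper treats the task as a planning problem constrained by spatial and layout commonsense. The authors model it as a tree search problem with global and local trees, in contrast to existing sequential decision-making approaches. Experiments show that the method generates more realistic 3D scenes than state-of-the-art approaches.
Reference score (independently computed attention score): 48.70065830279983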
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Vision-Language Models have achieved significant reasoning performance in various tasks. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods employ chain-of-thought sequential decision mechanisms that cannot revise earlier decisions, causing error propagation. In this paper, we consider the task as a planning problem constrained by spatial and layout commonsense. To solve this problem, we model it as a tree search problem with global and local trees, which differs from existing sequential decision-making approaches. In the global tree, we place each object iteratively and explore multiple attempts like humans furnishing a room, where the problem space is represented as a tree. To effectively search the tree, we propose a hierarchical scene representation and a PRM-guided MCTS method. The hierarchical representation abstracts a scene into room level, region level, floor object level, and supported object level. The PRM-guided MCTS method uses the PRM to prune unnecessary branches and the MCTS algorithm to balance exploration and exploitation to get an optimal solution with fewer attempts. In the local tree, it further decomposes the placement of each object into finer sub-steps, including the specific placement parameters. To make the whole appearance of the scene consistent, we leverage pre-trained diffusion image generative models to predict textures for all the objects in the scene. As existing benchmarks for text-to-3D indoor scene generation remain limited in scale and diversity, we collect a new large-scale diverse dataset that contains 65 scene types and 3,250 instructions with diverse sizes, layouts, and styles, named 3DTindo-bench, to better assess the capability of the state-of-the-art models. Our experiments show that our method generates more realistic 3D scenes than state-of-the-art approaches.
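Abstract (reference translation): Large vision-language models have achieved notable reasoning performance across various tasks, yet there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that LVLM-based methods employ sequential decision mechanisms that cannot revise earlier decisions, which causes error propagation. In this paper, we regard the task as a planning problem constrained by spatial and layout commonsense. To solve it, we model it as a tree search problem with global and local trees, unlike existing sequential decision-making approaches. In the global tree, we place each object iteratively and explore multiple attempts, with the problem space represented as a tree. To search the tree effectively, we propose a hierarchical scene representation and a PRM-guided MCTS method. The hierarchical representation abstracts a scene into room level, region level, floor object level, and supported object level; the PRM-guided MCTS method uses the PRM to prune unnecessary branches and the MCTS algorithm to balance exploration and exploitation. Experiments show that our method generates more realistic 3D scenes than state-of-the-art approaches.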
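
The global-tree idea in the abstract (place objects one at a time, prune weak candidate steps with a PRM, and let MCTS balance exploration and exploitation) can be illustrated with a small toy sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the 4x4 grid, the hand-written `toy_prm` heuristic standing in for a learned process reward model, the `prune_below` threshold, the UCB1 selection rule, and all function names. The local tree and the hierarchical scene representation are not modeled.

```python
import math
import random
from dataclasses import dataclass, field

# Hypothetical toy room: each object occupies one cell of a 4x4 grid.
GRID = [(x, y) for x in range(4) for y in range(4)]

def toy_prm(placements, cell):
    """Stand-in for a process reward model: scores one candidate step in [0, 1].
    Occupied cells score 0; wall (border) cells score higher than interior cells."""
    if cell in placements:
        return 0.0
    x, y = cell
    return 1.0 if (x in (0, 3) or y in (0, 3)) else 0.4

@dataclass
class Node:
    placements: tuple                  # cells chosen so far, one per placed object
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    untried: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

def make_node(placements, parent, n_objects, prune_below):
    node = Node(placements, parent)
    if len(placements) < n_objects:
        # PRM-guided pruning: keep only steps the (toy) PRM scores highly enough.
        node.untried = [c for c in GRID if toy_prm(placements, c) >= prune_below]
    return node

def ucb1(child, parent_visits, c=1.4):
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def scene_reward(placements):
    """Toy terminal reward: mean PRM score over the steps taken."""
    scores = [toy_prm(placements[:i], cell) for i, cell in enumerate(placements)]
    return sum(scores) / len(scores)

def rollout(placements, n_objects, rng):
    """Complete the scene with random free cells and score the result."""
    placements = list(placements)
    while len(placements) < n_objects:
        free = [c for c in GRID if c not in placements]
        placements.append(rng.choice(free))
    return scene_reward(tuple(placements))

def mcts(n_objects=3, iterations=200, prune_below=0.5, seed=0):
    """Assumes 1 <= n_objects <= len(GRID)."""
    rng = random.Random(seed)
    root = make_node((), None, n_objects, prune_below)
    for _ in range(iterations):
        node = root
        # Selection: descend through fully expanded nodes by UCB1.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ucb1(ch, node.visits))
        # Expansion: add one unexplored child, if any remain.
        if node.untried:
            cell = node.untried.pop(rng.randrange(len(node.untried)))
            child = make_node(node.placements + (cell,), node, n_objects, prune_below)
            node.children.append(child)
            node = child
        # Simulation, then backpropagation up to the root.
        reward = rollout(node.placements, n_objects, rng)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the most-visited path as the chosen placements.
    best, node = [], root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
        best.append(node.placements[-1])
    return best

if __name__ == "__main__":
    print(mcts())
```

Because the pruning threshold sits between the two toy scores, interior cells never enter the tree, so the search budget is spent only on wall placements. In a real system the score would come from a trained model rather than a fixed rule.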

Related papers