Fugu-MT Paper Translation (Abstract): SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation

Paper overview: SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation

arxiv url: http://arxiv.org/abs/2603.12238v1
Date: Thu, 12 Mar 2026 17:55:07 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:26.277196
Title: SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation
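Title (reference translation): SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation
Authors: Jun Luo, Jiaxiang Tang, Ruijie Lu, Gang Zeng
Abstract summary: We introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages a modern 3D object generation model along with the spatial reasoning and planning capabilities of vision-language models. Our method also allows users to instruct the agent to edit existing scenes based on natural language commands.
Reference score (independently computed attention measure): 27.16255874731512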
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages a modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant
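Abstract (reference translation): Generating 3D scenes from natural-language text is highly desirable for digital content creation. However, existing methods are largely restricted to specific domains or depend on predefined spatial relationships, which limits their capacity for unconstrained, open-vocabulary 3D scene synthesis. This paper introduces SceneAssistant, a visual-feedback-driven agent for open-vocabulary 3D scene generation. The framework leverages a modern 3D object generation model together with the spatial reasoning and planning capabilities of vision-language models (VLMs). To enable open-vocabulary scene composition, the VLMs are given a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and acts on it, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results show that the method can generate diverse, open-vocabulary, high-quality 3D scenes. Both qualitative analysis and quantitative human evaluation show that the approach outperforms existing methods. In addition, the method lets users instruct the agent to edit existing scenes with natural-language commands. The code is available at https://github.com/ROUJINN/SceneAssistant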
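
Illustrative sketch (not from the paper): the abstract describes a loop in which a VLM receives rendered feedback and chooses among atomic operations such as Scale, Rotate, and FocusOn. The Python below is a minimal toy version of such a loop, written only to make the control flow concrete. Everything except the three operation names is an assumption: `Scene`, `SceneObject`, `Action`, `render`, `refine`, and `toy_policy` are hypothetical names, the "rendering" is a text summary rather than an image, and a hand-written rule stands in for the VLM. It is not the code in the authors' repository.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SceneObject:
    scale: float = 1.0
    rotation_deg: float = 0.0


@dataclass
class Scene:
    objects: dict[str, SceneObject] = field(default_factory=dict)
    focus: Optional[str] = None  # object the (hypothetical) camera looks at


@dataclass
class Action:
    op: str            # "Scale", "Rotate", or "FocusOn" (names from the abstract)
    target: str        # name of the object the operation applies to
    value: float = 0.0  # scale factor or rotation in degrees; unused by FocusOn


def apply_action(scene: Scene, action: Action) -> None:
    """Apply one atomic operation to the scene in place."""
    obj = scene.objects[action.target]
    if action.op == "Scale":
        obj.scale *= action.value
    elif action.op == "Rotate":
        obj.rotation_deg = (obj.rotation_deg + action.value) % 360.0
    elif action.op == "FocusOn":
        scene.focus = action.target
    else:
        raise ValueError(f"unknown operation: {action.op}")


def render(scene: Scene) -> str:
    """Stand-in for rendering: a text summary instead of an image."""
    parts = [
        f"{name}(scale={obj.scale:.2f}, rot={obj.rotation_deg:.0f})"
        for name, obj in sorted(scene.objects.items())
    ]
    return f"focus={scene.focus}; " + ", ".join(parts)


# A policy maps (prompt, feedback) to the next action, or None when satisfied.
Policy = Callable[[str, str], Optional[Action]]


def refine(scene: Scene, prompt: str, policy: Policy, max_steps: int = 10) -> int:
    """Run the feedback loop; return the number of actions applied."""
    for step in range(max_steps):
        feedback = render(scene)
        action = policy(prompt, feedback)
        if action is None:
            return step
        apply_action(scene, action)
    return max_steps


def toy_policy(prompt: str, feedback: str) -> Optional[Action]:
    """Hand-written rule standing in for a VLM (assumption, for illustration)."""
    if "focus=None" in feedback:
        return Action("FocusOn", "chair")
    if "chair(scale=1.00" in feedback:
        return Action("Scale", "chair", 2.0)
    return None


scene = Scene(objects={"chair": SceneObject()})
steps = refine(scene, "a large chair", toy_policy)
print(steps, render(scene))
```

In this toy setup the policy only ever sees the rendered summary, never the scene object itself, which mirrors the abstract's point that decisions are driven by visual feedback; a real system would replace `render` with an image renderer and `toy_policy` with a call to a VLM.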

Related papers