Decorum: A Language-Based Approach For Style-Conditioned Synthesis of Indoor 3D Scenes
- URL: http://arxiv.org/abs/2503.18155v2
- Date: Tue, 25 Mar 2025 15:58:36 GMT
- Title: Decorum: A Language-Based Approach For Style-Conditioned Synthesis of Indoor 3D Scenes
- Authors: Kelly O. Marshall, Omid Poursaeed, Sergiu Oprea, Amit Kumar, Anushrut Jignasu, Chinmay Hegde, Yilei Li, Rakesh Ranjan
- Abstract summary: 3D indoor scene generation is an important problem for the design of digital and real-world environments. Existing methods for this task exhibit very limited control over visual features and style preferences. Our proposed method Decorum enables users to control the scene generation process with natural language.
- Score: 18.232310061758298
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D indoor scene generation is an important problem for the design of digital and real-world environments. To automate this process, a scene generation model should be able to not only generate plausible scene layouts, but also take into consideration visual features and style preferences. Existing methods for this task exhibit very limited control over these attributes, only allowing text inputs in the form of simple object-level descriptions or pairwise spatial relationships. Our proposed method Decorum enables users to control the scene generation process with natural language by adopting language-based representations at each stage. This enables us to harness recent advancements in Large Language Models (LLMs) to model language-to-language mappings. In addition, we show that using a text-based representation allows us to select furniture for our scenes using a novel object retrieval method based on multimodal LLMs. Evaluations on the benchmark 3D-FRONT dataset show that our methods achieve improvements over existing work in text-conditioned scene synthesis and object retrieval.
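The abstract describes a multi-stage pipeline in which every intermediate representation is text: an LLM maps the user's natural-language request to per-object descriptions, and a multimodal LLM retrieves furniture by matching those descriptions against rendered catalog assets. Below is a minimal Python sketch of such a pipeline; every function name, prompt, and model interface here is a hypothetical illustration, not Decorum's actual implementation.

```python
# Hypothetical language-based pipeline in the spirit of the abstract.
# All names, prompts, and interfaces are illustrative assumptions.

def describe_objects(llm, request: str) -> list[str]:
    """Language-to-language mapping: expand a free-form styled scene
    request into one fine-grained text description per furniture item."""
    reply = llm(
        "List the furniture for this scene, one styled description "
        f"per line:\n{request}"
    )
    return [line.strip() for line in reply.splitlines() if line.strip()]

def retrieve_asset(mllm, description: str, catalog: list[dict]) -> dict:
    """Text-based object retrieval: score each catalog asset's rendering
    against the description with a multimodal LLM; keep the best match."""
    scores = [mllm.score(image=a["render"], text=description) for a in catalog]
    return catalog[max(range(len(catalog)), key=scores.__getitem__)]

def generate_scene(llm, mllm, request: str, catalog: list[dict]):
    layout = llm(f"Propose a plausible text-format room layout for: {request}")
    descriptions = describe_objects(llm, request)
    assets = [retrieve_asset(mllm, d, catalog) for d in descriptions]
    return layout, assets
```

The point of this sketch is only that every handoff between stages is plain text, which is what lets an off-the-shelf LLM model each stage as a language-to-language mapping, as the abstract claims.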
Related papers
- ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding [9.289977174410824]
3D visual grounding involves localizing entities in a 3D scene referred to by natural language text. We introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns.
arXiv Detail & Related papers (2025-01-02T17:20:41Z)
- The Scene Language: Representing Scenes with Programs, Words, and Embeddings [23.707974056165042]
We introduce the Scene Language, a visual scene representation that concisely and precisely describes the structure, semantics, and identity of visual scenes.
It represents a scene with three key components: a program that specifies the hierarchical and relational structure of entities in the scene, words in natural language that summarize the semantic class of each entity, and embeddings that capture the visual identity of each entity.
arXiv Detail & Related papers (2024-10-22T07:40:20Z)
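The three components described above map naturally onto a simple data structure. Here is a minimal sketch; all field and function names are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    """One entity in a Scene-Language-style representation.
    Field names are illustrative assumptions, not the paper's schema."""
    word: str                # natural-language semantic class, e.g. "chair"
    embedding: list[float]   # vector capturing the entity's visual identity

# The program component: code expressing the hierarchical and relational
# structure, here a toy function placing four chairs around a table.
def dining_set(chair: Entity, table: Entity):
    placements = [(table, (0.0, 0.0))]
    for i in range(4):
        placements.append((chair, (float(i % 2) * 2 - 1,
                                   float(i // 2) * 2 - 1)))
    return placements
```

Separating the program (structure) from words (semantics) and embeddings (visual identity) is what allows the representation to be both concise and precise, per the summary above.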
- SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model [7.707324214953882]
We introduce SceneScript, a method that produces full scene models as a sequence of structured language commands.
Our method infers the set of structured language commands directly from encoded visual data.
Our method gives state-of-the-art results in architectural layout estimation, and competitive results in 3D object detection.
arXiv Detail & Related papers (2024-03-19T18:01:29Z)
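To make "a sequence of structured language commands" concrete, here is an illustrative example of what such a sequence and a parser for it could look like; the command vocabulary and parameters are assumptions for exposition, not SceneScript's actual grammar.

```python
# Illustrative structured-command scene description; the command names
# and parameters are assumptions, not SceneScript's actual grammar.
scene_commands = [
    "make_wall, id=0, a_x=0.0, a_y=0.0, b_x=4.0, b_y=0.0, height=2.6",
    "make_wall, id=1, a_x=4.0, a_y=0.0, b_x=4.0, b_y=3.0, height=2.6",
    "make_door, id=2, wall_id=0, position_x=1.2, width=0.9, height=2.0",
]

def parse_command(line: str):
    """Split one command into its name and numeric key=value parameters."""
    name, *params = [token.strip() for token in line.split(",")]
    return name, {k: float(v) for k, v in (p.split("=") for p in params)}

for command in scene_commands:
    print(parse_command(command))
```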
- Open-Universe Indoor Scene Generation using LLM Program Synthesis and Uncurated Object Databases [13.126239167800652]
We present a system for generating indoor scenes in response to text prompts.
The prompts are not limited to a fixed vocabulary of scene descriptions.
The objects in generated scenes are not restricted to a fixed set of object categories.
arXiv Detail & Related papers (2024-02-05T01:59:31Z)
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
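The object-identifier idea above can be made concrete with a small sketch: each detected object gets a discrete identifier token, and the serialized object list becomes the language model's scene context. The token format and object fields here are assumptions, not Chat-Scene's actual interface.

```python
# Illustrative object-centric scene context for an LLM; the identifier
# token format and object fields are assumptions, not Chat-Scene's API.
objects = [
    {"id": "<OBJ001>", "label": "sofa", "center": (1.2, 0.0, 0.4)},
    {"id": "<OBJ002>", "label": "lamp", "center": (2.0, 0.9, 1.5)},
]

def scene_prompt(objects: list[dict], question: str) -> str:
    """Serialize objects as identifier-tagged lines plus a question."""
    lines = [f"{o['id']}: a {o['label']} at {o['center']}" for o in objects]
    return "\n".join(lines) + f"\nQ: {question}\nA:"

print(scene_prompt(objects, "Which object gives off light?"))
```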
- Set-the-Scene: Global-Local Training for Generating Controllable NeRF Scenes [68.14127205949073]
We propose a novel Global-Local training framework for synthesizing a 3D scene using object proxies.
We show that using proxies allows a wide variety of editing options, such as adjusting the placement of each independent object.
Our results show that Set-the-Scene offers a powerful solution for scene synthesis and manipulation.
arXiv Detail & Related papers (2023-03-23T17:17:29Z)
- Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain.
We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z)
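A common recipe for grounding a frozen text-only language model, consistent with the summary above, is to learn a lightweight projection from a visual encoder's embedding space into the LM's token-embedding space. The sketch below shows that idea; the dimensions and the one-pseudo-token-per-image design are simplifying assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Illustrative grounding module: learn a linear map from a frozen
    image encoder's embedding into a frozen LM's input-embedding space,
    so images can be interleaved with text as pseudo-tokens. Dimensions
    and the one-token-per-image design are simplifying assumptions."""

    def __init__(self, vis_dim: int = 768, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, lm_dim)  # the only trained weights

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        # (batch, vis_dim) -> (batch, 1, lm_dim): one pseudo-token per image
        return self.proj(image_emb).unsqueeze(1)
```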
- ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input.
We propose ABINet++, an autonomous, bidirectional, and iterative language model for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
- Zero-Shot Text-Guided Object Generation with Dream Fields [111.06026544180398]
We combine neural rendering with multi-modal image and text representations to synthesize diverse 3D objects.
Our method, Dream Fields, can generate the geometry and color of a wide range of objects without 3D supervision.
In experiments, Dream Fields produce realistic, multi-view consistent object geometry and color from a variety of natural language captions.
arXiv Detail & Related papers (2021-12-02T17:53:55Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.