Hierarchically-Structured Open-Vocabulary Indoor Scene Synthesis with Pre-trained Large Language Model
- URL: http://arxiv.org/abs/2502.10675v1
- Date: Sat, 15 Feb 2025 05:04:14 GMT
- Title: Hierarchically-Structured Open-Vocabulary Indoor Scene Synthesis with Pre-trained Large Language Model
- Authors: Weilin Sun, Xinran Li, Manyi Li, Kai Xu, Xiangxu Meng, Lei Meng
- Abstract summary: We propose to generate hierarchically structured scene descriptions with large language models (LLMs) and then compute the scene layouts.
Specifically, we train a hierarchy-aware network to infer the fine-grained relative positions between objects.
We also present open-vocabulary scene synthesis and interactive scene design results to show the strength of our approach in the applications.
- Score: 14.70850176122733
- License:
- Abstract: Indoor scene synthesis aims to automatically produce plausible, realistic and diverse 3D indoor scenes, especially given arbitrary user requirements. Recently, the promising generalization ability of pre-trained large language models (LLMs) has assisted open-vocabulary indoor scene synthesis. However, the challenge lies in converting the LLM-generated outputs into reasonable and physically feasible scene layouts. In this paper, we propose to generate hierarchically structured scene descriptions with an LLM and then compute the scene layouts. Specifically, we train a hierarchy-aware network to infer the fine-grained relative positions between objects and design a divide-and-conquer optimization to solve for scene layouts. The advantages of using a hierarchically structured scene representation are two-fold. First, the hierarchical structure provides a rough grounding for object arrangement, which alleviates contradictory placements arising from dense relations and enhances the network's ability to generalize when inferring fine-grained placements. Second, it naturally supports the divide-and-conquer optimization, which first arranges the sub-scenes and then the entire scene, to more effectively solve for a feasible layout. We conduct extensive comparison experiments and ablation studies with both qualitative and quantitative evaluations to validate the effectiveness of our key designs with the hierarchically structured scene representation. Our approach can generate more reasonable scene layouts that are better aligned with the user requirements and LLM descriptions. We also present open-vocabulary scene synthesis and interactive scene design results to show the strength of our approach in these applications.
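The abstract describes the pipeline only at a high level. As a minimal, hypothetical sketch of the divide-and-conquer idea, the following Python snippet lays out each sub-scene of an LLM-produced hierarchy before composing the parent scene; all names and the naive row-placement rule are illustrative assumptions, not the authors' actual network or optimizer.

```python
# Minimal sketch, assuming a toy hierarchy: each node is a (sub-)scene whose
# children are laid out first, then composed into the parent (divide and conquer).
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    label: str                                   # e.g. "bedroom", "desk area", "lamp"
    children: list["SceneNode"] = field(default_factory=list)
    position: tuple[float, float] = (0.0, 0.0)   # placement relative to the parent


def solve_layout(node: SceneNode, spacing: float = 1.5) -> None:
    """Arrange each sub-scene first, then place siblings within the parent."""
    for child in node.children:
        solve_layout(child, spacing)             # conquer: solve each sub-scene
    for i, child in enumerate(node.children):
        # Naive stand-in for the hierarchy-aware network's predicted
        # fine-grained relative positions: place siblings in a row.
        child.position = (i * spacing, 0.0)


# Toy hierarchy an LLM might produce for "a bedroom with a desk area".
root = SceneNode("bedroom", [
    SceneNode("bed"),
    SceneNode("desk area", [SceneNode("desk"), SceneNode("chair")]),
])
solve_layout(root)
```

In the paper's pipeline, the sibling-placement step would instead use the hierarchy-aware network's predicted relative positions together with a feasibility-aware layout optimization.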
Related papers
- S-INF: Towards Realistic Indoor Scene Synthesis via Scene Implicit Neural Field [25.539007827647737]
We introduce Scene Implicit Neural Field (S-INF) for indoor scene synthesis, aiming to learn meaningful representations of multimodal relationships.
S-INF disentangles the multimodal relationships into scene layout relationships and detailed object relationships, then fuses them through implicit neural fields.
It consistently achieves state-of-the-art performance across different types of indoor scene synthesis.
arXiv Detail & Related papers (2024-12-23T13:29:35Z)
- Forest2Seq: Revitalizing Order Prior for Sequential Indoor Scene Synthesis [109.50718968215658]
We propose Forest2Seq, a framework that formulates indoor scene synthesis as an order-aware sequential learning problem.
By employing a clustering-based algorithm and a breadth-first traversal, Forest2Seq derives meaningful orderings and utilizes a transformer to generate realistic 3D scenes autoregressively.
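As a toy illustration of this ordering prior (not Forest2Seq's actual algorithm), the following sketch flattens a hypothetical scene forest breadth-first into the kind of object sequence an autoregressive transformer could consume; the clusters and objects are invented for the example.

```python
# Hypothetical scene forest: each tree groups objects of one functional area,
# represented as (label, children) pairs.
from collections import deque

forest = [
    ("sleeping area", [("bed", []), ("nightstand", [])]),
    ("work area", [("desk", []), ("chair", []), ("lamp", [])]),
]


def breadth_first_order(trees):
    """Flatten a forest level by level: cluster roots first, then their members."""
    order, queue = [], deque(trees)
    while queue:
        label, children = queue.popleft()
        order.append(label)
        queue.extend(children)
    return order


print(breadth_first_order(forest))
# -> ['sleeping area', 'work area', 'bed', 'nightstand', 'desk', 'chair', 'lamp']
```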
arXiv Detail & Related papers (2024-07-07T14:32:53Z)
- Object-level Scene Deocclusion [92.39886029550286]
We present a new self-supervised PArallel visible-to-COmplete diffusion framework, named PACO, for object-level scene deocclusion.
To train PACO, we create a large-scale dataset with 500k samples to enable self-supervised learning.
Experiments on COCOA and various real-world scenes demonstrate the superior capability of PACO for scene deocclusion, surpassing the state of the art by a large margin.
arXiv Detail & Related papers (2024-06-11T20:34:10Z)
- N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields [112.02885337510716]
Nested Neural Feature Fields (N2F2) is a novel approach that employs hierarchical supervision to learn a single feature field.
We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space.
Our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization.
arXiv Detail & Related papers (2024-03-16T18:50:44Z)
- InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior [27.773451301040424]
InstructScene is a novel generative framework that integrates a semantic graph prior and a layout decoder.
We show that the proposed method surpasses existing state-of-the-art approaches by a large margin.
arXiv Detail & Related papers (2024-02-07T10:09:00Z)
- LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts [107.11267074981905]
We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion.
We show that LAW-Diffusion yields state-of-the-art generative performance, especially with coherent object relations.
arXiv Detail & Related papers (2023-08-13T08:06:18Z)
- CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z)
- Semantic Palette: Guiding Scene Generation with Class Proportions [34.746963256847145]
We introduce a conditional framework with novel architecture designs and learning objectives, which effectively accommodates class proportions to guide the scene generation process.
Thanks to the semantic control, we can produce layouts close to the real distribution, helping enhance the whole scene generation process.
We demonstrate the merit of our approach for data augmentation: semantic segmenters trained on real layout-image pairs augmented with synthetic pairs outperform models trained only on real pairs.
arXiv Detail & Related papers (2021-06-03T07:04:00Z)
- End-to-End Optimization of Scene Layout [56.80294778746068]
We propose an end-to-end variational generative model for scene layout synthesis conditioned on scene graphs.
We use scene graphs as an abstract but general representation to guide the synthesis of diverse scene layouts.
arXiv Detail & Related papers (2020-07-23T01:35:55Z)