CLIP-Layout: Style-Consistent Indoor Scene Synthesis with Semantic
Furniture Embedding
- URL: http://arxiv.org/abs/2303.03565v2
- Date: Fri, 2 Jun 2023 04:48:55 GMT
- Title: CLIP-Layout: Style-Consistent Indoor Scene Synthesis with Semantic
Furniture Embedding
- Authors: Jingyu Liu, Wenhan Xiong, Ian Jones, Yixin Nie, Anchit Gupta, Barlas Oğuz
- Abstract summary: Indoor scene synthesis involves automatically picking and placing furniture appropriately on a floor plan.
This paper introduces an auto-regressive scene model which can output instance-level predictions.
Our model achieves SOTA results in scene synthesis and improves auto-completion metrics by over 50%.
- Score: 17.053844262654223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Indoor scene synthesis involves automatically picking and placing furniture
appropriately on a floor plan, so that the scene looks realistic and is
functionally plausible. Such scenes can serve as homes for immersive 3D
experiences, or be used to train embodied agents. Existing methods for this
task rely on labeled categories of furniture, e.g. bed, chair or table, to
generate contextually relevant combinations of furniture. Whether heuristic or
learned, these methods ignore instance-level visual attributes of objects, and
as a result may produce visually less coherent scenes. In this paper, we
introduce an auto-regressive scene model which can output instance-level
predictions, using general purpose image embedding based on CLIP. This allows
us to learn visual correspondences such as matching color and style, and
produce more functionally plausible and aesthetically pleasing scenes.
Evaluated on the 3D-FRONT dataset, our model achieves SOTA results in scene
synthesis and improves auto-completion metrics by over 50%. Moreover, our
embedding-based approach enables zero-shot text-guided scene synthesis and
editing, which easily generalizes to furniture not seen during training.
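For intuition, here is a minimal sketch (not the authors' implementation) of the retrieval idea the abstract describes: furniture instances are embedded with a general-purpose CLIP image encoder, and a text prompt can be scored against those embeddings for zero-shot, text-guided selection of style-consistent objects. The checkpoint name, file names, and catalog below are illustrative assumptions.

```python
# Illustrative sketch only: match a text prompt against CLIP image embeddings of
# furniture renders. The catalog paths and model checkpoint are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Return L2-normalized CLIP embeddings for a list of furniture renders."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_text(prompt):
    """Return an L2-normalized CLIP embedding for a text prompt."""
    inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Hypothetical catalog of rendered furniture instances.
catalog_paths = ["chair_01.png", "chair_02.png", "sofa_01.png"]
catalog_emb = embed_images(catalog_paths)                 # (N, 512)
query_emb = embed_text("a mid-century walnut armchair")   # (1, 512)

scores = (catalog_emb @ query_emb.T).squeeze(-1)          # cosine similarities
best = catalog_paths[int(scores.argmax())]
print(f"closest catalog item: {best}")
```

Because the similarity is computed in the shared CLIP space, the same scoring works for image-to-image matching (style consistency between placed objects) and text-to-image matching (zero-shot editing), which is the property the abstract relies on.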
Related papers
- The Scene Language: Representing Scenes with Programs, Words, and Embeddings [23.707974056165042]
We introduce the Scene Language, a visual scene representation that concisely and precisely describes the structure, semantics, and identity of visual scenes.
It represents a scene with three key components: a program that specifies the hierarchical and relational structure of entities in the scene, words in natural language that summarize the semantic class of each entity, and embeddings that capture the visual identity of each entity.
arXiv Detail & Related papers (2024-10-22T07:40:20Z)
- Mixed Diffusion for 3D Indoor Scene Synthesis [55.94569112629208]
We present MiDiffusion, a novel mixed discrete-continuous diffusion model architecture.
We represent a scene layout by a 2D floor plan and a set of objects, each defined by its category, location, size, and orientation (a minimal sketch of this layout representation appears after this list).
Our experimental results demonstrate that MiDiffusion substantially outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis.
arXiv Detail & Related papers (2024-05-31T17:54:52Z)
- 3D scene generation from scene graphs and self-attention [51.49886604454926]
We present a variant of the conditional variational autoencoder (cVAE) model to synthesize 3D scenes from scene graphs and floor plans.
We exploit the properties of self-attention layers to capture high-level relationships between objects in a scene.
arXiv Detail & Related papers (2024-04-02T12:26:17Z)
- Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning [24.162598399141785]
Scene-LLM is a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments.
Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning.
arXiv Detail & Related papers (2024-03-18T01:18:48Z)
- Style-Consistent 3D Indoor Scene Synthesis with Decoupled Objects [84.45345829270626]
Controllable 3D indoor scene synthesis stands at the forefront of technological progress.
Current methods for scene stylization are limited to applying styles to the entire scene.
We introduce a unique pipeline designed for synthesizing 3D indoor scenes.
arXiv Detail & Related papers (2024-01-24T03:10:36Z)
- RoomDesigner: Encoding Anchor-latents for Style-consistent and Shape-compatible Indoor Scene Generation [26.906174238830474]
Indoor scene generation aims at creating shape-compatible, style-consistent furniture arrangements within a spatially reasonable layout.
We propose a two-stage model integrating shape priors into the indoor scene generation by encoding furniture as anchor latent representations.
arXiv Detail & Related papers (2023-10-16T03:05:19Z)
- Adjustable Visual Appearance for Generalizable Novel View Synthesis [12.901033240320725]
We present a generalizable novel view synthesis method.
It enables modifying the visual appearance of an observed scene so rendered views match a target weather or lighting condition.
Our method is based on a pretrained generalizable transformer architecture and is fine-tuned on synthetically generated scenes.
arXiv Detail & Related papers (2023-06-02T08:17:04Z)
- Control-NeRF: Editable Feature Volumes for Scene Rendering and Manipulation [58.16911861917018]
We present a novel method for performing flexible, 3D-aware image content manipulation while enabling high-quality novel view synthesis.
Our model couples learnt scene-specific feature volumes with a scene agnostic neural rendering network.
We demonstrate various scene manipulations, including mixing scenes, deforming objects and inserting objects into scenes, while still producing photo-realistic results.
arXiv Detail & Related papers (2022-04-22T17:57:00Z)
- Towards 3D Scene Understanding by Referring Synthetic Models [65.74211112607315]
Existing methods typically rely on extensive annotations of real scene scans.
We explore how labelled synthetic models can alleviate this burden by mapping synthetic and real scene features into a unified feature space.
Experiments show that our method achieves an average mAP of 46.08% on the ScanNet dataset and 55.49% on the S3DIS dataset by learning from synthetic models.
arXiv Detail & Related papers (2022-03-20T13:06:15Z)
- ATISS: Autoregressive Transformers for Indoor Scene Synthesis [112.63708524926689]
We present ATISS, a novel autoregressive transformer architecture for creating synthetic indoor environments.
We argue that this formulation is more natural, as it makes ATISS generally useful beyond fully automatic room layout synthesis.
Our model is trained end-to-end as an autoregressive generative model using only labeled 3D bounding boxes as supervision.
arXiv Detail & Related papers (2021-10-07T17:58:05Z)
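As referenced from the MiDiffusion entry above, the sketch below illustrates the scene-layout representation several of these layout generators share: a floor-plan mask plus a set of objects, each defined by its category, location, size, and orientation. The field names and shapes are assumptions for illustration, not any paper's actual schema.

```python
# Illustrative only: a common scene-layout schema (floor plan + per-object
# category/location/size/orientation). Field names are assumptions.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class SceneObject:
    category: str              # e.g. "bed", "nightstand"
    location: np.ndarray       # (3,) world-space position of the object's center
    size: np.ndarray           # (3,) bounding-box extents
    orientation: float         # rotation about the vertical axis, in radians

@dataclass
class SceneLayout:
    floor_plan: np.ndarray     # (H, W) binary occupancy mask of the room footprint
    objects: List[SceneObject] = field(default_factory=list)

    def add(self, obj: SceneObject) -> None:
        """Append one object, as an autoregressive generator would at each step."""
        self.objects.append(obj)

# Example: an empty room with one bed placed.
layout = SceneLayout(floor_plan=np.ones((64, 64), dtype=np.uint8))
layout.add(SceneObject("bed", np.array([1.2, 0.0, 2.5]), np.array([2.0, 0.5, 1.6]), 1.57))
print(len(layout.objects), "object(s) placed")
```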