LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts
- URL: http://arxiv.org/abs/2308.06713v1
- Date: Sun, 13 Aug 2023 08:06:18 GMT
- Title: LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts
- Authors: Binbin Yang, Yi Luo, Ziliang Chen, Guangrun Wang, Xiaodan Liang, Liang
Lin
- Abstract summary: We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion.
We show that LAW-Diffusion yields the state-of-the-art generative performance, especially with coherent object relations.
- Score: 107.11267074981905
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Thanks to the rapid development of diffusion models, unprecedented progress
has been witnessed in image synthesis. Prior works mostly rely on pre-trained
linguistic models, but a text is often too abstract to properly specify all the
spatial properties of an image, e.g., the layout configuration of a scene,
leading to the sub-optimal results of complex scene generation. In this paper,
we achieve accurate complex scene generation by proposing a semantically
controllable Layout-AWare diffusion model, termed LAW-Diffusion. Distinct from
the previous Layout-to-Image generation (L2I) methods that only explore
category-aware relationships, LAW-Diffusion introduces a spatial dependency
parser to encode the location-aware semantic coherence across objects as a
layout embedding and produces a scene with perceptually harmonious object
styles and contextual relations. To be specific, we delicately instantiate each
object's regional semantics as an object region map and leverage a
location-aware cross-object attention module to capture the spatial
dependencies among those disentangled representations. We further propose an
adaptive guidance schedule for our layout guidance to mitigate the trade-off
between the regional semantic alignment and the texture fidelity of generated
objects. Moreover, LAW-Diffusion allows for instance reconfiguration while
maintaining the other regions in a synthesized image by introducing a
layout-aware latent grafting mechanism to recompose its local regional
semantics. To better verify the plausibility of generated scenes, we propose a
new evaluation metric for the L2I task, dubbed Scene Relation Score (SRS) to
measure how the images preserve the rational and harmonious relations among
contextual objects. Comprehensive experiments demonstrate that our
LAW-Diffusion yields the state-of-the-art generative performance, especially
with coherent object relations.
Related papers
- DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
Our approach, in particular, allows for more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z) - Mapping High-level Semantic Regions in Indoor Environments without
Object Recognition [50.624970503498226]
The present work proposes a method for semantic region mapping via embodied navigation in indoor environments.
To enable region identification, the method uses a vision-to-language model to provide scene information for mapping.
By projecting egocentric scene understanding into the global frame, the proposed method generates a semantic map as a distribution over possible region labels at each location.
arXiv Detail & Related papers (2024-03-11T18:09:50Z) - Enhancing Object Coherence in Layout-to-Image Synthesis [13.289854750239956]
We propose a novel diffusion model with effective global semantic fusion (GSF) and self-similarity feature enhancement modules.
For semantic coherence, we argue that the image caption contains rich information for defining the semantic relationship within the objects in the images.
To improve the physical coherence, we develop a Self-similarity Coherence Attention synthesis (SCA) module to explicitly integrate local contextual physical coherence relation into each pixel's generation process.
arXiv Detail & Related papers (2023-11-17T13:43:43Z) - SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form
Layout-to-Image Generation [68.42476385214785]
We propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance.
SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works.
We also propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms.
arXiv Detail & Related papers (2023-08-20T04:09:12Z) - CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph
Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z) - Semantic-guided modeling of spatial relation and object co-occurrence for indoor scene recognition [5.083140094792973]
SpaCoNet simultaneously models Spatial relation and Co-occurrence of objects guided by semantic segmentation.
Experimental results on three widely used scene datasets demonstrate the effectiveness and generality of the proposed method.
arXiv Detail & Related papers (2023-05-22T03:04:22Z) - PandA: Unsupervised Learning of Parts and Appearances in the Feature
Maps of GANs [34.145110544546114]
We present an architecture-agnostic approach that jointly discovers factors representing spatial parts and their appearances in an entirely unsupervised fashion.
Our method is far more efficient in terms of training time and, most importantly, provides much more accurate localized control.
arXiv Detail & Related papers (2022-05-31T18:28:39Z) - Global Aggregation then Local Distribution for Scene Parsing [99.1095068574454]
We show that our approach can be modularized as an end-to-end trainable block and easily plugged into existing semantic segmentation networks.
Our approach allows us to build new state of the art on major semantic segmentation benchmarks including Cityscapes, ADE20K, Pascal Context, Camvid and COCO-stuff.
arXiv Detail & Related papers (2021-07-28T03:46:57Z) - SIRI: Spatial Relation Induced Network For Spatial Description
Resolution [64.38872296406211]
We propose a novel relationship induced (SIRI) network for language-guided localization.
We show that our method is around 24% better than the state-of-the-art method in terms of accuracy, measured by an 80-pixel radius.
Our method also generalizes well on our proposed extended dataset collected using the same settings as Touchdown.
arXiv Detail & Related papers (2020-10-27T14:04:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.