LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation
- URL: http://arxiv.org/abs/2303.17189v2
- Date: Tue, 12 Mar 2024 13:15:24 GMT
- Title: LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation
- Authors: Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, Xi Li
- Abstract summary: We propose a diffusion model named LayoutDiffusion that obtains higher generation quality and greater controllability than previous works.
In this paper, we propose to construct a structural image patch with region information and transform the patched image into a special layout to fuse with the normal layout in a unified form.
Our experiments show that LayoutDiffusion outperforms previous SOTA methods on FID and CAS by a relative 46.35% and 26.70% on COCO-Stuff, and 44.29% and 41.82% on VG.
- Score: 46.567682868550285
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, diffusion models have achieved great success in image synthesis. However, for layout-to-image generation, where an image often contains a complex scene with multiple objects, exerting strong control over both the global layout map and each detailed object remains challenging. In this paper, we propose a diffusion model named LayoutDiffusion that obtains higher generation quality and greater controllability than previous works. To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and to transform the patched image into a special layout that can be fused with the normal layout in a unified form. Moreover, a Layout Fusion Module (LFM) and Object-aware Cross Attention (OaCA) are proposed to model the relationships among multiple objects; both are designed to be object-aware and position-sensitive, allowing precise control over spatially related information. Extensive experiments show that LayoutDiffusion outperforms previous SOTA methods on FID and CAS by a relative 46.35% and 26.70% on COCO-Stuff and 44.29% and 41.82% on VG. Code is available at https://github.com/ZGCTroy/LayoutDiffusion.
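To make the unified-form fusion idea concrete, here is a minimal sketch, assuming a PyTorch-style implementation in which each layout object is a (class, box) token and each image patch is tagged with the region box it covers, so both sides live in one token space. Every name (LayoutEmbedder, patches_as_layout, ObjectAwareCrossAttention), shape, and hyperparameter below is an illustrative assumption inferred from the abstract, not the authors' code; see the repository linked above for the real implementation.

```python
# Illustrative sketch only; names and shapes are assumptions, not the paper's code.
import torch
import torch.nn as nn

class LayoutEmbedder(nn.Module):
    """Embed a layout: each object is a (class id, normalized box) pair."""
    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        self.cls_emb = nn.Embedding(num_classes, dim)
        self.box_proj = nn.Linear(4, dim)  # (x0, y0, x1, y1) -> dim

    def forward(self, classes, boxes):
        # classes: (B, K) int64; boxes: (B, K, 4) in [0, 1]
        return self.cls_emb(classes) + self.box_proj(boxes)

def patches_as_layout(feat, box_proj):
    """Turn a (B, C, H, W) feature map into 'structural image patches': one
    token per location, tagged with the region box it covers, so the patched
    image takes the same (content + region) form as the object layout."""
    B, C, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    boxes = torch.stack([xs / W, ys / H, (xs + 1) / W, (ys + 1) / H], dim=-1)
    boxes = boxes.reshape(1, H * W, 4).expand(B, -1, -1).to(feat)
    tokens = feat.flatten(2).transpose(1, 2)  # (B, HW, C)
    return tokens + box_proj(boxes)           # region-tagged patch tokens

class ObjectAwareCrossAttention(nn.Module):
    """Cross attention from patch tokens to layout tokens; both sides carry
    region information, so attention is object-aware and position-sensitive."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_tokens, layout_tokens):
        out, _ = self.attn(patch_tokens, layout_tokens, layout_tokens)
        return patch_tokens + out  # residual fusion

# Usage: fuse a 2-object layout into an 8x8 feature map.
dim = 64
embedder = LayoutEmbedder(num_classes=180, dim=dim)
feat = torch.randn(1, dim, 8, 8)
classes = torch.tensor([[3, 17]])
boxes = torch.tensor([[[0.1, 0.1, 0.5, 0.6], [0.4, 0.2, 0.9, 0.8]]])
layout = embedder(classes, boxes)                        # (1, 2, 64)
patches = patches_as_layout(feat, embedder.box_proj)     # (1, 64, 64)
fused = ObjectAwareCrossAttention(dim)(patches, layout)  # (1, 64, 64)
```

Because patches and objects share one (content + region) form here, a single cross-attention layer can fuse them while staying position-sensitive; the paper's actual LFM and OaCA presumably add more structure than this stub.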
Related papers
- A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding [76.44979557843367]
We propose a novel multi-view stereo (MVS) framework that removes the need for a depth-range prior.
We introduce a Multi-view Disparity Attention (MDA) module to aggregate long-range context information.
We explicitly estimate the quality of the current pixel corresponding to sampled points on the epipolar line of the source image.
arXiv Detail & Related papers (2024-11-04T08:50:16Z)
- HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation [11.087309945227826]
We propose a Hierarchical Controllable (HiCo) diffusion model for layout-to-image generation.
Our key insight is to achieve spatial disentanglement through hierarchical modeling of layouts.
To evaluate the performance of multi-objective controllable layout generation in natural scenes, we introduce the HiCo-7K benchmark.
arXiv Detail & Related papers (2024-10-18T09:36:10Z)
- Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image Segmentation (DIS) has recently emerged, targeting high-precision object segmentation from high-resolution natural images.
Existing methods rely on multiple tedious encoder-decoder streams and stages to gradually complete global localization and local refinement.
Motivated by this, we model DIS as a multi-view object perception problem and propose a parsimonious multi-view aggregation network (MVANet).
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z)
- DivCon: Divide and Conquer for Progressive Text-to-Image Generation [0.0]
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements.
Layout is employed as an intermediary to bridge large language models and layout-based diffusion models.
We introduce a divide-and-conquer approach that decouples the T2I generation task into simple subtasks; a rough sketch follows this entry.
arXiv Detail & Related papers (2024-03-11T03:24:44Z)
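A rough, hypothetical sketch of the divide-and-conquer pipeline described in the DivCon entry above, inferred from the summary alone: a language model first plans a layout from the prompt, then a layout-based diffusion model renders it. The function names and the stubbed planner are illustrative stand-ins, not DivCon's actual code.

```python
# Hypothetical two-stage T2I pipeline; the layout bridges the two subtasks.
from dataclasses import dataclass

@dataclass
class Box:
    label: str
    x0: float; y0: float; x1: float; y1: float  # normalized coordinates

def plan_layout(prompt: str) -> list[Box]:
    """Subtask 1: text -> layout. A real system would query an LLM here;
    this stub returns a fixed plan for illustration."""
    return [Box("dog", 0.05, 0.40, 0.45, 0.95),
            Box("frisbee", 0.55, 0.15, 0.80, 0.35)]

def render(prompt: str, layout: list[Box]) -> str:
    """Subtask 2: layout -> image. A real system would call a layout-based
    diffusion model; we just report what would be rendered."""
    spec = ", ".join(f"{b.label}@({b.x0:.2f},{b.y0:.2f},{b.x1:.2f},{b.y1:.2f})"
                     for b in layout)
    return f"image for '{prompt}' with objects: {spec}"

prompt = "a dog jumping to catch a frisbee"
print(render(prompt, plan_layout(prompt)))
```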
- DiAD: A Diffusion-based Framework for Multi-class Anomaly Detection [55.48770333927732]
We propose a Diffusion-based Anomaly Detection (DiAD) framework for multi-class anomaly detection.
It consists of a pixel-space autoencoder, a latent-space Semantic-Guided (SG) network connected to Stable Diffusion's denoising network, and a feature-space pre-trained feature extractor.
Experiments on MVTec-AD and VisA datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-12-11T18:38:28Z)
- SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation [68.42476385214785]
We propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts a feature map derived from the layout as guidance; a rough sketch of this idea follows this entry.
SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works.
We also propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms.
arXiv Detail & Related papers (2023-08-20T04:09:12Z)
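A minimal sketch of one plausible reading of the SSMG entry above, assuming the "spatial-semantic map" is a dense feature map rasterized from the layout, with each box stamping its class embedding onto the region it covers. This reading is inferred from the one-line summary alone; names and details are assumptions, not SSMG's implementation.

```python
# Hypothetical layout -> spatial-semantic guidance map rasterizer.
import torch
import torch.nn as nn

def layout_to_map(classes, boxes, cls_emb: nn.Embedding, size: int = 64):
    """classes: (K,) int64; boxes: (K, 4) normalized (x0, y0, x1, y1).
    Returns a (dim, size, size) spatial-semantic guidance map."""
    dim = cls_emb.embedding_dim
    smap = torch.zeros(dim, size, size)
    for cls, (x0, y0, x1, y1) in zip(classes, boxes):
        xs, xe = int(x0 * size), max(int(x1 * size), int(x0 * size) + 1)
        ys, ye = int(y0 * size), max(int(y1 * size), int(y0 * size) + 1)
        # Later boxes overwrite earlier ones in overlaps; a real model would
        # need a principled merge rule here.
        smap[:, ys:ye, xs:xe] = cls_emb(cls).view(dim, 1, 1)
    return smap

emb = nn.Embedding(180, 32)                    # COCO-Stuff-sized label set
classes = torch.tensor([3, 17])
boxes = torch.tensor([[0.1, 0.1, 0.5, 0.6], [0.4, 0.2, 0.9, 0.8]])
guidance = layout_to_map(classes, boxes, emb)  # (32, 64, 64) guidance map
```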
- DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing [94.24479528298252]
DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision.
By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images.
We present a challenging benchmark dataset called DragBench to evaluate the performance of interactive point-based image editing methods.
arXiv Detail & Related papers (2023-06-26T06:04:09Z)
- LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation [24.694298869398033]
Our method trains efficiently and generates images with both high perceptual quality and layout alignment.
Our method significantly outperforms 10 other generative models based on GANs, VQ-VAE, and diffusion models.
arXiv Detail & Related papers (2023-02-16T14:20:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.