Related papers: Spatial-Aware Latent Initialization for Controllable Image Generation

Spatial-Aware Latent Initialization for Controllable Image Generation

URL: http://arxiv.org/abs/2401.16157v1
Date: Mon, 29 Jan 2024 13:42:01 GMT
Title: Spatial-Aware Latent Initialization for Controllable Image Generation
Authors: Wenqiang Sun, Teng Li, Zehong Lin, Jun Zhang
Abstract summary: Text-to-image diffusion models have demonstrated impressive ability to generate high-quality images conditioned on the textual input. Previous research has primarily focused on aligning cross-attention maps with layout conditions. We propose leveraging a spatial-aware initialization noise during the denoising process to achieve better layout control.
Score: 9.23227552726271
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, text-to-image diffusion models have demonstrated impressive ability to generate high-quality images conditioned on the textual input. However, these models struggle to accurately adhere to textual instructions regarding spatial layout information. While previous research has primarily focused on aligning cross-attention maps with layout conditions, they overlook the impact of the initialization noise on the layout guidance. To achieve better layout control, we propose leveraging a spatial-aware initialization noise during the denoising process. Specifically, we find that the inverted reference image with finite inversion steps contains valuable spatial awareness regarding the object's position, resulting in similar layouts in the generated images. Based on this observation, we develop an open-vocabulary framework to customize a spatial-aware initialization noise for each layout condition. Without modifying other modules except the initialization noise, our approach can be seamlessly integrated as a plug-and-play module within other training-free layout guidance frameworks. We evaluate our approach quantitatively and qualitatively on the available Stable Diffusion model and COCO dataset. Equipped with the spatial-aware latent initialization, our method significantly improves the effectiveness of layout guidance while preserving high-quality content.

Related papers

STAY Diffusion: Styled Layout Diffusion Model for Diverse Layout-to-Image Generation [4.769823364778397]
We propose a diffusion-based model that produces photo-realistic images and provides fine-grained control of stylized objects in scenes. Our approach learns a global condition for each layout, and a self-supervised semantic map for weight modulation. A new Styled-Mask Attention (SM Attention) is also introduced to cross-condition the global condition and image feature for capturing the objects' relationships.
arXiv Detail & Related papers (2025-03-15T17:36:24Z)
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis [9.11767497956649]
This paper proposes leveraging the language comprehension capabilities of large vision-language models to guide the optimization of the initial noisy latent. We introduce the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency. Experimental results demonstrate the effectiveness and adaptability of our framework, consistently enhancing semantic alignment across various diffusion models.
arXiv Detail & Related papers (2024-11-25T15:40:47Z)
TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization [59.412236435627094]
TALE is a training-free framework harnessing the generative capabilities of text-to-image diffusion models. We equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization. Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition.
arXiv Detail & Related papers (2024-08-07T08:52:21Z)
DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task. We first apply attention masking in each denoising step to make the generation more disentangled across different objects. In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
arXiv Detail & Related papers (2024-06-03T17:59:53Z)
PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis [62.29033292210752]
High-quality images with consistent semantics and layout remains a challenge. We propose the adaPtive LAyout-semantiC fusion modulE (PLACE) that harnesses pre-trained models to alleviate the aforementioned issues. Our approach performs favorably in terms of visual quality, semantic consistency, and layout alignment.
arXiv Detail & Related papers (2024-03-04T09:03:16Z)
Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis [60.260724486834164]
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries. We present two key innovations: Vision Guidance and the Layered Rendering Diffusion framework. We apply our method to three practical applications: bounding box-to-image, semantic mask-to-image and image editing.
arXiv Detail & Related papers (2023-11-30T10:36:19Z)
FreePIH: Training-Free Painterly Image Harmonization with Diffusion Model [19.170302996189335]
Our FreePIH method tames the denoising process as a plug-in module for foreground image style transfer. We make use of multi-scale features to enforce the consistency of the content and stability of the foreground objects in the latent space. Our method can surpass representative baselines by large margins.
arXiv Detail & Related papers (2023-11-25T04:23:49Z)
LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis [24.925757148750684]
We propose a training-free approach for layout-to-image Synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions. LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods.
arXiv Detail & Related papers (2023-11-21T04:28:12Z)
Dense Text-to-Image Generation with Attention Modulation [49.287458275920514]
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions. We propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions. We achieve similar-quality visual results with models specifically trained with layout conditions.
arXiv Detail & Related papers (2023-08-24T17:59:01Z)
Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis [59.10787643285506]
Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks. One critical limitation of these models is the low fidelity of generated images with respect to the text description. We propose a new text-to-image algorithm that adds explicit control over spatial-temporal cross-attention in diffusion models.
arXiv Detail & Related papers (2023-04-07T23:49:34Z)
Image Harmonization with Region-wise Contrastive Learning [51.309905690367835]
We propose a novel image harmonization framework with external style fusion and region-wise contrastive learning scheme. Our method attempts to bring together corresponding positive and negative samples by maximizing the mutual information between the foreground and background styles.
arXiv Detail & Related papers (2022-05-27T15:46:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.