Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control
- URL: http://arxiv.org/abs/2402.13404v1
- Date: Tue, 20 Feb 2024 22:15:13 GMT
- Title: Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control
- Authors: Denis Lukovnikov, Asja Fischer
- Abstract summary: We show the limitations of ControlNet for the layout-to-image task and enable it to use localized descriptions.
We develop a novel cross-attention manipulation method in order to maintain image quality while improving control.
- Score: 20.533597112330018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While text-to-image diffusion models can generate high-quality images from
textual descriptions, they generally lack fine-grained control over the visual
composition of the generated images. Some recent works tackle this problem by
training the model to condition the generation process on additional input
describing the desired image layout. Arguably the most popular among such
methods, ControlNet, enables a high degree of control over the generated image
using various types of conditioning inputs (e.g. segmentation maps). However,
it still lacks the ability to take into account localized textual descriptions
that indicate which image region is described by which phrase in the prompt. In
this work, we show the limitations of ControlNet for the layout-to-image task
and enable it to use localized descriptions using a training-free approach that
modifies the cross-attention scores during generation. We adapt and investigate
several existing cross-attention control methods in the context of ControlNet
and identify shortcomings that cause failure (concept bleeding) or image
degradation under specific conditions. To address these shortcomings, we
develop a novel cross-attention manipulation method in order to maintain image
quality while improving control. Qualitative and quantitative experimental
studies focusing on challenging cases are presented, demonstrating the
effectiveness of the investigated general approach, and showing the
improvements obtained by the proposed cross-attention control method.
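To make the general recipe concrete, here is a minimal sketch of biasing cross-attention scores so that localized prompt phrases attend to their assigned layout regions. The bias scheme, shapes, and names are illustrative assumptions, not the paper's exact manipulation method.

```python
# Minimal sketch: bias cross-attention scores so each prompt phrase attends
# to its assigned layout region. The additive-bias scheme is an assumption.
import torch
import torch.nn.functional as F


def localized_cross_attention(q, k, v, region_masks, bias_strength=5.0):
    """q: (B, HW, d) image queries; k, v: (B, T, d) text keys/values.
    region_masks: (B, T, HW) binary masks assigning each text token to a
    region (all-ones rows for tokens that are not localized)."""
    d = q.shape[-1]
    scores = torch.einsum("bqd,bkd->bqk", q, k) / d**0.5      # (B, HW, T)

    # Positive bias inside a token's region, negative bias outside it.
    bias = bias_strength * (region_masks.transpose(1, 2) - 0.5) * 2.0
    attn = F.softmax(scores + bias, dim=-1)                    # (B, HW, T)
    return torch.einsum("bqk,bkd->bqd", attn, v)
```

In training-free methods of this kind, such a bias is usually applied only in the cross-attention layers of the frozen UNet/ControlNet and is often scheduled, e.g. restricted to the early denoising steps.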
Related papers
- Layout Control and Semantic Guidance with Attention Loss Backward for T2I Diffusion Model [13.67619785783182]
We propose a training-free method based on backpropagating an attention loss to control the cross-attention map.
Our approach has been applied in practical production settings, and we hope it can serve as an inspiring technical report.
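As a rough illustration of the attention-loss-backward idea, the sketch below computes a loss on the cross-attention maps at the current denoising step and uses its gradient to nudge the latent; the `unet_with_attn` interface and the loss form are assumptions rather than the paper's implementation.

```python
# Training-free attention-loss step: penalize attention mass outside each
# token's target region and update the latent with the loss gradient.
import torch


def attention_loss_step(latent, t, text_emb, target_masks, unet_with_attn, lr=0.1):
    """target_masks: (T, H, W) desired region per prompt token. The assumed
    unet_with_attn(latent, t, text_emb) returns (noise_pred, attn_maps),
    with attn_maps of shape (T, H, W) averaged over heads and layers."""
    latent = latent.detach().requires_grad_(True)
    _, attn_maps = unet_with_attn(latent, t, text_emb)

    # Attention that falls outside the target region is penalized.
    outside = attn_maps * (1.0 - target_masks)
    loss = outside.sum(dim=(1, 2)).mean()

    grad = torch.autograd.grad(loss, latent)[0]
    return (latent - lr * grad).detach()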
arXiv Detail & Related papers (2024-11-11T03:27:18Z)
- AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation, but struggles with fine-grained control over image generation.
AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process.
This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
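A hypothetical sketch of the multi-control idea, fusing several spatial control signals and the text prompt into a single conditioning sequence; the module layout below is an assumption and not AnyControl's actual architecture.

```python
# Sketch: encode each control modality separately, then fuse the tokens with
# the text embedding via a transformer layer into one unified embedding.
import torch
import torch.nn as nn


class MultiControlFusion(nn.Module):
    def __init__(self, dim=768, num_controls=3, heads=8):
        super().__init__()
        # One lightweight patch encoder per control modality (e.g. edge, depth, pose).
        self.control_encoders = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(3, dim, 16, stride=16), nn.Flatten(2))
             for _ in range(num_controls)]
        )
        self.fuse = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, controls, text_emb):
        """controls: list of (B, 3, H, W) maps; text_emb: (B, T, dim)."""
        tokens = [enc(c).transpose(1, 2) for enc, c in zip(self.control_encoders, controls)]
        return self.fuse(torch.cat([text_emb] + tokens, dim=1))  # unified embedding
```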
arXiv Detail & Related papers (2024-06-27T07:40:59Z)
- Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
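A minimal sketch of the underlying recipe, using frozen intermediate features of a pre-trained text-to-image diffusion model as the observation representation for a policy; the feature-extraction interface is assumed, and the paper's pipeline differs in its details.

```python
# Sketch: pool frozen diffusion-model features and feed them to a policy head.
import torch
import torch.nn as nn


class DiffusionFeaturePolicy(nn.Module):
    def __init__(self, extract_features, feat_dim, act_dim):
        super().__init__()
        # extract_features(image, prompt) -> (B, C, H, W) mid-level UNet features.
        self.extract_features = extract_features
        self.head = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, act_dim))

    def forward(self, image, prompt):
        with torch.no_grad():                  # diffusion backbone stays frozen
            feats = self.extract_features(image, prompt)
        pooled = feats.mean(dim=(2, 3))        # global average pooling
        return self.head(pooled)               # action logits / continuous action
```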
arXiv Detail & Related papers (2024-05-09T15:39:54Z)
- ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback [20.910939141948123]
ControlNet++ is a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls.
It achieves improvements over ControlNet by 11.1% mIoU, 13.4% SSIM, and 7.6% RMSE, respectively, for segmentation mask, line-art edge, and depth conditions.
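The cycle-consistency idea can be sketched as follows: a frozen discriminative model maps the generated image back to the control space, and the result is compared with the input control. The loss below is a generic placeholder (for segmentation masks a cross-entropy term would be the natural choice) and is not ControlNet++'s exact objective.

```python
# Sketch: re-extract the condition from the generated image and compare it
# with the input control signal.
import torch.nn.functional as F


def cycle_consistency_loss(generated_image, input_control, condition_extractor):
    """condition_extractor: e.g. a frozen segmentation or depth network that
    maps an image back to the control space."""
    predicted_control = condition_extractor(generated_image)
    return F.mse_loss(predicted_control, input_control)
```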
arXiv Detail & Related papers (2024-04-11T17:59:09Z)
- ECNet: Effective Controllable Text-to-Image Diffusion Models [31.21525123716149]
We introduce two innovative solutions for conditional text-to-image models.
Firstly, we propose a Spatial Guidance Injector (SGI) which enhances conditional detail by encoding text inputs with precise annotation information.
Secondly, to overcome the issue of limited conditional supervision, we introduce Diffusion Consistency Loss.
This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output.
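A hedged sketch of a diffusion-consistency style loss, under the assumption that the clean latent is estimated from the noise prediction at each step and aligned with the input signal through a projection; the projection head and loss form are guesses, not ECNet's formulation.

```python
# Sketch: recover an x0 estimate from the noise prediction and align a
# projection of it with the conditioning signal.
import torch
import torch.nn.functional as F


def diffusion_consistency_loss(x_t, noise_pred, alpha_bar_t, condition, project_to_condition):
    """alpha_bar_t: cumulative noise-schedule term, a tensor broadcastable to x_t."""
    # Standard DDPM identity: x0 ≈ (x_t - sqrt(1 - a_bar) * eps) / sqrt(a_bar)
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * noise_pred) / alpha_bar_t.sqrt()
    return F.mse_loss(project_to_condition(x0_hat), condition)
```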
arXiv Detail & Related papers (2024-03-27T10:09:38Z)
- Local Conditional Controlling for Text-to-Image Diffusion Models [26.54188248406709]
Diffusion models have exhibited impressive prowess in the text-to-image task.
Recent methods add image-level structure controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts to obtain desired images.
This controlling process is globally operated on the entire image, which limits the flexibility of control regions.
arXiv Detail & Related papers (2023-12-14T09:31:33Z)
- Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation [79.8881514424969]
Text-conditional diffusion models are able to generate high-fidelity images with diverse contents.
However, linguistic representations often describe the intended image only ambiguously.
We propose Cocktail, a pipeline to mix various modalities into one embedding.
arXiv Detail & Related papers (2023-06-01T17:55:32Z)
- Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [82.19740045010435]
We introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls and global controls.
Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models.
Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability.
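The training recipe, a frozen backbone plus two trainable adapters, can be sketched as below; the module names are placeholders.

```python
# Sketch: freeze all pre-trained diffusion-model weights and optimize only
# the local and global adapter modules.
import torch


def build_optimizer(diffusion_model, local_adapter, global_adapter, lr=1e-5):
    for p in diffusion_model.parameters():
        p.requires_grad_(False)            # frozen pre-trained text-to-image model

    trainable = list(local_adapter.parameters()) + list(global_adapter.parameters())
    return torch.optim.AdamW(trainable, lr=lr)
```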
arXiv Detail & Related papers (2023-05-25T17:59:58Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
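A minimal sketch of building a spatio-textual input: every pixel of the segmentation map is filled with the embedding of its region's local description. Shapes and names are illustrative, not SpaText's exact representation.

```python
# Sketch: turn a segmentation map plus per-region text embeddings into a
# dense spatio-textual conditioning tensor.
import torch


def spatio_textual_map(segmentation, region_text_embs, emb_dim):
    """segmentation: (H, W) integer region ids, 0 = background.
    region_text_embs: dict mapping region id -> (emb_dim,) text embedding."""
    H, W = segmentation.shape
    out = torch.zeros(H, W, emb_dim)
    for region_id, emb in region_text_embs.items():
        out[segmentation == region_id] = emb
    return out.permute(2, 0, 1)  # (emb_dim, H, W), concatenated to the model input
```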
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Attribute-specific Control Units in StyleGAN for Fine-grained Image Manipulation [57.99007520795998]
We discover attribute-specific control units, which consist of multiple channels of feature maps and modulation styles.
Specifically, we collaboratively manipulate the modulation style channels and feature maps in control units to obtain the semantic and spatial disentangled controls.
We move the modulation style along a specific sparse direction vector and replace the filter-wise styles used to compute the feature maps to manipulate these control units.
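A small sketch of the manipulation step: the modulation-style channels belonging to a discovered control unit are moved along a sparse direction vector. The discovery of units and directions is omitted, and the names are illustrative.

```python
# Sketch: shift the selected modulation-style channels of a control unit
# along a sparse direction.
import torch


def edit_control_unit(style, channel_idx, direction, strength=3.0):
    """style: (B, C) modulation style vector; channel_idx: indices of the
    channels in the control unit; direction: (len(channel_idx),) sparse
    direction for those channels."""
    edited = style.clone()
    edited[:, channel_idx] = edited[:, channel_idx] + strength * direction
    return edited
```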
arXiv Detail & Related papers (2021-11-25T10:42:10Z)
- Style Intervention: How to Achieve Spatial Disentanglement with Style-based Generators? [100.60938767993088]
We propose a lightweight optimization-based algorithm which could adapt to arbitrary input images and render natural translation effects under flexible objectives.
We verify the performance of the proposed framework in facial attribute editing on high-resolution images, where both photo-realism and consistency are required.
arXiv Detail & Related papers (2020-11-19T07:37:31Z)