Local Conditional Controlling for Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2312.08768v2
- Date: Tue, 6 Feb 2024 14:45:39 GMT
- Title: Local Conditional Controlling for Text-to-Image Diffusion Models
- Authors: Yibo Zhao, Liang Peng, Yang Yang, Zekai Luo, Hengjia Li, Yao Chen, Wei
Zhao, Qinglin Lu, Boxi Wu, Wei Liu
- Abstract summary: Diffusion models have exhibited impressive prowess in the text-to-image task.
Recent methods add image-level controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts to obtain desired images.
We introduce a new simple yet practical task setting: local control.
- Score: 22.732346931679555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have exhibited impressive prowess in the text-to-image task.
Recent methods add image-level controls, e.g., edge and depth maps, to
manipulate the generation process together with text prompts to obtain desired
images. This controlling process operates globally on the entire image, which
limits the flexibility of control regions. In this paper, we introduce a new,
simple yet practical task setting: local control. It focuses on controlling
specific local areas according to user-defined image conditions, while the
remaining areas are conditioned only on the original text prompt. This allows
users to flexibly control image generation in a fine-grained way. However,
achieving this goal is non-trivial: naively adding local conditions can lead to
a local control dominance problem, in which the controlled region dominates
generation and prompt-described concepts fail to appear in the non-control
areas. To mitigate this problem, we propose a training-free method that
leverages updates of the noised latents and of the cross-attention maps during
the denoising process to promote concept generation in non-control areas.
Moreover, we use feature mask constraints to mitigate the degradation of
synthesized image quality caused by information differences inside and outside
the local control area. Extensive experiments demonstrate that our method
synthesizes high-quality images that adhere to the prompt under local control
conditions. Code is
available at https://github.com/YibooZhao/Local-Control.
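Since the abstract describes the local control setting only in words, a minimal sketch may help make it concrete. Below, `unet` and `controlnet` are hypothetical callables standing in for a frozen pre-trained denoiser and its control branch, and `region_mask` is the user-defined binary mask; the sketch only illustrates restricting an image-level condition (e.g., an edge map) to a local region, not the authors' training-free latent and cross-attention update scheme.

```python
import torch.nn.functional as F

def masked_local_control_step(latents, t, text_emb, unet, controlnet,
                              control_image, region_mask):
    """One denoising step in which an image-level condition is applied only
    inside a user-defined region, while the rest of the image is conditioned
    by the text prompt alone.

    `unet` and `controlnet` are hypothetical frozen callables; `region_mask`
    is a (B, 1, H, W) binary mask over the latent grid. This is an
    illustration of the task setting, not the paper's method.
    """
    # The control branch predicts multi-resolution residuals from the local
    # condition (e.g., an edge or depth map).
    residuals = controlnet(latents, t, text_emb, control_image)

    # Zero the control signal outside the user-defined region, resizing the
    # mask to each residual's spatial resolution.
    local_residuals = [
        r * F.interpolate(region_mask, size=r.shape[-2:], mode="nearest")
        for r in residuals
    ]

    # Noise prediction with locally masked control vs. text-only prediction.
    eps_local = unet(latents, t, text_emb, residuals=local_residuals)
    eps_text = unet(latents, t, text_emb, residuals=None)

    # Blend the two predictions so only the masked area follows the condition.
    m = F.interpolate(region_mask, size=latents.shape[-2:], mode="nearest")
    return m * eps_local + (1.0 - m) * eps_text
```

In the paper's actual method, the non-control area is additionally steered by updating the noised latents and cross-attention maps so that prompt concepts still appear there.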
Related papers
- AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation, but struggles with fine-grained control over image generation.
AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process.
This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
arXiv Detail & Related papers (2024-06-27T07:40:59Z)
- DreamWalk: Style Space Exploration using Diffusion Guidance [19.065568106372222]
Unlike direct-editing tools like Photoshop, text-conditioned models require the artist to perform "prompt engineering".
Our goal is to provide fine-grained control over the style and substance specified by the prompt.
The method can be used in conjunction with LoRA- or DreamBooth-trained models.
arXiv Detail & Related papers (2024-04-04T01:39:01Z)
- Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control [20.533597112330018]
We show the limitations of ControlNet for the layout-to-image task and enable it to use localized descriptions.
We develop a novel cross-attention manipulation method in order to maintain image quality while improving control.
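This summary does not spell out the manipulation itself; one common way to realize such localized control is to mask where a description's tokens may attend. The sketch below is a generic illustration under assumed tensor shapes, not the paper's exact method.

```python
import torch

def localized_cross_attention(attn_logits, token_region_masks):
    """Restrict chosen text tokens to attend only inside their layout regions.

    attn_logits: (heads, hw, n_tokens) unnormalized cross-attention scores,
        where hw is the number of flattened spatial positions.
    token_region_masks: dict mapping a token index to a (hw,) binary mask
        that is 1 where the token's localized description should apply.

    Generic masked cross-attention; the paper's manipulation may differ.
    """
    logits = attn_logits.clone()
    neg_inf = torch.finfo(logits.dtype).min
    for tok, mask in token_region_masks.items():
        # Spatial positions outside the region cannot attend to this token.
        logits[:, :, tok] = logits[:, :, tok].masked_fill(mask == 0, neg_inf)
    return logits.softmax(dim=-1)
```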
arXiv Detail & Related papers (2024-02-20T22:15:13Z)
- ZONE: Zero-Shot Instruction-Guided Local Editing [56.56213730578504]
We propose a Zero-shot instructiON-guided local image Editing approach, termed ZONE.
We first convert the editing intent from the user-provided instruction into specific image editing regions through InstructPix2Pix.
We then propose a Region-IoU scheme for precise image layer extraction from an off-the-shelf segment model.
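As a rough illustration of what a Region-IoU style selection could look like (the exact scheme is not given in this summary), the sketch below matches a rough edit region against candidate masks from a segmentation model and keeps the best-overlapping one; the mask shapes and threshold are assumptions.

```python
import numpy as np

def select_segment_by_region_iou(edit_region, candidate_masks, threshold=0.5):
    """Pick the segmentation mask that best overlaps a rough edit region.

    edit_region: (H, W) binary array marking where the instruction-guided
        edit changed the image.
    candidate_masks: iterable of (H, W) binary masks from an off-the-shelf
        segmentation model.
    Illustrative only; ZONE's actual Region-IoU scheme may differ.
    """
    best_mask, best_iou = None, 0.0
    for mask in candidate_masks:
        inter = np.logical_and(edit_region, mask).sum()
        union = np.logical_or(edit_region, mask).sum()
        iou = inter / union if union > 0 else 0.0
        if iou > best_iou:
            best_mask, best_iou = mask, iou
    # Only accept the segment if the overlap is convincing enough.
    return (best_mask, best_iou) if best_iou >= threshold else (None, best_iou)
```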
arXiv Detail & Related papers (2023-12-28T02:54:34Z)
- ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models [21.379896810560282]
A popular approach is to use a controlling network, such as ControlNet, in combination with a pre-trained image generation model, such as Stable Diffusion.
In this work, we propose a new controlling architecture, called ControlNet-XS, which does not suffer from this problem.
In contrast to ControlNet, our model needs only a fraction of the parameters and is therefore about twice as fast at both inference and training time.
arXiv Detail & Related papers (2023-12-11T17:58:06Z)
- Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z)
- Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [82.19740045010435]
We introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls and global controls.
Unlike existing methods, Uni-ControlNet requires fine-tuning only two additional adapters on top of frozen pre-trained text-to-image diffusion models.
Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability.
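The adapter-only training recipe can be illustrated with a short sketch: the base model is frozen and only two small adapter modules receive gradients. The module shapes below are placeholders, not Uni-ControlNet's actual architecture.

```python
import torch
from torch import nn

# Placeholder stand-ins: a frozen pre-trained denoiser plus two small
# trainable adapters, mirroring the "two additional adapters" described above.
base_unet = nn.Sequential(nn.Conv2d(4, 64, 3, padding=1),
                          nn.Conv2d(64, 4, 3, padding=1))
local_adapter = nn.Conv2d(4, 4, 3, padding=1)   # e.g. for local controls
global_adapter = nn.Linear(512, 512)            # e.g. for a global embedding

# Freeze the pre-trained model; only the adapters are updated.
for p in base_unet.parameters():
    p.requires_grad_(False)

trainable = list(local_adapter.parameters()) + list(global_adapter.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```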
arXiv Detail & Related papers (2023-05-25T17:59:58Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: a pixel-based one and a latent-based one.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
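The "single target point" idea can be sketched with an off-the-shelf CLIP model: normalize the image and text embeddings and interpolate them. The checkpoint name, file path, and interpolation weight below are illustrative assumptions, and FlexIT's actual target construction may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("input.jpg")          # placeholder input image
instruction = "a photo of a husky"       # placeholder edit text

inputs = processor(text=[instruction], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# Normalize both embeddings, then interpolate to form a single target point
# in the CLIP multimodal space.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
alpha = 0.5                              # assumed image/text mixing weight
target = alpha * txt_emb + (1.0 - alpha) * img_emb
target = target / target.norm(dim=-1, keepdim=True)
```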
arXiv Detail & Related papers (2022-03-09T13:34:38Z)
- Attribute-specific Control Units in StyleGAN for Fine-grained Image Manipulation [57.99007520795998]
We discover attribute-specific control units, which consist of multiple channels of feature maps and modulation styles.
Specifically, we collaboratively manipulate the modulation style channels and feature maps in control units to obtain the semantic and spatial disentangled controls.
To manipulate these control units, we move the modulation style along a specific sparse direction vector and replace the filter-wise styles used to compute the feature maps.
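A minimal sketch of the "sparse direction" part of this manipulation (the channel indices, strength value, and names are assumptions; the full procedure also replaces the corresponding filter-wise styles):

```python
import torch

def edit_control_unit(style, direction, unit_channels, strength=3.0):
    """Move a modulation style along a direction that is sparsified to the
    channels of one attribute-specific control unit.

    style: (num_channels,) modulation style vector of one StyleGAN layer.
    direction: (num_channels,) editing direction.
    unit_channels: indices of the channels forming the control unit.
    Illustrative only; the paper's editing procedure is more involved.
    """
    sparse_dir = torch.zeros_like(direction)
    sparse_dir[unit_channels] = direction[unit_channels]
    return style + strength * sparse_dir
```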
arXiv Detail & Related papers (2021-11-25T10:42:10Z)