Local Conditional Controlling for Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2312.08768v3
- Date: Thu, 22 Aug 2024 06:27:48 GMT
- Title: Local Conditional Controlling for Text-to-Image Diffusion Models
- Authors: Yibo Zhao, Liang Peng, Yang Yang, Zekai Luo, Hengjia Li, Yao Chen, Zheng Yang, Xiaofei He, Wei Zhao, Qinglin Lu, Boxi Wu, Wei Liu
- Abstract summary: Diffusion models have exhibited impressive prowess in the text-to-image task.
Recent methods add image-level structure controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts and obtain desired images.
This control is applied globally to the entire image, which limits the flexibility of control regions.
- Score: 26.54188248406709
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have exhibited impressive prowess in the text-to-image task. Recent methods add image-level structure controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts and obtain desired images. This control is applied globally to the entire image, which limits the flexibility of control regions. In this paper, we explore a novel and practical task setting: local control. It focuses on controlling specific local regions according to user-defined image conditions, while the remaining regions are conditioned only by the original text prompt. However, achieving local conditional controlling is non-trivial. Naively adding local conditions directly may lead to the local control dominance problem, which forces the model to focus on the controlled region and neglect object generation in other regions. To mitigate this problem, we propose a Regional Discriminate Loss that updates the noised latents, aiming at enhanced object generation in non-control regions. Furthermore, the proposed Focused Token Response suppresses attention scores that lack the strongest response, enhancing object distinction and reducing duplication. Lastly, we adopt a Feature Mask Constraint to reduce the image quality degradation caused by information differences across the local control region. All proposed strategies operate at the inference stage. Extensive experiments demonstrate that our method synthesizes high-quality images aligned with the text prompt under local control conditions.
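As a concrete intuition for how such inference-time local control can be wired up, here is a minimal PyTorch sketch: the condition-injected noise prediction is blended in only inside the user mask, and the noised latents are nudged by the gradient of a score on non-control regions, loosely in the spirit of the Regional Discriminate Loss. All callables, shapes, and the loss form are illustrative assumptions, not the authors' released implementation.

```python
import torch

def local_control_step(eps_text, eps_ctrl, latents, t, region_mask,
                       noncontrol_score, lr=0.1):
    """One denoising step with the image condition restricted to a region.

    eps_text(latents, t): noise prediction from the text prompt alone.
    eps_ctrl(latents, t): noise prediction with the image condition injected.
    region_mask: (1, 1, H, W) binary mask of the user-controlled region.
    noncontrol_score(latents): differentiable scalar that rises when prompt
        objects respond strongly outside the mask (hypothetical helper).
    """
    latents = latents.detach().requires_grad_(True)
    # Nudge the noised latents so objects also form outside the controlled
    # region (a stand-in for the paper's Regional Discriminate Loss update).
    grad, = torch.autograd.grad(noncontrol_score(latents), latents)
    latents = (latents + lr * grad).detach()

    with torch.no_grad():
        # Local control: the condition-injected prediction acts only inside
        # the mask; elsewhere the plain text-conditioned prediction is used.
        eps = region_mask * eps_ctrl(latents, t) \
            + (1 - region_mask) * eps_text(latents, t)
    return latents, eps
```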
Related papers
- Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement [40.94329069897935]
We present RAG, a Region-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition.
RAG achieves superior attribute binding and object-relationship handling compared with previous tuning-free methods.
arXiv Detail & Related papers (2024-11-10T18:45:41Z)
- GLoD: Composing Global Contexts and Local Details in Image Generation [0.0]
Global-Local Diffusion (GLoD) is a novel framework that allows simultaneous control over global contexts and local details.
It assigns multiple global and local prompts to corresponding layers and composes their noise predictions to guide the denoising process, as sketched below.
Our framework enables complex global-local compositions, conditioning objects in the global prompt with the local prompts while preserving other unspecified identities.
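A rough sketch of that noise composition, under the simplifying assumption that the blending happens in latent space rather than per layer; `eps_fn`, the embeddings, and the masks are illustrative stand-ins, not GLoD's implementation.

```python
import torch

def composed_noise(eps_fn, latents, t, global_emb, local_prompts):
    """Compose one global and several local prompts into a single noise field.

    eps_fn(latents, t, text_emb): a text-conditioned noise predictor.
    local_prompts: list of (text_emb, mask) pairs; masks are (1, 1, H, W) bools.
    """
    eps = eps_fn(latents, t, global_emb)  # global context everywhere
    for emb, mask in local_prompts:
        # Inside each region, steer the denoising toward that local prompt.
        eps = torch.where(mask, eps_fn(latents, t, emb), eps)
    return eps
```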
arXiv Detail & Related papers (2024-04-23T18:39:57Z)
- Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control [20.533597112330018]
We show the limitations of ControlNet for the layout-to-image task and enable it to use localized descriptions.
We develop a novel cross-attention manipulation method to maintain image quality while improving control.
arXiv Detail & Related papers (2024-02-20T22:15:13Z)
- LIME: Localized Image Editing via Attention Regularization in Diffusion Models [74.3811832586391]
This paper introduces LIME, a localized image editing method for diffusion models that requires neither user-specified regions of interest (RoI) nor additional text input.
Our method employs features from pre-trained models and a simple clustering technique to obtain precise semantic segmentation maps.
We propose a novel cross-attention regularization technique that penalizes unrelated cross-attention scores in the RoI during the denoising steps, ensuring localized edits; a sketch follows below.
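A hedged sketch of that idea: within the RoI, pre-softmax attention logits of tokens unrelated to the edit are pushed down so that related tokens dominate after the softmax. The shapes, the additive penalty, and `unrelated_idx` are assumptions, not LIME's exact formulation.

```python
import torch

def regularize_cross_attention(scores, roi_mask, unrelated_idx, penalty=10.0):
    """scores: (heads, pixels, tokens) pre-softmax cross-attention logits.
    roi_mask: (pixels,) bool mask of pixels inside the edit region.
    unrelated_idx: LongTensor of token positions unrelated to the edit.
    """
    scores = scores.clone()
    pix = roi_mask.nonzero(as_tuple=True)[0]  # RoI pixel indices
    # Penalize unrelated tokens at RoI pixels only (broadcast over heads).
    scores[:, pix[:, None], unrelated_idx[None, :]] -= penalty
    return torch.softmax(scores, dim=-1)
```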
arXiv Detail & Related papers (2023-12-14T18:59:59Z)
- Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [82.19740045010435]
We introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls and global controls.
Unlike existing methods, Uni-ControlNet requires fine-tuning only two additional adapters on top of frozen pre-trained text-to-image diffusion models (see the sketch below).
Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability.
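A generic sketch of that adapter recipe with placeholder modules (not the paper's actual architecture): the backbone stays frozen and only the two adapters' parameters are handed to the optimizer.

```python
import torch.nn as nn

def trainable_adapter_params(backbone: nn.Module,
                             local_adapter: nn.Module,
                             global_adapter: nn.Module):
    """Freeze the diffusion backbone; only the two adapters get gradients."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    # Pass exactly these parameters to the optimizer.
    return list(local_adapter.parameters()) + list(global_adapter.parameters())
```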
arXiv Detail & Related papers (2023-05-25T17:59:58Z)
- Region-Aware Diffusion for Zero-shot Text-driven Image Editing [78.58917623854079]
We propose a novel region-aware diffusion model (RDM) for entity-level image editing.
To strike a balance between image fidelity and inference speed, we design an intensive diffusion pipeline.
The results show that RDM outperforms previous approaches in terms of visual quality, overall harmonization, non-editing region content preservation, and text-image semantic consistency.
arXiv Detail & Related papers (2023-02-23T06:20:29Z)
- LC-NeRF: Local Controllable Face Generation in Neural Radiance Field [55.54131820411912]
LC-NeRF is composed of a Local Region Generators Module and a Spatial-Aware Fusion Module.
Our method provides better local editing than state-of-the-art face editing methods.
Our method also performs well in downstream tasks, such as text-driven facial image editing.
arXiv Detail & Related papers (2023-02-19T05:50:08Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- MinMaxCAM: Improving Object Coverage for CAM-based Weakly Supervised Object Localization [46.36600006968488]
We propose two representation regularization mechanisms for weakly supervised object localization.
Full Region Regularization tries to maximize the coverage of the localization map inside the object region, while Common Region Regularization minimizes the activations occurring in background regions; a loss sketch follows below.
We evaluate the two regularizations on the ImageNet, CUB-200-2011, and OpenImages-segmentation datasets and show that they tackle both problems, outperforming the state of the art by a significant margin.
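A loose sketch of the two regularizers as stated, assuming some estimate of the object region is available (the paper derives its regions under weak supervision, without pixel-level labels); the exact MinMaxCAM formulation differs.

```python
import torch

def region_regularizers(cam, object_mask, eps=1e-6):
    """cam: (B, H, W) localization maps in [0, 1]; object_mask: (B, H, W) binary."""
    area_in = object_mask.sum(dim=(1, 2)) + eps
    area_out = (1 - object_mask).sum(dim=(1, 2)) + eps
    coverage = (cam * object_mask).sum(dim=(1, 2)) / area_in
    background = (cam * (1 - object_mask)).sum(dim=(1, 2)) / area_out
    full_region = (1 - coverage).mean()  # minimize -> maximize inside coverage
    common_region = background.mean()    # minimize background activations
    return full_region, common_region
```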
arXiv Detail & Related papers (2021-04-29T14:39:53Z)
- Style Intervention: How to Achieve Spatial Disentanglement with Style-based Generators? [100.60938767993088]
We propose a lightweight optimization-based algorithm that adapts to arbitrary input images and renders natural translation effects under flexible objectives.
We verify the performance of the proposed framework on facial attribute editing of high-resolution images, where both photo-realism and consistency are required.
arXiv Detail & Related papers (2020-11-19T07:37:31Z)