Local Conditional Controlling for Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2312.08768v2
- Date: Tue, 6 Feb 2024 14:45:39 GMT
- Title: Local Conditional Controlling for Text-to-Image Diffusion Models
- Authors: Yibo Zhao, Liang Peng, Yang Yang, Zekai Luo, Hengjia Li, Yao Chen, Wei
Zhao, Qinglin Lu, Boxi Wu, Wei Liu
- Abstract summary: Diffusion models have exhibited impressive prowess in the text-to-image task.
Recent methods add image-level controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts to obtain desired images.
We introduce a new simple yet practical task setting: local control.
- Score: 22.732346931679555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have exhibited impressive prowess in the text-to-image task.
Recent methods add image-level controls, e.g., edge and depth maps, to
manipulate the generation process together with text prompts to obtain desired
images. This controlling process operates globally on the entire image, which
limits the flexibility of control regions. In this paper, we introduce a new,
simple yet practical task setting: local control. It focuses on controlling
specific local areas according to user-defined image conditions, while the
remaining areas are conditioned only on the original text prompt. This allows
users to flexibly control image generation in a fine-grained way. However,
achieving this goal is non-trivial: naively adding local conditions can lead to
a local control dominance problem, in which the controlled region dominates
generation and prompt-described concepts fail to appear in the non-control
areas. To mitigate this problem, we propose a training-free method that
leverages updates of the noised latents and of the cross-attention maps during
the denoising process to promote concept generation in non-control areas.
Moreover, we use feature mask constraints to mitigate the degradation of
synthesized image quality caused by information differences inside and outside
the local control area. Extensive experiments demonstrate that our method
synthesizes high-quality images that adhere to the prompt under local control
conditions. Code is
available at https://github.com/YibooZhao/Local-Control.
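Since the abstract describes the local control setting only in words, a minimal sketch may help make it concrete. Below, `unet` and `controlnet` are hypothetical callables standing in for a frozen pre-trained denoiser and its control branch, and `region_mask` is the user-defined binary mask; the sketch only illustrates restricting an image-level condition (e.g., an edge map) to a local region, not the authors' training-free latent and cross-attention update scheme.

```python
import torch.nn.functional as F

def masked_local_control_step(latents, t, text_emb, unet, controlnet,
                              control_image, region_mask):
    """One denoising step in which an image-level condition is applied only
    inside a user-defined region, while the rest of the image is conditioned
    by the text prompt alone.

    `unet` and `controlnet` are hypothetical frozen callables; `region_mask`
    is a (B, 1, H, W) binary mask over the latent grid. This is an
    illustration of the task setting, not the paper's method.
    """
    # The control branch predicts multi-resolution residuals from the local
    # condition (e.g., an edge or depth map).
    residuals = controlnet(latents, t, text_emb, control_image)

    # Zero the control signal outside the user-defined region, resizing the
    # mask to each residual's spatial resolution.
    local_residuals = [
        r * F.interpolate(region_mask, size=r.shape[-2:], mode="nearest")
        for r in residuals
    ]

    # Noise prediction with locally masked control vs. text-only prediction.
    eps_local = unet(latents, t, text_emb, residuals=local_residuals)
    eps_text = unet(latents, t, text_emb, residuals=None)

    # Blend the two predictions so only the masked area follows the condition.
    m = F.interpolate(region_mask, size=latents.shape[-2:], mode="nearest")
    return m * eps_local + (1.0 - m) * eps_text
```

In the paper's actual method, the non-control area is additionally steered by updating the noised latents and cross-attention maps so that prompt concepts still appear there.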
Related papers
- AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation, but struggles with fine-grained control over image generation.
AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process.
This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
arXiv Detail & Related papers (2024-06-27T07:40:59Z)
- DreamWalk: Style Space Exploration using Diffusion Guidance [19.065568106372222]
Unlike direct-editing tools like Photoshop, text-conditioned models require the artist to perform "prompt engineering".
Our goal is to provide fine-grained control over the style and substance specified by the prompt.
The method can be used in conjunction with LoRA- or DreamBooth-trained models.
arXiv Detail & Related papers (2024-04-04T01:39:01Z)
- Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control [20.533597112330018]
We show the limitations of ControlNet for the layout-to-image task and enable it to use localized descriptions.
We develop a novel cross-attention manipulation method in order to maintain image quality while improving control.
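This summary does not spell out the manipulation itself; one common way to realize such localized control is to mask where a description's tokens may attend. The sketch below is a generic illustration under assumed tensor shapes, not the paper's exact method.

```python
import torch

def localized_cross_attention(attn_logits, token_region_masks):
    """Restrict chosen text tokens to attend only inside their layout regions.

    attn_logits: (heads, hw, n_tokens) unnormalized cross-attention scores,
        where hw is the number of flattened spatial positions.
    token_region_masks: dict mapping a token index to a (hw,) binary mask
        that is 1 where the token's localized description should apply.

    Generic masked cross-attention; the paper's manipulation may differ.
    """
    logits = attn_logits.clone()
    neg_inf = torch.finfo(logits.dtype).min
    for tok, mask in token_region_masks.items():
        # Spatial positions outside the region cannot attend to this token.
        logits[:, :, tok] = logits[:, :, tok].masked_fill(mask == 0, neg_inf)
    return logits.softmax(dim=-1)
```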
arXiv Detail & Related papers (2024-02-20T22:15:13Z)
- ZONE: Zero-Shot Instruction-Guided Local Editing [56.56213730578504]
We propose a Zero-shot instructiON-guided local image Editing approach, termed ZONE.
We first convert the editing intent from the user-provided instruction into specific image editing regions through InstructPix2Pix.
We then propose a Region-IoU scheme for precise image layer extraction from an off-the-shelf segment model.
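As a rough illustration of what a Region-IoU style selection could look like (the exact scheme is not given in this summary), the sketch below matches a rough edit region against candidate masks from a segmentation model and keeps the best-overlapping one; the mask shapes and threshold are assumptions.

```python
import numpy as np

def select_segment_by_region_iou(edit_region, candidate_masks, threshold=0.5):
    """Pick the segmentation mask that best overlaps a rough edit region.

    edit_region: (H, W) binary array marking where the instruction-guided
        edit changed the image.
    candidate_masks: iterable of (H, W) binary masks from an off-the-shelf
        segmentation model.
    Illustrative only; ZONE's actual Region-IoU scheme may differ.
    """
    best_mask, best_iou = None, 0.0
    for mask in candidate_masks:
        inter = np.logical_and(edit_region, mask).sum()
        union = np.logical_or(edit_region, mask).sum()
        iou = inter / union if union > 0 else 0.0
        if iou > best_iou:
            best_mask, best_iou = mask, iou
    # Only accept the segment if the overlap is convincing enough.
    return (best_mask, best_iou) if best_iou >= threshold else (None, best_iou)
```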
arXiv Detail & Related papers (2023-12-28T02:54:34Z)
- ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models [21.379896810560282]
A popular approach is to use a controlling network, such as ControlNet, in combination with a pre-trained image generation model, such as Stable Diffusion.
In this work, we propose a new controlling architecture, called ControlNet-XS, which does not suffer from this problem.
In contrast to ControlNet, our model needs only a fraction of the parameters and is therefore about twice as fast at both inference and training time.
arXiv Detail & Related papers (2023-12-11T17:58:06Z)
- Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z)
- Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [82.19740045010435]
We introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls and global controls.
Unlike existing methods, Uni-ControlNet requires fine-tuning only two additional adapters on top of frozen pre-trained text-to-image diffusion models.
Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability.
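The adapter-only training recipe can be illustrated with a short sketch: the base model is frozen and only two small adapter modules receive gradients. The module shapes below are placeholders, not Uni-ControlNet's actual architecture.

```python
import torch
from torch import nn

# Placeholder stand-ins: a frozen pre-trained denoiser plus two small
# trainable adapters, mirroring the "two additional adapters" described above.
base_unet = nn.Sequential(nn.Conv2d(4, 64, 3, padding=1),
                          nn.Conv2d(64, 4, 3, padding=1))
local_adapter = nn.Conv2d(4, 4, 3, padding=1)   # e.g. for local controls
global_adapter = nn.Linear(512, 512)            # e.g. for a global embedding

# Freeze the pre-trained model; only the adapters are updated.
for p in base_unet.parameters():
    p.requires_grad_(False)

trainable = list(local_adapter.parameters()) + list(global_adapter.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```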
arXiv Detail & Related papers (2023-05-25T17:59:58Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: a pixel-based one and a latent-based one.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
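The "single target point" idea can be sketched with an off-the-shelf CLIP model: normalize the image and text embeddings and interpolate them. The checkpoint name, file path, and interpolation weight below are illustrative assumptions, and FlexIT's actual target construction may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("input.jpg")          # placeholder input image
instruction = "a photo of a husky"       # placeholder edit text

inputs = processor(text=[instruction], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# Normalize both embeddings, then interpolate to form a single target point
# in the CLIP multimodal space.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
alpha = 0.5                              # assumed image/text mixing weight
target = alpha * txt_emb + (1.0 - alpha) * img_emb
target = target / target.norm(dim=-1, keepdim=True)
```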
arXiv Detail & Related papers (2022-03-09T13:34:38Z)
- Attribute-specific Control Units in StyleGAN for Fine-grained Image Manipulation [57.99007520795998]
We discover attribute-specific control units, which consist of multiple channels of feature maps and modulation styles.
Specifically, we collaboratively manipulate the modulation style channels and feature maps in control units to obtain the semantic and spatial disentangled controls.
To manipulate these control units, we move the modulation style along a specific sparse direction vector and replace the filter-wise styles used to compute the feature maps.
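A minimal sketch of the "sparse direction" part of this manipulation (the channel indices, strength value, and names are assumptions; the full procedure also replaces the corresponding filter-wise styles):

```python
import torch

def edit_control_unit(style, direction, unit_channels, strength=3.0):
    """Move a modulation style along a direction that is sparsified to the
    channels of one attribute-specific control unit.

    style: (num_channels,) modulation style vector of one StyleGAN layer.
    direction: (num_channels,) editing direction.
    unit_channels: indices of the channels forming the control unit.
    Illustrative only; the paper's editing procedure is more involved.
    """
    sparse_dir = torch.zeros_like(direction)
    sparse_dir[unit_channels] = direction[unit_channels]
    return style + strength * sparse_dir
```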
arXiv Detail & Related papers (2021-11-25T10:42:10Z)