Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion
- URL: http://arxiv.org/abs/2404.14768v1
- Date: Tue, 23 Apr 2024 06:10:43 GMT
- Title: Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion
- Authors: Hongyu Chen, Yiqi Gao, Min Zhou, Peng Wang, Xubin Li, Tiezheng Ge, Bo Zheng
- Abstract summary: We propose a training-free approach named Mask-guided Prompt Following (MGPF) to enhance prompt following with visual control.
The efficacy and superiority of MGPF are validated through comprehensive quantitative and qualitative experiments.
- Score: 27.61734719689046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, integrating visual controls into text-to-image (T2I) models, such as the ControlNet method, has received significant attention for its finer control capabilities. While various training-free methods strive to enhance prompt following in T2I models, the issue with visual control is still rarely studied, especially in scenarios where visual controls are misaligned with text prompts. In this paper, we address the challenge of "Prompt Following With Visual Control" and propose a training-free approach named Mask-guided Prompt Following (MGPF). Object masks are introduced to distinguish the aligned and misaligned parts of visual controls and prompts. Meanwhile, a network, dubbed Masked ControlNet, is designed to utilize these object masks for object generation in the misaligned visual control regions. Further, to improve attribute matching, a simple yet efficient loss is designed to align the attention maps of attributes with the object regions constrained by ControlNet and object masks. The efficacy and superiority of MGPF are validated through comprehensive quantitative and qualitative experiments.
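The abstract's attribute-matching loss, which aligns attribute attention maps with mask-constrained object regions, can be illustrated with a minimal sketch. This is a hypothetical reading of the loss, not the paper's implementation: it simply penalizes the fraction of each attribute token's cross-attention mass that falls outside its object mask.

```python
import torch

def attribute_alignment_loss(attn_maps, object_masks):
    """Hypothetical sketch of an MGPF-style attribute alignment loss.

    attn_maps:    (num_tokens, H, W) cross-attention maps, one per
                  attribute token, with non-negative entries.
    object_masks: (num_tokens, H, W) binary masks, 1 inside the object
                  region associated with each attribute token.

    Returns a scalar loss that is 0 when every token's attention mass
    lies entirely inside its object mask.
    """
    # Attention mass that lands inside each token's object region.
    inside = (attn_maps * object_masks).sum(dim=(1, 2))
    # Total attention mass per token (guard against division by zero).
    total = attn_maps.sum(dim=(1, 2)).clamp_min(1e-8)
    # Penalize mass that leaks outside the mask, averaged over tokens.
    return (1.0 - inside / total).mean()
```

In a training-free setting, the gradient of such a loss with respect to the latents can be used to nudge the sampling trajectory at each denoising step, which is the general pattern for attention-guided diffusion methods.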
Related papers
- LLM-guided Instance-level Image Manipulation with Diffusion U-Net Cross-Attention Maps [5.836227628651603]
We propose a pipeline leveraging Large Language Models, open-vocabulary detectors, cross-attention maps, and the diffusion U-Net for instance-level image manipulation.
Our method detects objects that are mentioned in the prompt and present in the generated image, enabling precise manipulation without extensive training or input masks.
arXiv Detail & Related papers (2025-01-23T19:26:14Z) - ControlFace: Harnessing Facial Parametric Control for Face Rigging [31.765503860508378]
We introduce ControlFace, a novel face rigging method conditioned on 3DMM renderings that enables flexible, high-fidelity control.
We employ dual-branch U-Nets: one, referred to as FaceNet, captures identity and fine details, while the other focuses on generation.
By training on a facial video dataset, we fully utilize FaceNet's rich representations while ensuring control adherence.
arXiv Detail & Related papers (2024-12-02T06:00:27Z) - ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
arXiv Detail & Related papers (2024-07-17T22:04:00Z) - SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions [59.53867290769282]
We present a novel T2I generation method, dubbed SmartControl, that modifies rough visual conditions to adapt them to the text prompt.
The key idea of SmartControl is to relax the visual condition in the areas that conflict with the text prompt.
Experiments on four typical visual condition types clearly show the efficacy of SmartControl against state-of-the-art methods.
arXiv Detail & Related papers (2024-04-09T16:53:43Z) - When ControlNet Meets Inexplicit Masks: A Case Study of ControlNet on its Contour-following Ability [93.15085958220024]
ControlNet excels at creating content that closely matches precise contours in user-provided masks.
When these masks contain noise, as frequently happens with non-expert users, the output can include unwanted artifacts.
Through in-depth analysis, this paper first highlights the crucial role of controlling the impact of these inexplicit masks across diverse deterioration levels.
An advanced Shape-aware ControlNet consisting of a deterioration estimator and a shape-prior modulation block is devised.
arXiv Detail & Related papers (2024-03-01T11:45:29Z) - Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control [20.533597112330018]
We show the limitations of ControlNet for the layout-to-image task and enable it to use localized descriptions.
We develop a novel cross-attention manipulation method in order to maintain image quality while improving control.
arXiv Detail & Related papers (2024-02-20T22:15:13Z) - FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection [28.65209293141492]
FineControlNet provides fine control over each instance's appearance while maintaining the precise pose control capability.
FineControlNet achieves superior performance in generating images that follow the user-provided instance-specific text prompts and poses.
arXiv Detail & Related papers (2023-12-14T18:59:43Z) - Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z) - UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild [166.25327094261038]
We introduce UniControl, a new generative foundation model for controllable condition-to-image (C2I) tasks.
UniControl consolidates a wide array of C2I tasks within a singular framework, while still allowing for arbitrary language prompts.
Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities.
arXiv Detail & Related papers (2023-05-18T17:41:34Z) - Self-Supervised Visual Representations Learning by Contrastive Mask Prediction [129.25459808288025]
We propose a novel contrastive mask prediction (CMP) task for visual representation learning.
MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify the positive sample without any assumptions.
We evaluate MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2.
arXiv Detail & Related papers (2021-08-18T02:50:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.