ECNet: Effective Controllable Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2403.18417v1
- Date: Wed, 27 Mar 2024 10:09:38 GMT
- Title: ECNet: Effective Controllable Text-to-Image Diffusion Models
- Authors: Sicheng Li, Keqiang Sun, Zhixin Lai, Xiaoshi Wu, Feng Qiu, Haoran Xie, Kazunori Miyata, Hongsheng Li
- Abstract summary: We introduce two innovative solutions for conditional text-to-image models.
Firstly, we propose a Spatial Guidance Injector (SGI), which enhances conditional detail by encoding text inputs with precise annotation information.
Secondly, to overcome the issue of limited conditional supervision, we introduce a Diffusion Consistency Loss (DCL).
This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output.
- Score: 31.21525123716149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The conditional text-to-image diffusion models have garnered significant attention in recent years. However, the precision of these models is often compromised, mainly for two reasons: ambiguous condition input and inadequate condition guidance from a single denoising loss. To address these challenges, we introduce two innovative solutions. Firstly, we propose a Spatial Guidance Injector (SGI), which enhances conditional detail by encoding text inputs with precise annotation information. This method directly tackles the issue of ambiguous control inputs by providing clear, annotated guidance to the model. Secondly, to overcome the issue of limited conditional supervision, we introduce Diffusion Consistency Loss (DCL), which applies supervision on the denoised latent code at any given time step. This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output. The combination of SGI and DCL results in our Effective Controllable Network (ECNet), which offers a more accurate controllable end-to-end text-to-image generation framework with a more precise conditioning input and stronger controllable supervision. We validate our approach through extensive experiments on generation under various conditions, such as human body skeletons, facial landmarks, and sketches of general objects. The results consistently demonstrate that our method significantly enhances the controllability and robustness of the generated images, outperforming existing state-of-the-art controllable text-to-image models.
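As a concrete illustration of the SGI idea, the abstract describes encoding precise annotation information (e.g. skeleton keypoints) alongside the text condition. The PyTorch sketch below is a minimal, hedged interpretation: annotation coordinates are projected into the text-embedding space and appended as extra tokens for cross-attention. The class name, shapes, and keypoint-based condition are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SpatialGuidanceInjector(nn.Module):
    """Sketch of an SGI-like module: fuse annotation points with text tokens."""

    def __init__(self, point_dim: int = 2, text_dim: int = 768):
        super().__init__()
        # Project each annotated point (e.g. a normalized (x, y) keypoint)
        # into the same space as the text token embeddings.
        self.point_encoder = nn.Sequential(
            nn.Linear(point_dim, text_dim),
            nn.SiLU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, text_emb: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, L, text_dim) prompt token embeddings
        # points:   (B, N, point_dim) annotation coordinates
        point_tokens = self.point_encoder(points)           # (B, N, text_dim)
        # Concatenate so the denoiser's cross-attention can attend to both
        # the prompt and the precise spatial condition.
        return torch.cat([text_emb, point_tokens], dim=1)

# Example (hypothetical shapes): 17 keypoints appended to 77 text tokens.
# sgi = SpatialGuidanceInjector()
# cond = sgi(torch.randn(1, 77, 768), torch.rand(1, 17, 2))  # -> (1, 94, 768)
```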
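Similarly, a hedged sketch of a DCL-style objective: recover an estimate of the clean latent from the noisy latent and the predicted noise via the standard DDPM relation, then supervise that estimate against the condition signal at an arbitrary timestep. The helper `extract_condition` (mapping a latent to the condition domain, e.g. keypoint heatmaps) is hypothetical, and the formulation below follows generic diffusion algebra rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def diffusion_consistency_loss(x_t, eps_hat, t, alphas_cumprod,
                               extract_condition, target_condition):
    # x_t:            (B, C, H, W) noisy latent at timestep t
    # eps_hat:        (B, C, H, W) noise predicted by the denoiser
    # t:              (B,) integer timesteps
    # alphas_cumprod: (T,) cumulative alphas from the noise schedule
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    # Standard DDPM estimate of the clean latent:
    # x0_hat = (x_t - sqrt(1 - a_bar) * eps_hat) / sqrt(a_bar)
    x0_hat = (x_t - torch.sqrt(1.0 - a_bar) * eps_hat) / torch.sqrt(a_bar)
    # Penalize disagreement between the denoised estimate and the input
    # condition, so supervision is applied at every timestep rather than
    # only through the usual noise-prediction loss.
    return F.mse_loss(extract_condition(x0_hat), target_condition)
```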
Related papers
- ControlVAR: Exploring Controllable Visual Autoregressive Modeling [48.66209303617063]
Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs).
Challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs.
arXiv Detail & Related papers (2024-06-14T06:35:33Z)
- OmniControlNet: Dual-stage Integration for Conditional Image Generation [61.1432268643639]
We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method.
Our proposed OmniControlNet consolidates 1) the condition generation by a single multi-tasking dense prediction algorithm under the task embedding guidance and 2) the image generation process for different conditioning types under the textual embedding guidance.
arXiv Detail & Related papers (2024-06-09T18:03:47Z)
- Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z)
- ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback [20.910939141948123]
ControlNet++ is a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls.
It achieves improvements over ControlNet by 11.1% mIoU, 13.4% SSIM, and 7.6% RMSE, respectively, for segmentation mask, line-art edge, and depth conditions.
arXiv Detail & Related papers (2024-04-11T17:59:09Z)
- Latent Diffusion Models for Attribute-Preserving Image Anonymization [4.080920304681247]
This paper presents the first approach to image anonymization based on Latent Diffusion Models (LDMs).
We propose two LDMs for this purpose: CAFLaGE-Base exploits a combination of pre-trained ControlNets and a new controlling mechanism designed to increase the distance between the real and anonymized images.
arXiv Detail & Related papers (2024-03-21T19:09:21Z)
- Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion [35.21106030549071]
Diffusion Probabilistic Models (DPMs) are a dominant force in text-to-image generation tasks.
We propose an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs).
By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment.
arXiv Detail & Related papers (2024-02-26T05:08:40Z)
- Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control [20.533597112330018]
We show the limitations of ControlNet for the layout-to-image task and enable it to use localized descriptions.
We develop a novel cross-attention manipulation method in order to maintain image quality while improving control.
arXiv Detail & Related papers (2024-02-20T22:15:13Z)
- Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, which reduces per-subject optimization effort for users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Semi-Supervised StyleGAN for Disentanglement Learning [79.01988132442064]
Current disentanglement methods face several inherent limitations.
We design new architectures and loss functions based on StyleGAN for semi-supervised high-resolution disentanglement learning.
arXiv Detail & Related papers (2020-03-06T22:54:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.