Related papers: FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

URL: http://arxiv.org/abs/2405.04834v2
Date: Wed, 22 May 2024 02:45:13 GMT
Title: FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation
Authors: Xuehai He, Jian Zheng, Jacob Zhiyuan Fang, Robinson Piramuthu, Mohit Bansal, Vicente Ordonez, Gunnar A Sigurdsson, Nanyun Peng, Xin Eric Wang,
Abstract summary: Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. We propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation.
Score: 99.4649330193233
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which allows for streamlined integration of various input types. This approach not only enhances the faithfulness of the generated image to the control, but also significantly reduces the computational overhead typically associated with multimodal conditioning. Our approach achieves a reduction of 41% in trainable parameters and 30% in memory usage compared with Uni-ControlNet. Moreover, it doubles data efficiency and can flexibly generate images under the guidance of multiple input conditions of various modalities.

Related papers

OminiControl2: Efficient Conditioning for Diffusion Transformers [68.3243031301164]
We present OminiControl2, an efficient framework that achieves efficient image-conditional image generation. OminiControl2 introduces two key innovations: (1) a dynamic compression strategy that streamlines conditional inputs by preserving only the most semantically relevant tokens during generation, and (2) a conditional feature reuse mechanism that computes condition token features only once and reuses them across denoising steps.
arXiv Detail & Related papers (2025-03-11T10:50:14Z)
FlexControl: Computation-Aware ControlNet with Differentiable Router for Text-to-Image Generation [10.675687253961595]
ControlNet offers a powerful way to guide diffusion-based generative models. Most implementations rely on ad-hocs to choose which network blocks to control-an approach that varies unpredictably with different tasks. We propose FlexControl, a framework that copies all diffusion blocks during training and employs a trainable gating mechanism.
arXiv Detail & Related papers (2025-02-11T23:27:58Z)
DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation [63.63429658282696]
We propose DynamicControl, which supports dynamic combinations of diverse control signals. We show that DynamicControl is superior to existing methods in terms of controllability, generation quality and composability under various conditional controls.
arXiv Detail & Related papers (2024-12-04T11:54:57Z)
OminiControl: Minimal and Universal Control for Diffusion Transformer [68.3243031301164]
OminiControl is a framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone. OminiControl addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions.
arXiv Detail & Related papers (2024-11-22T17:55:15Z)
ControlNeXt: Powerful and Efficient Control for Image and Video Generation [59.62289489036722]
We propose ControlNeXt: a powerful and efficient method for controllable image and video generation. We first design a more straightforward and efficient architecture, replacing heavy additional branches with minimal additional cost. As for training, we reduce up to 90% of learnable parameters compared to the alternatives.
arXiv Detail & Related papers (2024-08-12T11:41:18Z)
FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation [19.65838242227773]
This paper contributes a novel, concise, and efficient approach that adapts pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner. Our method allows flexible control over both guiding factor and guiding intensity of the reference image simply by tuning the type and bandwidth of the substituted frequency band.
arXiv Detail & Related papers (2024-08-02T04:13:38Z)
AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation, but struggles with fine-grained control over image generation. AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process. This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
arXiv Detail & Related papers (2024-06-27T07:40:59Z)
Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance [36.50036055679903]
Recent controllable generation approaches bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. This work presents Ctrl-X, a simple framework for T2I diffusion controlling structure and appearance without additional training or guidance.
arXiv Detail & Related papers (2024-06-11T17:59:01Z)
OmniControlNet: Dual-stage Integration for Conditional Image Generation [61.1432268643639]
We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method. Our proposed OmniControlNet consolidates 1) the condition generation by a single multi-tasking dense prediction algorithm under the task embedding guidance and 2) the image generation process for different conditioning types under the textual embedding guidance.
arXiv Detail & Related papers (2024-06-09T18:03:47Z)
Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [82.19740045010435]
We introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls and global controls. Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models. Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability.
arXiv Detail & Related papers (2023-05-25T17:59:58Z)
UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild [166.25327094261038]
We introduce UniControl, a new generative foundation model for controllable condition-to-image (C2I) tasks. UniControl consolidates a wide array of C2I tasks within a singular framework, while still allowing for arbitrary language prompts. trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities.
arXiv Detail & Related papers (2023-05-18T17:41:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.