ControlNet-XS: Designing an Efficient and Effective Architecture for
Controlling Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2312.06573v1
- Date: Mon, 11 Dec 2023 17:58:06 GMT
- Title: ControlNet-XS: Designing an Efficient and Effective Architecture for
Controlling Text-to-Image Diffusion Models
- Authors: Denis Zavadski, Johann-Friedrich Feiden, Carsten Rother
- Abstract summary: A popular approach is to use a controlling network, such as ControlNet, in combination with a pre-trained image generation model, such as Stable Diffusion.
In this work we propose a new controlling architecture, called ControlNet-XS, which does not suffer from the delay in information flow between the generation and controlling processes that affects existing designs.
In contrast to ControlNet, our model needs only a fraction of the parameters, and is hence about twice as fast at both inference and training time.
- Score: 21.379896810560282
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The field of image synthesis has made tremendous strides forward in recent
years. Besides defining the desired output image with text prompts, an
intuitive approach is to additionally use spatial guidance in the form of an
image, such as a depth map. For this, a recent and highly popular approach is
to use a controlling network, such as ControlNet, in combination with a
pre-trained image generation model, such as Stable Diffusion. When evaluating
the design of existing controlling networks, we observe that they all suffer
from the same problem: a delay in the information flowing between the
generation and controlling processes. This, in turn, means that the
controlling network must have generative capabilities of its own. In this
work we propose a new controlling architecture, called ControlNet-XS, which
does not suffer from this problem and can hence focus on the given task of
learning to control. In contrast to ControlNet, our model needs only a
fraction of the parameters, and is hence about twice as fast at both
inference and training time. Furthermore, the generated images are of higher
quality and the control is of higher fidelity. All code and pre-trained
models will be made publicly available.
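To make the architectural point concrete, here is a minimal PyTorch sketch, my own illustration rather than the authors' code: the block structure, the ZeroConv coupling, and all names are assumptions. It contrasts a ControlNet-style branch, which computes its corrections without ever seeing the generation features, with a ControlNet-XS-style design in which both networks exchange features at every block, so corrections arrive without delay and the control branch can stay small.

```python
# Minimal sketch of the two information flows (illustrative; not the authors' code).
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy stand-in for one U-Net encoder stage."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.SiLU())

    def forward(self, x):
        return self.net(x)

class ZeroConv(nn.Conv2d):
    """1x1 convolution initialized to zero, so control starts as a no-op."""
    def __init__(self, dim):
        super().__init__(dim, dim, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

def controlnet_style(gen_blocks, ctrl_blocks, zero_convs, x, hint):
    """ControlNet-like flow: the control branch runs on its own and only
    injects corrections afterwards; it never sees the generation features.
    This is the information-flow delay the abstract refers to."""
    c, residuals = hint, []
    for blk, zc in zip(ctrl_blocks, zero_convs):
        c = blk(c)
        residuals.append(zc(c))
    h = x
    for blk, r in zip(gen_blocks, residuals):
        h = blk(h) + r
    return h

def controlnet_xs_style(gen_blocks, ctrl_blocks, gen_to_ctrl, ctrl_to_gen, x, hint):
    """ControlNet-XS-like flow: generation and control exchange features at
    every block, so corrections react to the current generation state and
    the control branch needs no generative capacity of its own."""
    h, c = x, hint
    for gblk, cblk, to_c, to_g in zip(gen_blocks, ctrl_blocks, gen_to_ctrl, ctrl_to_gen):
        c = cblk(c + to_c(h))   # control sees the current generation features
        h = gblk(h) + to_g(c)   # generation is corrected immediately
    return h

# Toy usage with three stages and an 8-channel feature space.
dim, n = 8, 3
gen = [Block(dim) for _ in range(n)]
ctrl = [Block(dim) for _ in range(n)]
x, hint = torch.randn(1, dim, 16, 16), torch.randn(1, dim, 16, 16)
y_delayed = controlnet_style(gen, ctrl, [ZeroConv(dim) for _ in range(n)], x, hint)
y_zero_delay = controlnet_xs_style(gen, ctrl,
                                   [ZeroConv(dim) for _ in range(n)],
                                   [ZeroConv(dim) for _ in range(n)], x, hint)
```

Because all coupling convolutions start at zero, both variants initially reproduce the uncontrolled generator; the difference is purely in when the control branch can react to the generation state, which is what allows the XS variant to stay small.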
Related papers
- PerlDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models [55.080748327139176]
PerlDiff is a method for effective street view image generation that fully leverages perspective 3D geometric information.
Our results show that PerlDiff markedly enhances the precision of generation on the NuScenes and KITTI datasets.
arXiv Detail & Related papers (2024-07-08T16:46:47Z)
- AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation, but struggles with fine-grained control over image generation.
AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process.
This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
arXiv Detail & Related papers (2024-06-27T07:40:59Z)
- OmniControlNet: Dual-stage Integration for Conditional Image Generation [61.1432268643639]
We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method.
Our proposed OmniControlNet consolidates 1) the condition generation by a single multi-tasking dense prediction algorithm under the task embedding guidance and 2) the image generation process for different conditioning types under the textual embedding guidance.
arXiv Detail & Related papers (2024-06-09T18:03:47Z)
- Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model [62.51232333352754]
Ctrl-Adapter adds diverse controls to any image/video diffusion model through the adaptation of pretrained ControlNets.
With six diverse U-Net/DiT-based image/video diffusion models, Ctrl-Adapter matches the performance of pretrained ControlNets on COCO.
arXiv Detail & Related papers (2024-04-15T17:45:36Z)
- Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control [20.533597112330018]
We show the limitations of ControlNet for the layout-to-image task and enable it to use localized descriptions.
We develop a novel cross-attention manipulation method to maintain image quality while improving control (see the sketch after this entry).
arXiv Detail & Related papers (2024-02-20T22:15:13Z)
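The summary does not spell out the manipulation itself, so the following is a generic sketch of the family of techniques involved, an assumption on my part rather than the paper's method: masking cross-attention scores so that the tokens of a localized description can only influence pixels inside their assigned layout region.

```python
# Generic region-masked cross-attention (illustrative; not the paper's method).
import torch

def masked_cross_attention(q, k, v, region_mask):
    """q: (pixels, d) image queries; k, v: (tokens, d) text keys/values.
    region_mask: (pixels, tokens) bool, True where a token may affect a pixel."""
    scores = q @ k.T / q.shape[-1] ** 0.5             # (pixels, tokens)
    scores = scores.masked_fill(~region_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v          # (pixels, d)

# Toy usage: 4 pixels, 3 text tokens; token 2 is confined to the first 2 pixels.
q, k, v = torch.randn(4, 8), torch.randn(3, 8), torch.randn(3, 8)
mask = torch.ones(4, 3, dtype=torch.bool)
mask[2:, 2] = False
out = masked_cross_attention(q, k, v, mask)           # shape (4, 8)
```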
- FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection [28.65209293141492]
FineControlNet provides fine control over each instance's appearance while maintaining the precise pose control capability.
FineControlNet achieves superior performance in generating images that follow the user-provided instance-specific text prompts and poses.
arXiv Detail & Related papers (2023-12-14T18:59:43Z) - Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [82.19740045010435]
We introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls and global controls.
Unlike existing methods, Uni-ControlNet requires fine-tuning only two additional adapters on top of frozen pre-trained text-to-image diffusion models (see the sketch after this entry).
Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability.
arXiv Detail & Related papers (2023-05-25T17:59:58Z)
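A minimal sketch of the training regime this summary describes, with toy stand-in modules; the adapter shapes and names are assumptions, and only the freeze-the-backbone, train-two-adapters pattern comes from the summary.

```python
# Frozen backbone, two trainable adapters (stand-in modules, illustrative only).
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(4, 4, 3, padding=1))  # stand-in for the diffusion U-Net
local_adapter = nn.Conv2d(4, 4, 1)   # stand-in: local controls (edges, depth, pose, ...)
global_adapter = nn.Linear(16, 4)    # stand-in: global controls (e.g. image embeddings)

for p in backbone.parameters():      # the pre-trained model is never updated
    p.requires_grad_(False)

trainable = list(local_adapter.parameters()) + list(global_adapter.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # gradients flow only into the adapters
```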
- UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild [166.25327094261038]
We introduce UniControl, a new generative foundation model for controllable condition-to-image (C2I) tasks.
UniControl consolidates a wide array of C2I tasks within a singular framework, while still allowing for arbitrary language prompts.
Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities.
arXiv Detail & Related papers (2023-05-18T17:41:34Z)
- Adding Conditional Control to Text-to-Image Diffusion Models [37.98427255384245]
We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models.
ControlNet locks the production-ready large diffusion models and reuses their deep and robust encoding layers, pretrained with billions of images, as a strong backbone for learning a diverse set of conditional controls (a sketch of this locking scheme follows this entry).
arXiv Detail & Related papers (2023-02-10T23:12:37Z)
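A minimal sketch of the locking scheme just described, with stand-in module shapes; the pattern of freezing the pretrained encoder, training a copy of it, and coupling the two with zero-initialized convolutions follows the original ControlNet design, but this is not the official implementation.

```python
# Locked backbone + trainable copy, coupled by zero-initialized convolutions
# (illustrative stand-in modules; not the official implementation).
import copy
import torch.nn as nn

pretrained_encoder = nn.Sequential(          # stand-in for the pre-trained encoder
    nn.Conv2d(4, 8, 3, padding=1), nn.SiLU(), nn.Conv2d(8, 8, 3, padding=1)
)
for p in pretrained_encoder.parameters():
    p.requires_grad_(False)                  # "locked": backbone weights never change

control_encoder = copy.deepcopy(pretrained_encoder)  # trainable copy reusing the backbone
for p in control_encoder.parameters():
    p.requires_grad_(True)

zero_conv = nn.Conv2d(8, 8, kernel_size=1)   # coupling layer, initialized to zero
nn.init.zeros_(zero_conv.weight)
nn.init.zeros_(zero_conv.bias)
# At step 0 the control residual is exactly zero, so the locked model's output is
# preserved while the copy gradually learns the conditional control task.
```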
- Towards a Neural Graphics Pipeline for Controllable Image Generation [96.11791992084551]
We present Neural Graphics Pipeline (NGP), a hybrid generative model that brings together neural and traditional image formation models.
NGP decomposes the image into a set of interpretable appearance feature maps, uncovering direct control handles for controllable image generation.
We demonstrate the effectiveness of our approach on controllable image generation of single-object scenes.
arXiv Detail & Related papers (2020-06-18T14:22:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.