ControlNet-XS: Designing an Efficient and Effective Architecture for
Controlling Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2312.06573v1
- Date: Mon, 11 Dec 2023 17:58:06 GMT
- Title: ControlNet-XS: Designing an Efficient and Effective Architecture for
Controlling Text-to-Image Diffusion Models
- Authors: Denis Zavadski, Johann-Friedrich Feiden, Carsten Rother
- Abstract summary: A popular approach is to use a controlling network, such as ControlNet, in combination with a pre-trained image generation model, such as Stable Diffusion.
In this work we propose a new controlling architecture, called ControlNet-XS, which does not suffer from the delay in information flow between the generation and controlling processes that affects existing designs.
In contrast to ControlNet, our model needs only a fraction of the parameters, and is hence about twice as fast at both inference and training time.
- Score: 21.379896810560282
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The field of image synthesis has made tremendous strides forward in recent
years. Besides defining the desired output image with text prompts, an
intuitive approach is to additionally use spatial guidance in the form of an
image, such as a depth map. For this, a recent and highly popular approach is
to use a controlling network, such as ControlNet, in combination with a
pre-trained image generation model, such as Stable Diffusion. When evaluating
the design of existing controlling networks, we observe that they all suffer
from the same problem: a delay in the information flowing between the
generation and controlling processes. This, in turn, means that the
controlling network must have generative capabilities of its own. In this
work we propose a new controlling architecture, called ControlNet-XS, which
does not suffer from this problem and can hence focus on the given task of
learning to control. In contrast to ControlNet, our model needs only a
fraction of the parameters, and is hence about twice as fast at both
inference and training time. Furthermore, the generated images are of higher
quality and the control is of higher fidelity. All code and pre-trained
models will be made publicly available.
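To make the architectural point concrete, here is a minimal PyTorch sketch, my own illustration rather than the authors' code: the block structure, the ZeroConv coupling, and all names are assumptions. It contrasts a ControlNet-style branch, which computes its corrections without ever seeing the generation features, with a ControlNet-XS-style design in which both networks exchange features at every block, so corrections arrive without delay and the control branch can stay small.

```python
# Minimal sketch of the two information flows (illustrative; not the authors' code).
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy stand-in for one U-Net encoder stage."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.SiLU())

    def forward(self, x):
        return self.net(x)

class ZeroConv(nn.Conv2d):
    """1x1 convolution initialized to zero, so control starts as a no-op."""
    def __init__(self, dim):
        super().__init__(dim, dim, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

def controlnet_style(gen_blocks, ctrl_blocks, zero_convs, x, hint):
    """ControlNet-like flow: the control branch runs on its own and only
    injects corrections afterwards; it never sees the generation features.
    This is the information-flow delay the abstract refers to."""
    c, residuals = hint, []
    for blk, zc in zip(ctrl_blocks, zero_convs):
        c = blk(c)
        residuals.append(zc(c))
    h = x
    for blk, r in zip(gen_blocks, residuals):
        h = blk(h) + r
    return h

def controlnet_xs_style(gen_blocks, ctrl_blocks, gen_to_ctrl, ctrl_to_gen, x, hint):
    """ControlNet-XS-like flow: generation and control exchange features at
    every block, so corrections react to the current generation state and
    the control branch needs no generative capacity of its own."""
    h, c = x, hint
    for gblk, cblk, to_c, to_g in zip(gen_blocks, ctrl_blocks, gen_to_ctrl, ctrl_to_gen):
        c = cblk(c + to_c(h))   # control sees the current generation features
        h = gblk(h) + to_g(c)   # generation is corrected immediately
    return h

# Toy usage with three stages and an 8-channel feature space.
dim, n = 8, 3
gen = [Block(dim) for _ in range(n)]
ctrl = [Block(dim) for _ in range(n)]
x, hint = torch.randn(1, dim, 16, 16), torch.randn(1, dim, 16, 16)
y_delayed = controlnet_style(gen, ctrl, [ZeroConv(dim) for _ in range(n)], x, hint)
y_zero_delay = controlnet_xs_style(gen, ctrl,
                                   [ZeroConv(dim) for _ in range(n)],
                                   [ZeroConv(dim) for _ in range(n)], x, hint)
```

Because all coupling convolutions start at zero, both variants initially reproduce the uncontrolled generator; the difference is purely in when the control branch can react to the generation state, which is what allows the XS variant to stay small.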
Related papers
- PerlDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models [55.080748327139176]
PerlDiff is a method for effective street view image generation that fully leverages perspective 3D geometric information.
Our results show that PerlDiff markedly enhances the precision of generation on the NuScenes and KITTI datasets.
arXiv Detail & Related papers (2024-07-08T16:46:47Z)
- AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation, but struggles with fine-grained control over image generation.
AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process.
This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
arXiv Detail & Related papers (2024-06-27T07:40:59Z)
- OmniControlNet: Dual-stage Integration for Conditional Image Generation [61.1432268643639]
We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method.
Our proposed OmniControlNet consolidates 1) the condition generation by a single multi-tasking dense prediction algorithm under the task embedding guidance and 2) the image generation process for different conditioning types under the textual embedding guidance.
arXiv Detail & Related papers (2024-06-09T18:03:47Z)
- Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model [62.51232333352754]
Ctrl-Adapter adds diverse controls to any image/video diffusion model through the adaptation of pretrained ControlNets.
With six diverse U-Net/DiT-based image/video diffusion models, Ctrl-Adapter matches the performance of pretrained ControlNets on COCO.
arXiv Detail & Related papers (2024-04-15T17:45:36Z)
- Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control [20.533597112330018]
We show the limitations of ControlNet for the layout-to-image task and enable it to use localized descriptions.
We develop a novel cross-attention manipulation method to maintain image quality while improving control (see the sketch after this entry).
arXiv Detail & Related papers (2024-02-20T22:15:13Z)
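The summary does not spell out the manipulation itself, so the following is a generic sketch of the family of techniques involved, an assumption on my part rather than the paper's method: masking cross-attention scores so that the tokens of a localized description can only influence pixels inside their assigned layout region.

```python
# Generic region-masked cross-attention (illustrative; not the paper's method).
import torch

def masked_cross_attention(q, k, v, region_mask):
    """q: (pixels, d) image queries; k, v: (tokens, d) text keys/values.
    region_mask: (pixels, tokens) bool, True where a token may affect a pixel."""
    scores = q @ k.T / q.shape[-1] ** 0.5             # (pixels, tokens)
    scores = scores.masked_fill(~region_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v          # (pixels, d)

# Toy usage: 4 pixels, 3 text tokens; token 2 is confined to the first 2 pixels.
q, k, v = torch.randn(4, 8), torch.randn(3, 8), torch.randn(3, 8)
mask = torch.ones(4, 3, dtype=torch.bool)
mask[2:, 2] = False
out = masked_cross_attention(q, k, v, mask)           # shape (4, 8)
```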
- FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection [28.65209293141492]
FineControlNet provides fine control over each instance's appearance while maintaining the precise pose control capability.
FineControlNet achieves superior performance in generating images that follow the user-provided instance-specific text prompts and poses.
arXiv Detail & Related papers (2023-12-14T18:59:43Z) - Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [82.19740045010435]
We introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls and global controls.
Unlike existing methods, Uni-ControlNet requires fine-tuning only two additional adapters on top of frozen pre-trained text-to-image diffusion models (see the sketch after this entry).
Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability.
arXiv Detail & Related papers (2023-05-25T17:59:58Z)
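A minimal sketch of the training regime this summary describes, with toy stand-in modules; the adapter shapes and names are assumptions, and only the freeze-the-backbone, train-two-adapters pattern comes from the summary.

```python
# Frozen backbone, two trainable adapters (stand-in modules, illustrative only).
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(4, 4, 3, padding=1))  # stand-in for the diffusion U-Net
local_adapter = nn.Conv2d(4, 4, 1)   # stand-in: local controls (edges, depth, pose, ...)
global_adapter = nn.Linear(16, 4)    # stand-in: global controls (e.g. image embeddings)

for p in backbone.parameters():      # the pre-trained model is never updated
    p.requires_grad_(False)

trainable = list(local_adapter.parameters()) + list(global_adapter.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # gradients flow only into the adapters
```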
- UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild [166.25327094261038]
We introduce UniControl, a new generative foundation model for controllable condition-to-image (C2I) tasks.
UniControl consolidates a wide array of C2I tasks within a singular framework, while still allowing for arbitrary language prompts.
Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities.
arXiv Detail & Related papers (2023-05-18T17:41:34Z)
- Adding Conditional Control to Text-to-Image Diffusion Models [37.98427255384245]
We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models.
ControlNet locks the production-ready large diffusion models and reuses their deep and robust encoding layers, pretrained with billions of images, as a strong backbone for learning a diverse set of conditional controls (a sketch of this locking scheme follows this entry).
arXiv Detail & Related papers (2023-02-10T23:12:37Z)
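A minimal sketch of the locking scheme just described, with stand-in module shapes; the pattern of freezing the pretrained encoder, training a copy of it, and coupling the two with zero-initialized convolutions follows the original ControlNet design, but this is not the official implementation.

```python
# Locked backbone + trainable copy, coupled by zero-initialized convolutions
# (illustrative stand-in modules; not the official implementation).
import copy
import torch.nn as nn

pretrained_encoder = nn.Sequential(          # stand-in for the pre-trained encoder
    nn.Conv2d(4, 8, 3, padding=1), nn.SiLU(), nn.Conv2d(8, 8, 3, padding=1)
)
for p in pretrained_encoder.parameters():
    p.requires_grad_(False)                  # "locked": backbone weights never change

control_encoder = copy.deepcopy(pretrained_encoder)  # trainable copy reusing the backbone
for p in control_encoder.parameters():
    p.requires_grad_(True)

zero_conv = nn.Conv2d(8, 8, kernel_size=1)   # coupling layer, initialized to zero
nn.init.zeros_(zero_conv.weight)
nn.init.zeros_(zero_conv.bias)
# At step 0 the control residual is exactly zero, so the locked model's output is
# preserved while the copy gradually learns the conditional control task.
```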
- Towards a Neural Graphics Pipeline for Controllable Image Generation [96.11791992084551]
We present Neural Graphics Pipeline (NGP), a hybrid generative model that brings together neural and traditional image formation models.
NGP decomposes the image into a set of interpretable appearance feature maps, uncovering direct control handles for controllable image generation.
We demonstrate the effectiveness of our approach on controllable image generation of single-object scenes.
arXiv Detail & Related papers (2020-06-18T14:22:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.