Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image
Generation
- URL: http://arxiv.org/abs/2306.00964v1
- Date: Thu, 1 Jun 2023 17:55:32 GMT
- Title: Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image
Generation
- Authors: Minghui Hu, Jianbin Zheng, Daqing Liu, Chuanxia Zheng, Chaoyue Wang,
Dacheng Tao, Tat-Jen Cham
- Abstract summary: Text-conditional diffusion models are able to generate high-fidelity images with diverse contents.
However, linguistic representations frequently exhibit ambiguous descriptions of the envisioned objective imagery.
We propose Cocktail, a pipeline to mix various modalities into one embedding.
- Score: 79.8881514424969
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-conditional diffusion models are able to generate high-fidelity images
with diverse contents. However, linguistic representations frequently exhibit
ambiguous descriptions of the envisioned objective imagery, requiring the
incorporation of additional control signals to bolster the efficacy of
text-guided diffusion models. In this work, we propose Cocktail, a pipeline to
mix various modalities into one embedding, amalgamated with a generalized
ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a
spatial guidance sampling method, to actualize multi-modal and
spatially-refined control for text-conditional diffusion models. Specifically,
we introduce a hyper-network gControlNet, dedicated to the alignment and
infusion of the control signals from disparate modalities into the pre-trained
diffusion model. gControlNet is capable of accepting flexible modality signals,
encompassing the simultaneous reception of any combination of modality signals,
or the supplementary fusion of multiple modality signals. The control signals
are then fused and injected into the backbone model according to our proposed
ControlNorm. Furthermore, our advanced spatial guidance sampling methodology
proficiently incorporates the control signal into the designated region,
thereby circumventing the manifestation of undesired objects within the
generated image. We demonstrate the results of our method in controlling
various modalities, achieving high-quality synthesis and fidelity to multiple
external signals.
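The abstract describes two mechanisms that a short sketch can make concrete: mixing control signals from several modalities into a single embedding, and injecting that embedding into a frozen backbone through a controllable normalisation layer. The PyTorch sketch below illustrates only the interface under stated assumptions; the names ControlNormSketch and fuse_modalities are hypothetical and are not taken from the paper's code.
```python
# Hypothetical sketch (not the authors' code): fuse several modality signals
# into one control embedding and inject it into a frozen backbone feature map
# via a conditional normalisation layer in the spirit of ControlNorm.
import torch
import torch.nn as nn


class ControlNormSketch(nn.Module):
    """Normalise backbone features, then modulate them with a scale and shift
    predicted from the fused control embedding. The predicting convolutions
    are zero-initialised so the frozen backbone is unchanged at the start of
    training."""

    def __init__(self, channels: int, control_dim: int, groups: int = 32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.to_scale = nn.Conv2d(control_dim, channels, kernel_size=1)
        self.to_shift = nn.Conv2d(control_dim, channels, kernel_size=1)
        for conv in (self.to_scale, self.to_shift):
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, h: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # h:       backbone features, shape (B, channels, H, W)
        # control: fused control embedding, shape (B, control_dim, H, W)
        scale = self.to_scale(control)
        shift = self.to_shift(control)
        return self.norm(h) * (1 + scale) + shift


def fuse_modalities(signals: dict, encoders: nn.ModuleDict) -> torch.Tensor:
    """Mix any available subset of modality signals (e.g. edges, pose,
    segmentation) into a single embedding by summing per-modality encodings."""
    embeddings = [encoders[name](x) for name, x in signals.items()]
    return torch.stack(embeddings, dim=0).sum(dim=0)
```
In the paper, the fused signal is produced by the gControlNet hyper-network and injected at multiple resolutions; a single injection point is shown here only to keep the sketch self-contained.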
Related papers
- UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation [64.8341372591993]
We propose a new approach to unify controllable generation within a single framework.
Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture.
Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions.
arXiv Detail & Related papers (2024-12-25T15:19:02Z) - DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation [63.63429658282696]
We propose DynamicControl, which supports dynamic combinations of diverse control signals.
We show that DynamicControl is superior to existing methods in terms of controllability, generation quality and composability under various conditional controls.
arXiv Detail & Related papers (2024-12-04T11:54:57Z) - AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation, but struggles with fine-grained control over image generation.
AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process.
This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
arXiv Detail & Related papers (2024-06-27T07:40:59Z) - ControlVAR: Exploring Controllable Visual Autoregressive Modeling [48.66209303617063]
Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs).
Challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs.
This paper introduces ControlVAR, a novel framework that explores pixel-level controls in visual autoregressive modeling for flexible and efficient conditional generation.
arXiv Detail & Related papers (2024-06-14T06:35:33Z) - ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems [19.02295657801464]
In this work, we take an existing controlling network (ControlNet) and change the communication between the controlling network and the generation process to be high-frequency and high-bandwidth.
We outperform state-of-the-art approaches for pixel-level guidance, such as depth, Canny edges, and semantic segmentation, and are on par for loose keypoint guidance of human poses.
All code and pre-trained models will be made publicly available.
arXiv Detail & Related papers (2023-12-11T17:58:06Z) - Fine-grained Controllable Video Generation via Object Appearance and
Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, reducing the per-subject optimization effort for users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z) - Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [82.19740045010435]
We introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls and global controls.
Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models.
Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability.
arXiv Detail & Related papers (2023-05-25T17:59:58Z) - Adding Conditional Control to Text-to-Image Diffusion Models [37.98427255384245]
We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models.
ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls.
arXiv Detail & Related papers (2023-02-10T23:12:37Z)
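The ControlNet entry above summarises the core trick: lock the pretrained diffusion weights, copy the encoding layers as a trainable branch, and connect the two through zero-initialised convolutions so the control path starts as a no-op. The following Python sketch is an illustration under those assumptions, not the official implementation; backbone_block, ControlledBlock, and zero_conv are hypothetical names.
```python
# Minimal sketch of the ControlNet-style idea summarised above: a frozen
# pretrained block plus a trainable copy, joined by zero-initialised convs.
import copy

import torch
import torch.nn as nn


def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialised to zero, so the control branch
    contributes nothing until it has been trained."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class ControlledBlock(nn.Module):
    """Wraps a frozen pretrained block with a trainable copy that consumes a
    spatial condition and feeds its output back through a zero convolution."""

    def __init__(self, backbone_block: nn.Module, channels: int):
        super().__init__()
        self.frozen = backbone_block
        for p in self.frozen.parameters():
            p.requires_grad_(False)          # lock the pretrained weights
        self.trainable_copy = copy.deepcopy(backbone_block)
        self.cond_in = zero_conv(channels)   # condition enters through zeros
        self.cond_out = zero_conv(channels)  # residual leaves through zeros

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x and cond are assumed to share the same (B, channels, H, W) shape.
        base = self.frozen(x)
        ctrl = self.trainable_copy(x + self.cond_in(cond))
        return base + self.cond_out(ctrl)
```
Because both zero convolutions start at zero, the wrapped block initially reproduces the frozen backbone exactly, and the conditional control is learned gradually during fine-tuning.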