CCM: Adding Conditional Controls to Text-to-Image Consistency Models
- URL: http://arxiv.org/abs/2312.06971v1
- Date: Tue, 12 Dec 2023 04:16:03 GMT
- Title: CCM: Adding Conditional Controls to Text-to-Image Consistency Models
- Authors: Jie Xiao, Kai Zhu, Han Zhang, Zhiheng Liu, Yujun Shen, Yu Liu, Xueyang Fu, Zheng-Jun Zha
- Abstract summary: We consider alternative strategies for adding ControlNet-like conditional control to Consistency Models.
A lightweight adapter can be jointly optimized under multiple conditions through Consistency Training.
We study these three solutions across various conditional controls, including edge, depth, human pose, low-resolution image and masked image.
- Score: 89.75377958996305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Consistency Models (CMs) have shown promise in creating visual content efficiently and with high quality. However, how to add new conditional controls to pretrained CMs has not been explored. In this technical report,
we consider alternative strategies for adding ControlNet-like conditional
control to CMs and present three significant findings. 1) ControlNet trained
for diffusion models (DMs) can be directly applied to CMs for high-level
semantic controls but struggles with low-level detail and realism control. 2)
CMs serve as an independent class of generative models, based on which
ControlNet can be trained from scratch using Consistency Training proposed by
Song et al. 3) A lightweight adapter can be jointly optimized under multiple
conditions through Consistency Training, allowing for the swift transfer of
DM-based ControlNet to CMs. We study these three solutions across various conditional controls, including edge, depth, human pose, low-resolution image, and masked image, with text-to-image latent consistency models.
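To make the second and third strategies more concrete, here is a minimal, hypothetical PyTorch-style sketch of a Consistency Training step in which only a control module (a ControlNet or a lightweight adapter) receives gradients while the pretrained consistency model backbone is kept fixed. The module interfaces, argument names, and the simple MSE distance are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of Consistency Training for a control module attached to a CM.
# All module interfaces (cm, cm_ema, adapter) are illustrative assumptions.
import torch
import torch.nn.functional as F

def consistency_training_step(cm, cm_ema, adapter, optimizer, x0, text_emb, cond, sigmas):
    """One step: with the control adapter attached, the CM's output at a higher noise
    level should match the EMA target network's output at the adjacent lower level."""
    b = x0.shape[0]
    i = torch.randint(0, len(sigmas) - 1, (b,), device=x0.device)
    noise = torch.randn_like(x0)
    x_hi = x0 + sigmas[i + 1].view(b, 1, 1, 1) * noise   # sample at the noisier level
    x_lo = x0 + sigmas[i].view(b, 1, 1, 1) * noise       # same noise at the adjacent level

    # Online pass: CM backbone plus trainable control features from the adapter.
    pred = cm(x_hi, sigmas[i + 1], text_emb, control=adapter(cond, sigmas[i + 1]))

    # Target pass: EMA copy of the CM with stop-gradient, as in Consistency Training.
    with torch.no_grad():
        target = cm_ema(x_lo, sigmas[i], text_emb, control=adapter(cond, sigmas[i]))

    loss = F.mse_loss(pred, target)   # a simple distance; the paper's metric may differ
    optimizer.zero_grad()
    loss.backward()                   # only the control module's parameters are in the optimizer
    optimizer.step()
    return loss.item()
```

The point of the sketch is that the control branch is optimized with the same self-consistency objective used to train CMs themselves: predictions at adjacent noise levels along the same trajectory must agree.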
Related papers
- ControlVAR: Exploring Controllable Visual Autoregressive Modeling [48.66209303617063]
Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs).
Challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs.
This paper introduces ControlVAR, a novel framework that explores pixel-level controls in visual autoregressive modeling for flexible and efficient conditional generation.
arXiv Detail & Related papers (2024-06-14T06:35:33Z)
- ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback [20.910939141948123]
ControlNet++ is a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls.
It improves over ControlNet by 11.1% mIoU, 13.4% SSIM, and 7.6% RMSE for segmentation mask, line-art edge, and depth conditions, respectively.
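The cycle-consistency idea summarized above can be illustrated with a short, hypothetical sketch: generate an image from a condition, re-extract the same kind of condition from the result with a frozen perception model, and penalize the mismatch. The function names and the MSE distance are illustrative assumptions, not ControlNet++'s actual reward formulation.

```python
# Hypothetical sketch of a pixel-level cycle-consistency loss for conditional generation.
# `generator` and `condition_extractor` are illustrative placeholders.
import torch.nn.functional as F

def cycle_consistency_loss(generator, condition_extractor, noise, prompt_emb, cond):
    """Generate an image under a control signal, recover the control from the image,
    and penalize the discrepancy between the recovered and the input control."""
    image = generator(noise, prompt_emb, cond)   # conditional image generation
    cond_hat = condition_extractor(image)        # frozen perception model (e.g., segmentation, depth, edge)
    return F.mse_loss(cond_hat, cond)            # pixel-level consistency term (illustrative distance)
```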
arXiv Detail & Related papers (2024-04-11T17:59:09Z)
- FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition [41.92032568474062]
FreeControl is a training-free approach for controllable T2I generation.
It supports multiple conditions, architectures, and checkpoints simultaneously.
It achieves competitive synthesis quality with training-based approaches.
arXiv Detail & Related papers (2023-12-12T18:59:14Z)
- Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability over object appearance without finetuning, reducing per-subject optimization effort for users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z)
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
- Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [82.19740045010435]
We introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls and global controls.
Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models.
Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability.
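The recipe summarized above, tuning only two added adapters on top of a frozen backbone, follows a common parameter-efficient pattern; a minimal, hypothetical PyTorch sketch of that pattern is shown below. The module names and learning rate are illustrative assumptions, not Uni-ControlNet's actual code.

```python
# Hypothetical sketch: freeze the pretrained text-to-image backbone and optimize only
# the two added adapters (local controls and global controls). Names are placeholders.
import torch
from torch import nn

def make_adapter_optimizer(backbone: nn.Module,
                           local_adapter: nn.Module,
                           global_adapter: nn.Module,
                           lr: float = 1e-4) -> torch.optim.Optimizer:
    """Return an optimizer that updates only the adapters; the backbone stays frozen."""
    for p in backbone.parameters():
        p.requires_grad_(False)                    # pretrained diffusion model is not updated
    trainable = list(local_adapter.parameters()) + list(global_adapter.parameters())
    return torch.optim.AdamW(trainable, lr=lr)     # lr is an illustrative value
```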
arXiv Detail & Related papers (2023-05-25T17:59:58Z)
- Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives [89.34229413345541]
We propose a conditioning scheme which avoids pitfalls by learning the controller and its conditioning in an end-to-end manner.
Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion.
We report significant improvements in task success over representative MPC and IL baselines.
arXiv Detail & Related papers (2020-03-19T15:04:37Z)