ControlAR: Controllable Image Generation with Autoregressive Models
- URL: http://arxiv.org/abs/2410.02705v1
- Date: Thu, 3 Oct 2024 17:28:07 GMT
- Title: ControlAR: Controllable Image Generation with Autoregressive Models
- Authors: Zongming Li, Tianheng Cheng, Shoufa Chen, Peize Sun, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang
- Abstract summary: We introduce ControlAR, an efficient framework for integrating spatial controls into autoregressive image generation models.
ControlAR exploits the conditional decoding method to generate the next image token conditioned on the per-token fusion between control and image tokens.
Results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models.
- Score: 40.74890550081335
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Autoregressive (AR) models have reformulated image generation as next-token prediction, demonstrating remarkable potential and emerging as strong competitors to diffusion models. However, control-to-image generation, akin to ControlNet, remains largely unexplored within AR models. Although a natural approach, inspired by advancements in Large Language Models, is to tokenize control images and prefill the resulting tokens into the autoregressive model before decoding image tokens, it still falls short of ControlNet in generation quality and suffers from inefficiency. To this end, we introduce ControlAR, an efficient and effective framework for integrating spatial controls into autoregressive image generation models. Firstly, we explore control encoding for AR models and propose a lightweight control encoder to transform spatial inputs (e.g., Canny edges or depth maps) into control tokens. Then ControlAR exploits the conditional decoding method to generate the next image token conditioned on the per-token fusion between control and image tokens, similar to positional encodings. Compared to prefilling tokens, conditional decoding significantly strengthens the control capability of AR models while maintaining the model's efficiency. Furthermore, the proposed ControlAR surprisingly empowers AR models with arbitrary-resolution image generation via conditional decoding and specific controls. Extensive experiments demonstrate the controllability of the proposed ControlAR for autoregressive control-to-image generation across diverse inputs, including edges, depths, and segmentation masks. Furthermore, both quantitative and qualitative results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models, e.g., ControlNet++. Code, models, and a demo will soon be available at https://github.com/hustvl/ControlAR.
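The per-token fusion described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the class name, dimensions, and the choice of additive fusion plus a linear projection as the control-encoder output are assumptions; the causal attention mask used in real AR decoding is omitted for brevity.

```python
import torch
import torch.nn as nn

class ConditionalDecodingSketch(nn.Module):
    """Sketch of per-token fusion: the control feature aligned with each
    position is added to the image-token embedding, analogous to a
    positional encoding, before the transformer predicts next-token logits."""

    def __init__(self, vocab_size=1024, dim=64):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, dim)
        # Stand-in for the output projection of a lightweight control encoder.
        self.control_proj = nn.Linear(dim, dim)
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, image_tokens, control_feats):
        # image_tokens: (B, T) token ids; control_feats: (B, T, dim) aligned features.
        x = self.tok_embed(image_tokens) + self.control_proj(control_feats)  # fusion
        x = self.block(x)  # causal mask omitted in this sketch
        return self.head(x)  # (B, T, vocab_size): per-position next-token logits
```

The key point is that the control signal enters at every decoding step rather than only as a prefilled prefix, which is how the paper motivates the stronger controllability at unchanged sequence length.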
Related papers
- CAR: Controllable Autoregressive Modeling for Visual Generation [100.33455832783416]
Controllable AutoRegressive Modeling (CAR) is a novel, plug-and-play framework that integrates conditional control into multi-scale latent variable modeling.
CAR progressively refines and captures control representations, which are injected into each autoregressive step of the pre-trained model to guide the generation process.
Our approach demonstrates excellent controllability across various types of conditions and delivers higher image quality compared to previous methods.
arXiv Detail & Related papers (2024-10-07T00:55:42Z)
- ControlVAR: Exploring Controllable Visual Autoregressive Modeling [48.66209303617063]
Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs).
Challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs.
This paper introduces ControlVAR, a novel framework that explores pixel-level controls in visual autoregressive modeling for flexible and efficient conditional generation.
arXiv Detail & Related papers (2024-06-14T06:35:33Z)
- OmniControlNet: Dual-stage Integration for Conditional Image Generation [61.1432268643639]
We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method.
Our proposed OmniControlNet consolidates 1) the condition generation by a single multi-tasking dense prediction algorithm under the task embedding guidance and 2) the image generation process for different conditioning types under the textual embedding guidance.
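The dual-stage flow described above can be summarized in a short sketch. The function and argument names here are illustrative placeholders, not OmniControlNet's API: stage 1 produces the condition with a single multi-task dense predictor guided by a task embedding, and stage 2 renders the image from that condition guided by the text embedding.

```python
def dual_stage_generate(inputs, task_embedding, text_embedding,
                        dense_predictor, generator):
    """Hypothetical dual-stage pipeline in the spirit of OmniControlNet.

    dense_predictor: one multi-tasking model that emits the condition
    (edge map, depth, segmentation, ...) selected by task_embedding.
    generator: a single image generator conditioned on text_embedding.
    """
    condition = dense_predictor(inputs, task_embedding)  # stage 1: condition generation
    return generator(condition, text_embedding)          # stage 2: image generation
```

The design choice being consolidated is that many per-condition pipelines collapse into one predictor plus one generator, switched by embeddings rather than by separate models.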
arXiv Detail & Related papers (2024-06-09T18:03:47Z)
- ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback [20.910939141948123]
ControlNet++ is a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls.
It achieves improvements over ControlNet by 11.1% mIoU, 13.4% SSIM, and 7.6% RMSE, respectively, for segmentation mask, line-art edge, and depth conditions.
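The cycle-consistency idea can be sketched as a small loss function. This is a simplified illustration, not ControlNet++'s actual training code: `extract_condition` stands in for a frozen perception model (e.g., a segmentation or edge detector), and plain MSE is used here where the paper uses condition-appropriate losses.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(generated_image, input_condition, extract_condition):
    """Pixel-level cycle consistency in the spirit of ControlNet++:
    re-extract the condition from the generated image and penalize its
    disagreement with the condition the generator was given."""
    recovered = extract_condition(generated_image)  # e.g., predicted seg/edge map
    return F.mse_loss(recovered, input_condition)   # pixel-level consistency penalty
```

Optimizing this signal directly is what lets the method reward generations whose extracted conditions match the inputs, rather than relying on the conditioning pathway alone.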
arXiv Detail & Related papers (2024-04-11T17:59:09Z)
- ECNet: Effective Controllable Text-to-Image Diffusion Models [31.21525123716149]
We introduce two innovative solutions for conditional text-to-image models.
Firstly, we propose a Spatial Guidance Injector (SGI), which enhances conditional detail by encoding text inputs with precise annotation information.
Secondly, to overcome the issue of limited conditional supervision, we introduce Diffusion Consistency Loss.
This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output.
arXiv Detail & Related papers (2024-03-27T10:09:38Z)
- Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z)
- UniControl: A Unified Diffusion Model for Controllable Visual Generation in the Wild [166.25327094261038]
We introduce UniControl, a new generative foundation model for controllable condition-to-image (C2I) tasks.
UniControl consolidates a wide array of C2I tasks within a singular framework, while still allowing for arbitrary language prompts.
Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities.
arXiv Detail & Related papers (2023-05-18T17:41:34Z)
- Adding Conditional Control to Text-to-Image Diffusion Models [37.98427255384245]
We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models.
ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls.
arXiv Detail & Related papers (2023-02-10T23:12:37Z)
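The locked-backbone-plus-trainable-copy pattern that ControlNet describes can be sketched at the level of a single block. This is a minimal sketch under simplifying assumptions, not the paper's architecture: `block` stands in for one encoder block of a large diffusion model, and the way the control signal is merged into the trainable branch is illustrative.

```python
import copy
import torch
import torch.nn as nn

class ControlBranchSketch(nn.Module):
    """ControlNet pattern in miniature: freeze a pretrained block, train a
    copy of it on the control signal, and join the two through a
    zero-initialized 1x1 conv so the combined model starts out identical
    to the locked backbone."""

    def __init__(self, block, channels):
        super().__init__()
        self.trainable = copy.deepcopy(block)  # trainable copy learns the control
        self.locked = block
        for p in self.locked.parameters():
            p.requires_grad_(False)            # pretrained weights stay fixed
        self.zero_conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)  # zero init: branch is a no-op at step 0
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, x, control):
        return self.locked(x) + self.zero_conv(self.trainable(x + control))
```

Because the connecting conv starts at zero, the first training step reproduces the pretrained model exactly, which is how the design preserves the "deep and robust encoding layers" while the control branch learns.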
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.