Related papers: ControlVAR: Exploring Controllable Visual Autoregressive Modeling

ControlVAR: Exploring Controllable Visual Autoregressive Modeling

URL: http://arxiv.org/abs/2406.09750v1
Date: Fri, 14 Jun 2024 06:35:33 GMT
Title: ControlVAR: Exploring Controllable Visual Autoregressive Modeling
Authors: Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Zhe Lin, Rita Singh, Bhiksha Raj,
Abstract summary: Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs) Challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs.
Score: 48.66209303617063
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs), especially in tasks like control-to-image generation. However, challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs. This paper introduces ControlVAR, a novel framework that explores pixel-level controls in visual autoregressive (VAR) modeling for flexible and efficient conditional generation. In contrast to traditional conditional models that learn the conditional distribution, ControlVAR jointly models the distribution of image and pixel-level conditions during training and imposes conditional controls during testing. To enhance the joint modeling, we adopt the next-scale AR prediction paradigm and unify control and image representations. A teacher-forcing guidance strategy is proposed to further facilitate controllable generation with joint modeling. Extensive experiments demonstrate the superior efficacy and flexibility of ControlVAR across various conditional generation tasks against popular conditional DMs, \eg, ControlNet and T2I-Adaptor.

Related papers

SCALAR: Scale-wise Controllable Visual Autoregressive Learning [15.775596699630633]
We present SCALAR, a controllable generation method based on Visual Autoregressive ( VAR)<n>We leverage a pretrained image encoder to extract semantic control signal encodings, which are projected into scale-specific representations and injected into the corresponding layers of the VAR backbone.<n>Building on SCALAR, we develop SCALAR-Uni, a unified extension that aligns multiple control modalities into a shared latent space, supporting flexible multi-conditional guidance in a single model.
arXiv Detail & Related papers (2025-07-26T13:23:08Z)
A Practical Investigation of Spatially-Controlled Image Generation with Transformers [16.682348277650817]
We aim to provide clear takeaways across generation paradigms for practitioners wishing to develop systems for spatially-controlled generation.<n>We perform controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive (AR) models.
arXiv Detail & Related papers (2025-07-21T15:33:49Z)
ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning [76.2503352325492]
ControlThinker is a novel framework that employs a "comprehend-then-generate" paradigm.<n>Latent semantics from control images are mined to enrich text prompts.<n>This enriched semantic understanding then seamlessly aids in image generation without the need for additional complex modifications.
arXiv Detail & Related papers (2025-06-04T05:56:19Z)
DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation [63.63429658282696]
We propose DynamicControl, which supports dynamic combinations of diverse control signals. We show that DynamicControl is superior to existing methods in terms of controllability, generation quality and composability under various conditional controls.
arXiv Detail & Related papers (2024-12-04T11:54:57Z)
A Simple Approach to Unifying Diffusion-based Conditional Generation [63.389616350290595]
We introduce a simple, unified framework to handle diverse conditional generation tasks. Our approach enables versatile capabilities via different inference-time sampling schemes. Our model supports additional capabilities like non-spatially aligned and coarse conditioning.
arXiv Detail & Related papers (2024-10-15T09:41:43Z)
Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction [88.65168366064061]
We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre-trained MDMs as a problem of probabilistic inference. Our framework leads to a family of three novel objectives that are all simulation-free, and thus scalable. We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.
arXiv Detail & Related papers (2024-10-10T17:18:30Z)
CAR: Controllable Autoregressive Modeling for Visual Generation [100.33455832783416]
Controllable AutoRegressive Modeling (CAR) is a novel, plug-and-play framework that integrates conditional control into multi-scale latent variable modeling. CAR progressively refines and captures control representations, which are injected into each autoregressive step of the pre-trained model to guide the generation process. Our approach demonstrates excellent controllability across various types of conditions and delivers higher image quality compared to previous methods.
arXiv Detail & Related papers (2024-10-07T00:55:42Z)
ControlAR: Controllable Image Generation with Autoregressive Models [40.74890550081335]
We introduce ControlAR, an efficient framework for integrating spatial controls into autoregressive image generation models. ControlAR exploits the conditional decoding method to generate the next image token conditioned on the per-token fusion between control and image tokens. Results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models.
arXiv Detail & Related papers (2024-10-03T17:28:07Z)
TCIG: Two-Stage Controlled Image Generation with Quality Enhancement through Diffusion [0.0]
A two-stage method that combines controllability and high quality in the generation of images is proposed. By separating controllability from high quality, This method achieves outstanding results.
arXiv Detail & Related papers (2024-03-02T13:59:02Z)
Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation [79.8881514424969]
Text-conditional diffusion models are able to generate high-fidelity images with diverse contents. However, linguistic representations frequently exhibit ambiguous descriptions of the envisioned objective imagery. We propose Cocktail, a pipeline to mix various modalities into one embedding.
arXiv Detail & Related papers (2023-06-01T17:55:32Z)
UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild [166.25327094261038]
We introduce UniControl, a new generative foundation model for controllable condition-to-image (C2I) tasks. UniControl consolidates a wide array of C2I tasks within a singular framework, while still allowing for arbitrary language prompts. trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities.
arXiv Detail & Related papers (2023-05-18T17:41:34Z)
ControlVAE: Model-Based Learning of Generative Controllers for Physics-Based Characters [28.446959320429656]
We introduce ControlVAE, a model-based framework for learning generative motion control policies based on variational autoencoders (VAE) Our framework can learn a rich and flexible latent representation of skills and a skill-conditioned generative control policy from a diverse set of unorganized motion sequences. We demonstrate the effectiveness of ControlVAE using a diverse set of tasks, which allows realistic and interactive control of the simulated characters.
arXiv Detail & Related papers (2022-10-12T10:11:36Z)
Transformer-based Conditional Variational Autoencoder for Controllable Story Generation [39.577220559911055]
We investigate large-scale latent variable models (LVMs) for neural story generation with objectives in two threads: generation effectiveness and controllability. We advocate to revive latent variable modeling, essentially the power of representation learning, in the era of Transformers. Specifically, we integrate latent representation vectors with a Transformer-based pre-trained architecture to build conditional variational autoencoder (CVAE)
arXiv Detail & Related papers (2021-01-04T08:31:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.