Related papers: Self-control: A Better Conditional Mechanism for Masked Autoregressive Model

URL: http://arxiv.org/abs/2412.13635v1
Date: Wed, 18 Dec 2024 09:09:39 GMT
Title: Self-control: A Better Conditional Mechanism for Masked Autoregressive Model
Authors: Qiaoying Qu, Shiyu Shen,
Abstract summary: This paper introduces a novel conditional introduction network for continuous masked autoregressive models. The proposed self-control network serves to mitigate the negative impact of vector quantization on the quality of the generated images.
Score: 1.9950682531209158
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive conditional image generation algorithms are capable of generating photorealistic images that are consistent with given textual or image conditions, and have great potential for a wide range of applications. Nevertheless, the majority of popular autoregressive image generation methods rely heavily on vector quantization, and the inherent discrete characteristic of codebook presents a considerable challenge to achieving high-quality image generation. To address this limitation, this paper introduces a novel conditional introduction network for continuous masked autoregressive models. The proposed self-control network serves to mitigate the negative impact of vector quantization on the quality of the generated images, while simultaneously enhancing the conditional control during the generation process. In particular, the self-control network is constructed upon a continuous mask autoregressive generative model, which incorporates multimodal conditional information, including text and images, into a unified autoregressive sequence in a serial manner. Through a self-attention mechanism, the network is capable of generating images that are controllable based on specific conditions. The self-control network discards the conventional cross-attention-based conditional fusion mechanism and effectively unifies the conditional and generative information within the same space, thereby facilitating more seamless learning and fusion of multimodal features.
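The fusion idea in the abstract (placing condition tokens and image tokens in one sequence and mixing them with self-attention rather than cross-attention) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, token counts, embedding width, and random weights below are all hypothetical, and masking, positional encoding, and the continuous-token generation head are left out.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def fuse_conditions(text_tokens, cond_image_tokens, image_tokens, w_q, w_k, w_v):
    """Concatenate conditions and image tokens serially, then attend jointly.

    Hypothetical helper: every token shares one embedding space, so there
    is no separate cross-attention branch for the conditions.
    """
    sequence = text_tokens + cond_image_tokens + image_tokens
    mixed = self_attention(sequence, w_q, w_k, w_v)
    n_cond = len(text_tokens) + len(cond_image_tokens)
    return mixed[n_cond:]  # keep only the positions of the image tokens


def random_matrix(rows, cols, rng, std=1.0):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 16  # assumed embedding width
text = random_matrix(4, dim, rng)   # assumed: 4 text-condition tokens
cond = random_matrix(8, dim, rng)   # assumed: 8 image-condition tokens
image = random_matrix(8, dim, rng)  # assumed: 8 tokens being generated
w_q, w_k, w_v = (random_matrix(dim, dim, rng, std=0.25) for _ in range(3))

fused = fuse_conditions(text, cond, image, w_q, w_k, w_v)
print(len(fused), len(fused[0]))  # 8 16
```

Because the condition tokens contribute keys and values to the same attention call, changing them changes the output at the image-token positions, which is the sense in which the sequence is "self-controlled" in this sketch.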

Related papers

A Watermark for Auto-Regressive Image Generation Models [50.599325258178254]
We propose C-reweight, a distortion-free watermarking method explicitly designed for image generation models. C-reweight mitigates retokenization mismatch while preserving image fidelity.
arXiv Detail & Related papers (2025-06-13T00:15:54Z)
Context-Aware Autoregressive Models for Multi-Conditional Image Generation [24.967166342680112]
ContextAR is a flexible and effective framework for multi-conditional image generation. It embeds diverse conditions directly into the token sequence, preserving modality-specific semantics. We show performance competitive with diffusion-based multi-conditional control approaches and the existing autoregressive baseline.
arXiv Detail & Related papers (2025-05-18T07:27:02Z)
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance. Experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z)
D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens [80.75893450536577]
We propose D2C, a novel two-stage method to enhance model generation capacity. In the first stage, the discrete-valued tokens representing coarse-grained image features are sampled by employing a small discrete-valued generator. In the second stage, the continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence.
arXiv Detail & Related papers (2025-03-21T13:58:49Z)
A Simple Approach to Unifying Diffusion-based Conditional Generation [63.389616350290595]
We introduce a simple, unified framework to handle diverse conditional generation tasks. Our approach enables versatile capabilities via different inference-time sampling schemes. Our model supports additional capabilities like non-spatially aligned and coarse conditioning.
arXiv Detail & Related papers (2024-10-15T09:41:43Z)
MCGM: Mask Conditional Text-to-Image Generative Model [1.909929271850469]
We propose a novel Conditional Mask Text-to-Image Generative Model (MCGM). Our model builds upon the success of the Break-a-scene [1] model in generating new scenes using a single image with multiple subjects. By introducing this additional level of control, MCGM offers a flexible and intuitive approach for generating specific poses for one or more subjects learned from a single image.
arXiv Detail & Related papers (2024-10-01T08:13:47Z)
ControlVAR: Exploring Controllable Visual Autoregressive Modeling [48.66209303617063]
Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs). Challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs. This paper introduces ControlVAR, a novel framework that explores pixel-level controls in visual autoregressive modeling for flexible and efficient conditional generation.
arXiv Detail & Related papers (2024-06-14T06:35:33Z)
RefDrop: Controllable Consistency in Image or Video Generation via Reference Feature Guidance [22.326405355520176]
RefDrop allows users to control the influence of reference context in a direct and precise manner. Our method also enables more interesting applications, such as the consistent generation of multiple subjects.
arXiv Detail & Related papers (2024-05-27T21:23:20Z)
Attack Deterministic Conditional Image Generative Models for Diverse and Controllable Generation [17.035117118768945]
We propose a plug-in projected gradient descent (PGD) like method for diverse and controllable image generation. The key idea is attacking the pre-trained deterministic generative models by adding a micro perturbation to the input condition. Our work opens the door to applying adversarial attack to low-level vision tasks.
arXiv Detail & Related papers (2024-03-13T06:57:23Z)
CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation [49.3016007471979]
Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks. However, their widespread adoption is hindered by the high computational cost, which limits their real-time application. We introduce a novel method dubbed CoDi, that adapts a pre-trained latent diffusion model to accept additional image conditioning inputs.
arXiv Detail & Related papers (2023-10-02T17:59:18Z)
Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation. We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
arXiv Detail & Related papers (2023-09-30T02:03:22Z)
Conditional Generation from Unconditional Diffusion Models using Denoiser Representations [94.04631421741986]
We propose adapting pre-trained unconditional diffusion models to new conditions using the learned internal representations of the denoiser network. We show that augmenting the Tiny ImageNet training set with synthetic images generated by our approach improves the classification accuracy of ResNet baselines by up to 8%.
arXiv Detail & Related papers (2023-06-02T20:09:57Z)
MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation [34.61940502872307]
MultiDiffusion is a unified framework that enables versatile and controllable image generation. We show that MultiDiffusion can be readily applied to generate high quality and diverse images.
arXiv Detail & Related papers (2023-02-16T06:28:29Z)
Auto-regressive Image Synthesis with Integrated Quantization [55.51231796778219]
This paper presents a versatile framework for conditional image generation. It incorporates the inductive bias of CNNs and powerful sequence modeling of auto-regression. Our method achieves superior diverse image generation performance as compared with the state-of-the-art.
arXiv Detail & Related papers (2022-07-21T22:19:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.