Context-Aware Autoregressive Models for Multi-Conditional Image Generation
- URL: http://arxiv.org/abs/2505.12274v1
- Date: Sun, 18 May 2025 07:27:02 GMT
- Title: Context-Aware Autoregressive Models for Multi-Conditional Image Generation
- Authors: Yixiao Chen, Zhiyuan Ma, Guoli Jia, Che Jiang, Jianjun Li, Bowen Zhou
- Abstract summary: ContextAR is a flexible and effective framework for multi-conditional image generation. It embeds diverse conditions directly into the token sequence, preserving modality-specific semantics. We show competitive performance compared with diffusion-based multi-conditional control approaches and the existing autoregressive baseline.
- Score: 24.967166342680112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive transformers have recently shown impressive image generation quality and efficiency on par with state-of-the-art diffusion models. Unlike diffusion architectures, autoregressive models can naturally incorporate arbitrary modalities into a single, unified token sequence--offering a concise solution for multi-conditional image generation tasks. In this work, we propose $\textbf{ContextAR}$, a flexible and effective framework for multi-conditional image generation. ContextAR embeds diverse conditions (e.g., canny edges, depth maps, poses) directly into the token sequence, preserving modality-specific semantics. To maintain spatial alignment while enhancing discrimination among different condition types, we introduce hybrid positional encodings that fuse Rotary Position Embedding with Learnable Positional Embedding. We design Conditional Context-aware Attention to reduce computational complexity while preserving effective intra-condition perception. Without any fine-tuning, ContextAR supports arbitrary combinations of conditions during inference time. Experimental results demonstrate the powerful controllability and versatility of our approach, and show competitive performance compared with diffusion-based multi-conditional control approaches and the existing autoregressive baseline across diverse multi-condition driven scenarios. Project page: $\href{https://context-ar.github.io/}{https://context-ar.github.io/.}$
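The hybrid positional encoding described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the fusion order (add the learnable per-condition-type embedding, then apply RoPE), and all dimensions are illustrative assumptions.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary Position Embedding: rotate feature pairs by position-dependent angles."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)      # (half,) frequency per pair
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def hybrid_positions(tokens, positions, cond_ids, learnable_table):
    """Fuse RoPE (spatial alignment across image/condition tokens) with a
    learnable embedding per condition type (discrimination among types).
    Fusion order is an assumption for illustration."""
    x = tokens + learnable_table[cond_ids]  # learned offset per condition type
    return rope(x, positions)

rng = np.random.default_rng(0)
seq, d, n_types = 6, 8, 3
tokens = rng.standard_normal((seq, d))
positions = np.arange(seq, dtype=float)
cond_ids = np.array([0, 0, 1, 1, 2, 2])  # e.g. canny, depth, pose token groups
table = rng.standard_normal((n_types, d)) * 0.02
out = hybrid_positions(tokens, positions, cond_ids, table)
print(out.shape)  # (6, 8)
```

Because RoPE is a pure rotation, it preserves token norms, so the spatial signal is carried entirely in phase; the learnable table is what separates otherwise spatially aligned tokens of different condition types.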
Related papers
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance. Experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z) - CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation [11.170848285659572]
Autoencoder accuracy on segmentation masks using quantized embeddings is 8% lower than with continuous-valued embeddings. We propose a continuous-valued embedding framework for semantic segmentation. Our approach eliminates the need for discrete latent representations while preserving fine-grained semantic details.
arXiv Detail & Related papers (2025-03-19T18:06:54Z) - Conditional Consistency Guided Image Translation and Enhancement [0.0]
We introduce Conditional Consistency Models (CCMs) for multi-domain image translation. We implement these modifications by introducing task-specific conditional inputs that guide the denoising process. We evaluate CCMs on 10 different datasets, demonstrating their effectiveness in producing high-quality translated images.
arXiv Detail & Related papers (2025-01-02T12:13:31Z) - Self-control: A Better Conditional Mechanism for Masked Autoregressive Model [1.9950682531209158]
This paper introduces a novel conditional introduction network for continuous masked autoregressive models. The proposed self-control network serves to mitigate the negative impact of vector quantization on the quality of the generated images.
arXiv Detail & Related papers (2024-12-18T09:09:39Z) - A Simple Approach to Unifying Diffusion-based Conditional Generation [63.389616350290595]
We introduce a simple, unified framework to handle diverse conditional generation tasks. Our approach enables versatile capabilities via different inference-time sampling schemes. Our model supports additional capabilities like non-spatially aligned and coarse conditioning.
arXiv Detail & Related papers (2024-10-15T09:41:43Z) - Auto-regressive Image Synthesis with Integrated Quantization [55.51231796778219]
This paper presents a versatile framework for conditional image generation.
It incorporates the inductive bias of CNNs and powerful sequence modeling of auto-regression.
Our method achieves superior diverse image generation performance as compared with the state-of-the-art.
arXiv Detail & Related papers (2022-07-21T22:19:17Z) - Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches. We propose a novel framework based on DDPM for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z) - Spatially Multi-conditional Image Generation [80.04130168156792]
We propose a novel neural architecture to address the problem of multi-conditional image generation.
The proposed method uses a transformer-like architecture operating pixel-wise, which receives the available labels as input tokens.
Our experiments on three benchmark datasets demonstrate the clear superiority of our method over the state-of-the-art and the compared baselines.
arXiv Detail & Related papers (2022-03-25T17:57:13Z) - Diverse Semantic Image Synthesis via Probability Distribution Modeling [103.88931623488088]
We propose a novel diverse semantic image synthesis framework.
Our method can achieve superior diversity and comparable quality compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-03-11T18:59:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.