UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation
- URL: http://arxiv.org/abs/2508.05399v1
- Date: Thu, 07 Aug 2025 13:51:17 GMT
- Title: UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation
- Authors: Wonjun Kang, Byeongkeun Ahn, Minjae Lee, Kevin Galim, Seunghyuk Oh, Hyung Il Koo, Nam Ik Cho
- Abstract summary: Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding. We propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps. UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead.
- Score: 15.585320469279813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image (T2I) generation has been actively studied using Diffusion Models and Autoregressive Models. Recently, Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding through bidirectional attention and parallel decoding, enabling efficient and high-quality image generation. However, compositional T2I generation remains challenging, as even state-of-the-art Diffusion Models often fail to accurately bind attributes and achieve proper text-image alignment. While Diffusion Models have been extensively studied for this issue, Masked Generative Transformers exhibit similar limitations but have not been explored in this context. To address this, we propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps to prioritize the unmasking of tokens that clearly represent individual objects. UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Our code is available at https://github.com/furiosa-ai/uncage.
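To make the mechanism concrete, here is a minimal sketch of attention-guided unmasking order. It is not the released implementation (see the repository linked above for that): the attention-map shape, the split of prompt tokens into one object's words versus the rest, and the helper names (`unmasking_priority`, `select_tokens_to_unmask`) are illustrative assumptions.

```python
import torch

def unmasking_priority(attn, object_token_ids, other_token_ids):
    """Contrastive priority score per image-token position.

    attn: [num_image_tokens, num_text_tokens] cross-attention weights
          between image positions and prompt tokens.
    Returns a [num_image_tokens] score: positions that attend strongly to one
    object's words and weakly to the rest of the prompt score highest.
    """
    pos = attn[:, object_token_ids].sum(dim=-1)  # attention on the target object's words
    neg = attn[:, other_token_ids].sum(dim=-1)   # attention on competing prompt words
    return pos - neg                             # contrastive score

def select_tokens_to_unmask(scores, still_masked, k):
    """Pick the k highest-priority positions among those still masked."""
    scores = scores.masked_fill(~still_masked, float("-inf"))
    return scores.topk(k).indices

# Toy usage: 16 image positions, a 7-token prompt such as "a red ball and a blue cube".
attn = torch.rand(16, 7).softmax(dim=-1)
still_masked = torch.ones(16, dtype=torch.bool)
scores = unmasking_priority(attn, object_token_ids=[1, 2], other_token_ids=[0, 3, 4, 5, 6])
next_positions = select_tokens_to_unmask(scores, still_masked, k=4)
```

In a masked generative transformer, such scores would be recomputed at each parallel-decoding step, so positions that attend cleanly to a single object's words are committed first and later positions can condition on them.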
Related papers
- Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers [79.94246924019984]
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. We propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models.
arXiv Detail & Related papers (2025-06-09T17:54:04Z)
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer. Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
- D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens [80.75893450536577]
We propose D2C, a novel two-stage method to enhance model generation capacity. In the first stage, the discrete-valued tokens representing coarse-grained image features are sampled by a small discrete-valued generator. In the second stage, the continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence.
arXiv Detail & Related papers (2025-03-21T13:58:49Z)
- Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens [46.361925096761915]
We introduce the Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok). TA-TiTok is an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. We also introduce a family of text-to-image Masked Generative Models (MaskGen) trained exclusively on open data.
arXiv Detail & Related papers (2025-01-13T22:37:17Z)
- Fusion is all you need: Face Fusion for Customized Identity-Preserving Image Synthesis [7.099258248662009]
Text-to-image (T2I) models have significantly advanced the development of artificial intelligence.
However, existing T2I-based methods often struggle to accurately reproduce the appearance of individuals from a reference image.
We leverage the pre-trained UNet from Stable Diffusion to incorporate the target face image directly into the generation process.
arXiv Detail & Related papers (2024-09-27T19:31:04Z)
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z)
- DivCon: Divide and Conquer for Progressive Text-to-Image Generation [0.0]
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements.
Layout is employed as an intermediary to bridge large language models and layout-based diffusion models.
We introduce a divide-and-conquer approach which decouples the T2I generation task into simple subtasks.
arXiv Detail & Related papers (2024-03-11T03:24:44Z)
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor behind the text-image mismatch issue is inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features (a rough sketch of this idea follows this entry).
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z)
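As a rough illustration of MaskDiffusion's adaptive-mask idea summarized above, the sketch below reweights each text token's column of the cross-attention map before aggregating the value vectors. It is an assumption-laden paraphrase, not the paper's implementation: the tensor shapes, the salience heuristic, and the helper names (`adaptive_text_token_mask`, `masked_cross_attention`) are all hypothetical.

```python
import torch

def adaptive_text_token_mask(attn, prompt_emb):
    """Hypothetical adaptive mask over text tokens.

    attn:       [num_image_tokens, num_text_tokens] cross-attention map.
    prompt_emb: [num_text_tokens, dim] prompt embeddings.
    Returns:    [num_text_tokens] weights used to rescale each token's contribution.
    """
    peak = attn.max(dim=0).values            # how strongly each text token is attended anywhere
    salience = prompt_emb.norm(dim=-1)
    salience = salience / salience.mean()    # relative salience derived from the prompt embedding
    return torch.sigmoid(peak * salience)    # per-token weights in (0, 1)

def masked_cross_attention(attn, values, weights):
    """Rescale each text token's column, renormalize rows, and aggregate the values."""
    attn = attn * weights
    attn = attn / attn.sum(dim=-1, keepdim=True)
    return attn @ values                      # [num_image_tokens, dim]

# Toy usage: 64 image tokens, 10 text tokens, 32-dimensional value vectors.
attn = torch.rand(64, 10).softmax(dim=-1)
prompt_emb = torch.randn(10, 32)
values = torch.randn(10, 32)
out = masked_cross_attention(attn, values, adaptive_text_token_mask(attn, prompt_emb))
```

The appeal of this kind of reweighting is that it touches only the attention computation, which is why a training-free, plug-in formulation of the idea is plausible.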