Local Representative Token Guided Merging for Text-to-Image Generation
- URL: http://arxiv.org/abs/2507.12771v1
- Date: Thu, 17 Jul 2025 04:16:24 GMT
- Title: Local Representative Token Guided Merging for Text-to-Image Generation
- Authors: Min-Jeong Lee, Hee-Dong Kim, Seong-Whan Lee
- Abstract summary: Local representative token guided merging (ReToM) is a novel token merging strategy applicable to any attention mechanism in image generation. Experimental results show that ReToM achieves a 6.2% improvement in FID and higher CLIP scores compared to the baseline.
- Score: 26.585985828583304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stable Diffusion is an outstanding text-to-image generation model, but its time-consuming generation process remains a challenge due to the quadratic complexity of attention operations. Recent token merging methods improve efficiency by reducing the number of tokens during attention operations, but often overlook the characteristics of attention-based image generation models, limiting their effectiveness. In this paper, we propose local representative token guided merging (ReToM), a novel token merging strategy applicable to any attention mechanism in image generation. To merge tokens based on diverse contextual information, ReToM defines local boundaries as windows within attention inputs and adjusts window sizes. Furthermore, we introduce a representative token for each window, selected by computing token similarities at a specific timestep and choosing the token with the highest average similarity. This approach preserves the most salient local features while minimizing computational overhead. Experimental results show that ReToM achieves a 6.2% improvement in FID and higher CLIP scores compared to the baseline, while maintaining comparable inference time. We empirically demonstrate that ReToM is effective in balancing visual quality and computational efficiency.
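The abstract describes the selection step concretely enough for a sketch: partition attention-input tokens into local windows, score each token by its average similarity to the others in its window, and keep the highest-scoring token as the window's representative. Below is a minimal PyTorch sketch of that selection step only; the non-overlapping windowing, the fixed window size, and the use of cosine similarity are assumptions, and the paper's timestep-specific computation and adaptive window sizing are omitted.

```python
import torch
import torch.nn.functional as F

def select_representative_tokens(x: torch.Tensor, window: int) -> torch.Tensor:
    """Pick one representative token per local window.

    x: (num_tokens, dim) attention-input tokens; num_tokens is assumed
    divisible by `window` for simplicity.
    Returns: (num_tokens // window, dim) representative tokens.
    """
    n, d = x.shape
    w = x.view(n // window, window, d)  # non-overlapping local windows
    # Pairwise cosine similarity within each window: (W, window, window)
    sim = F.cosine_similarity(w.unsqueeze(2), w.unsqueeze(1), dim=-1)
    avg_sim = sim.mean(dim=-1)          # average similarity of each token to its window
    rep_idx = avg_sim.argmax(dim=-1)    # token with the highest average similarity
    return w[torch.arange(w.size(0)), rep_idx]
```

In a full merging pass, the non-representative tokens in each window would then be merged into their representative before attention, which is how the token count, and hence the quadratic attention cost, is reduced.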
Related papers
- MARché: Fast Masked Autoregressive Image Generation with Cache-Aware Attention [10.077033449956806]
Masked autoregressive (MAR) models unify the strengths of masked and autoregressive generation by predicting tokens in a fixed order using bidirectional attention for image generation. While effective, MAR models suffer from significant computational overhead, as they recompute attention and feed-forward representations for all tokens at every decoding step. We propose a training-free generation framework MARché to address this inefficiency through two key components: cache-aware attention and selective KV refresh.
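The "selective KV refresh" idea lends itself to a brief sketch: instead of recomputing key/value projections for every token at every decoding step, recompute them only for tokens whose inputs changed and reuse the cache elsewhere. A minimal illustration under that reading; the projection shapes, the `changed` mask, and the in-place cache update are illustrative assumptions, not the paper's exact design:

```python
import torch

def selective_kv_refresh(k_cache: torch.Tensor, v_cache: torch.Tensor,
                         x: torch.Tensor, wk: torch.Tensor, wv: torch.Tensor,
                         changed: torch.Tensor):
    """Recompute cached keys/values only where the input tokens changed.

    k_cache, v_cache: (n, d) cached projections from earlier steps.
    x: (n, d) current token representations; wk, wv: (d, d) projection weights.
    changed: (n,) boolean mask marking stale cache entries.
    """
    k_cache[changed] = x[changed] @ wk  # refresh only stale keys
    v_cache[changed] = x[changed] @ wv  # refresh only stale values
    return k_cache, v_cache
```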
arXiv Detail & Related papers (2025-05-22T23:26:56Z)
- CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching [31.42896369011162]
CoMatch is a novel semi-dense image matcher with dynamic covisibility awareness and bilateral subpixel accuracy. A covisibility-guided token condenser is introduced to adaptively aggregate tokens in light of their covisibility scores. A fine correlation module is developed to refine the matching candidates in both source and target views to subpixel level.
arXiv Detail & Related papers (2025-03-31T10:17:01Z)
- Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection [54.21851618853518]
We present a concise yet effective approach called Patch Generation-to-Selection to enhance CLIP's training efficiency. Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks.
arXiv Detail & Related papers (2025-03-21T12:10:38Z)
- CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution [42.76046559103463]
Transformer-based methods have demonstrated impressive performance in low-level visual tasks such as Image Super-Resolution (SR). However, these methods limit attention to content-agnostic local regions, directly limiting the ability of attention to capture long-range dependencies. We propose a lightweight Content-Aware Token Aggregation Network (CATANet) to address these issues. Our method achieves superior performance, with a maximum PSNR improvement of 0.33 dB and nearly double the inference speed.
arXiv Detail & Related papers (2025-03-10T04:00:27Z)
- Importance-Based Token Merging for Efficient Image and Video Generation [41.94334394794811]
We show that preserving high-information tokens during merging significantly improves sample quality. We propose an importance-based token merging method that prioritizes the most critical tokens in computational resource allocation.
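The summary suggests a simple recipe: protect the most important tokens from merging and fold the remaining tokens into their most similar protected neighbor. A hedged sketch of that idea follows; the importance scores are treated as given inputs (the paper's own scoring is not reproduced here), and the averaging merge rule is an assumption:

```python
import torch
import torch.nn.functional as F

def importance_guided_merge(x: torch.Tensor, importance: torch.Tensor, keep: int) -> torch.Tensor:
    """Merge low-importance tokens into their most similar high-importance token.

    x: (n, d) tokens; importance: (n,) per-token scores; keep: tokens to preserve.
    """
    keep_idx = importance.topk(keep).indices
    kept = x[keep_idx]                       # high-information tokens, preserved as-is
    mask = torch.ones(x.size(0), dtype=torch.bool)
    mask[keep_idx] = False
    rest = x[mask]                           # low-importance tokens to merge away
    sim = F.normalize(rest, dim=-1) @ F.normalize(kept, dim=-1).T
    dst = sim.argmax(dim=-1)                 # nearest kept token for each merged one
    merged, counts = kept.clone(), torch.ones(keep)
    merged.index_add_(0, dst, rest)          # accumulate merged tokens per destination
    counts.index_add_(0, dst, torch.ones(rest.size(0)))
    return merged / counts.unsqueeze(-1)     # average each destination group
```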
arXiv Detail & Related papers (2024-11-23T02:01:49Z)
- Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens [57.37893387775829]
We introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC). SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner. We propose a versatile vision backbone, SECViT, to serve as a vision language connector.
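"Equitable" points at equal-sized clusters, which admits a very simple sketch: rank tokens by relevance to a global summary and chunk the ranking into equal groups. This is an assumption-laden illustration; the mean-token summary and the sort-and-chunk assignment are stand-ins, not necessarily the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def semantic_equitable_clusters(x: torch.Tensor, num_clusters: int) -> torch.Tensor:
    """Assign tokens to equal-sized clusters by global semantic relevance.

    x: (n, d) tokens, n assumed divisible by num_clusters.
    Returns: (n,) cluster ids, each cluster holding n // num_clusters tokens.
    """
    global_token = x.mean(dim=0, keepdim=True)        # crude global semantic summary
    relevance = F.cosine_similarity(x, global_token, dim=-1)
    order = relevance.argsort(descending=True)        # rank tokens by relevance
    ids = torch.empty(x.size(0), dtype=torch.long)
    ids[order] = torch.arange(x.size(0)) // (x.size(0) // num_clusters)
    return ids
```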
arXiv Detail & Related papers (2024-05-22T04:49:00Z)
- AMMUNet: Multi-Scale Attention Map Merging for Remote Sensing Image Segmentation [4.618389486337933]
We propose AMMUNet, a UNet-based framework that employs multi-scale attention map merging.
The proposed AMMM effectively combines multi-scale attention maps into a unified representation using a fixed mask template.
We show that our approach achieves remarkable mean intersection over union (mIoU) scores of 75.48% on the Vaihingen dataset and an exceptional 77.90% on the Potsdam dataset.
arXiv Detail & Related papers (2024-04-20T15:23:15Z)
- Subobject-level Image Tokenization [60.80949852899857]
Patch-based image tokenization ignores the morphology of the visual world. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation. We show that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
arXiv Detail & Related papers (2024-02-22T06:47:44Z)
- MST: Adaptive Multi-Scale Tokens Guided Interactive Segmentation [8.46894039954642]
We propose a novel multi-scale token adaptation algorithm for interactive segmentation.
Performing top-k operations across multi-scale tokens greatly reduces the computational complexity.
We also propose a token learning algorithm based on contrastive loss.
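The per-scale top-k step is easy to picture: score every token at every scale against a query and keep only the k best per scale, so attention runs over a small, fixed token budget. A minimal sketch under those assumptions; dot-product scoring and a uniform per-scale budget are illustrative choices, not the paper's exact design:

```python
import torch

def topk_multiscale_tokens(scales, query, k):
    """Keep the k tokens most relevant to the query at each scale.

    scales: list of (n_i, d) token tensors from different resolutions.
    query: (d,) query feature (e.g., an embedded user click); k: per-scale budget.
    """
    kept = []
    for tokens in scales:
        scores = tokens @ query                       # relevance to the query
        idx = scores.topk(min(k, tokens.size(0))).indices
        kept.append(tokens[idx])
    return torch.cat(kept, dim=0)                     # compact multi-scale token set
```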
arXiv Detail & Related papers (2024-01-09T07:59:42Z)
- Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
However, their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
- Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models [52.29800567587504]
We propose a learnable sampling model, Text-Conditioned Token Selection (TCTS), to select optimal tokens via localized supervision with text information.
TCTS improves not only the image quality but also the semantic alignment of the generated images with the given texts.
We validate the efficacy of TCTS combined with Frequency Adaptive Sampling (FAS) with various generative tasks, demonstrating that it significantly outperforms the baselines in image-text alignment and image quality.
arXiv Detail & Related papers (2023-04-04T03:52:49Z)
- Robust Person Re-Identification through Contextual Mutual Boosting [77.1976737965566]
We propose the Contextual Mutual Boosting Network (CMBN), which localizes pedestrians and recalibrates features by effectively exploiting contextual information and statistical inference.
Experiments on the benchmarks demonstrate the superiority of the architecture compared with the state-of-the-art.
arXiv Detail & Related papers (2020-09-16T06:33:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.