ToMA: Token Merge with Attention for Diffusion Models
- URL: http://arxiv.org/abs/2509.10918v2
- Date: Tue, 23 Sep 2025 02:10:29 GMT
- Title: ToMA: Token Merge with Attention for Diffusion Models
- Authors: Wenbo Lu, Shaoyi Zheng, Yuxuan Xia, Shengjie Wang
- Abstract summary: Diffusion models excel in high-fidelity image generation but face scalability limits due to transformers' quadratic attention complexity. We propose Token Merge with Attention (ToMA), an off-the-shelf method that redesigns token reduction for GPU-aligned efficiency. ToMA reduces SDXL/Flux generation latency by 24%/23%, respectively (with DINO $\Delta < 0.07$), outperforming prior methods.
- Score: 8.079656935981193
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models excel in high-fidelity image generation but face scalability limits due to transformers' quadratic attention complexity. Plug-and-play token reduction methods like ToMeSD and ToFu reduce FLOPs by merging redundant tokens in generated images but rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that negate theoretical speedups when paired with optimized attention implementations (e.g., FlashAttention). To bridge this gap, we propose Token Merge with Attention (ToMA), an off-the-shelf method that redesigns token reduction for GPU-aligned efficiency, with three key contributions: 1) a reformulation of token merge as a submodular optimization problem to select diverse tokens; 2) merge/unmerge as an attention-like linear transformation via GPU-friendly matrix operations; and 3) exploiting latent locality and sequential redundancy (pattern reuse) to minimize overhead. ToMA reduces SDXL/Flux generation latency by 24%/23%, respectively (with DINO $\Delta < 0.07$), outperforming prior methods. This work bridges the gap between theoretical and practical efficiency for transformers in diffusion.
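To make the abstract's second contribution more concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of how merge/unmerge can be expressed as dense matrix multiplies, with a simple greedy diversity-based selection standing in for the submodular step. All function names (`select_diverse_tokens`, `build_merge_matrices`), shapes, and the facility-location-style objective are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def select_diverse_tokens(x: torch.Tensor, k: int) -> torch.Tensor:
    """Greedy facility-location-style selection of k diverse destination tokens.
    x: (N, D) latent tokens; returns indices of shape (k,).
    A simple stand-in for the submodular selection described in the abstract."""
    xn = F.normalize(x, dim=-1)
    sim = xn @ xn.T                                   # (N, N) cosine similarities
    coverage = torch.zeros(x.shape[0], device=x.device)
    selected = []
    for _ in range(k):
        # Marginal gain of adding each candidate destination to the set.
        gain = torch.clamp(sim - coverage, min=0).sum(dim=1)
        if selected:
            gain[torch.tensor(selected, device=x.device)] = float("-inf")
        j = int(gain.argmax())
        selected.append(j)
        coverage = torch.maximum(coverage, sim[j])    # update best coverage per token
    return torch.tensor(selected, device=x.device)

def build_merge_matrices(x: torch.Tensor, dst_idx: torch.Tensor):
    """Merge/unmerge as dense linear maps: merge M (k, N) averages each source
    token into its nearest destination; unmerge U (N, k) broadcasts it back."""
    xn = F.normalize(x, dim=-1)
    assign = (xn @ xn[dst_idx].T).argmax(dim=-1)      # (N,) nearest destination
    N, k = x.shape[0], dst_idx.shape[0]
    U = torch.zeros(N, k, device=x.device)
    U[torch.arange(N, device=x.device), assign] = 1.0
    M = (U / U.sum(dim=0, keepdim=True).clamp(min=1)).T   # rows average assigned tokens
    return M, U

# Usage sketch: run attention on the reduced token set, then map results back.
x = torch.randn(1024, 64)                  # (N, D) latent tokens
dst = select_diverse_tokens(x, k=256)
M, U = build_merge_matrices(x, dst)
x_merged = M @ x                           # (k, D): attention runs on 256 tokens
x_unmerged = U @ x_merged                  # (N, D): broadcast merged results back
```

Because both maps are plain matrix multiplies, they compose with fused attention kernels (e.g., FlashAttention) without the sorting and scattered writes that the abstract identifies as the overhead source in ToMeSD and ToFu.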
Related papers
- ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding [37.86179431483446]
Autoregressive models (ARMs) are hindered by slow sequential inference. We introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency. ReFusion bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.
arXiv Detail & Related papers (2025-12-15T17:41:19Z) - FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding [2.309307613420651]
We introduce two HMR-specific merging strategies: Error-Constrained Layer Merging (ECLM) and Mask-guided Token Merging (Mask-ToMe). Experiments across multiple benchmarks demonstrate that our method achieves up to 2.3x speed-up while slightly improving performance over the baseline.
arXiv Detail & Related papers (2025-10-13T00:23:17Z) - Plug-and-Play Context Feature Reuse for Efficient Masked Generation [36.563229330549284]
Masked generative models (MGMs) have emerged as a powerful framework for image synthesis. We introduce ReCAP (Reused Context-Aware Prediction), a plug-and-play module that accelerates inference in MGMs.
arXiv Detail & Related papers (2025-05-25T10:57:35Z) - Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation [57.56385490252605]
Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. We propose SVG2, a training-free framework that maximizes identification accuracy and minimizes wasted computation.
arXiv Detail & Related papers (2025-05-24T21:30:29Z) - Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction [52.14200610448542]
A transformer has quadratic complexity, leading to high inference costs and latency for long sequences. We propose a simple, novel, and effective procedure for correcting this distributional shift. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.
arXiv Detail & Related papers (2025-05-16T13:48:33Z) - High-Frequency Prior-Driven Adaptive Masking for Accelerating Image Super-Resolution [87.56382172827526]
High-frequency regions are most critical for reconstruction. We propose a training-free adaptive masking module for acceleration. Our method reduces FLOPs by 24-43% for state-of-the-art models.
arXiv Detail & Related papers (2025-05-11T13:18:03Z) - Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models. We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z) - DiTFastAttn: Attention Compression for Diffusion Transformer Models [26.095923502799664]
Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to self-attention operators.
We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT.
Our results show that for image generation, our method reduces up to 76% of the attention FLOPs and achieves up to 1.8x end-to-end speedup at high-resolution (2k x 2k) generation.
arXiv Detail & Related papers (2024-06-12T18:00:08Z) - An Image is Worth 32 Tokens for Reconstruction and Generation [54.24414696392026]
Transformer-based 1-Dimensional Tokenizer (TiTok) is an innovative approach that tokenizes images into 1D latent sequences.
TiTok achieves competitive performance to state-of-the-art approaches.
Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while generating high-quality samples 74x faster.
arXiv Detail & Related papers (2024-06-11T17:59:56Z) - Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
However, their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - EcoFormer: Energy-Saving Attention with Linear Complexity [40.002608785252164]
The Transformer is a transformative framework for modeling sequential data.
We propose a new binarization paradigm customized to high-dimensional softmax attention.
We show that EcoFormer consistently achieves comparable performance with standard attentions.
arXiv Detail & Related papers (2022-09-19T13:28:32Z) - FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire [74.04394069262108]
We propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously.
FastLR achieves a speedup of up to 10.97$\times$ compared with the state-of-the-art lipreading model.
arXiv Detail & Related papers (2020-08-06T08:28:56Z)