Importance-based Token Merging for Diffusion Models
- URL: http://arxiv.org/abs/2411.16720v1
- Date: Sat, 23 Nov 2024 02:01:49 GMT
- Title: Importance-based Token Merging for Diffusion Models
- Authors: Haoyu Wu, Jingyi Xu, Hieu Le, Dimitris Samaras
- Abstract summary: Diffusion models excel at high-quality image and video generation.
A powerful way to speed them up is by merging similar tokens for faster computation.
We show that preserving important tokens during merging significantly improves sample quality.
- Score: 41.94334394794811
- Abstract: Diffusion models excel at high-quality image and video generation. However, a major drawback is their high latency. A simple yet powerful way to speed them up is by merging similar tokens for faster computation, though this can result in some quality loss. In this paper, we demonstrate that preserving important tokens during merging significantly improves sample quality. Notably, the importance of each token can be reliably determined using the classifier-free guidance magnitude, as this measure is strongly correlated with the conditioning input and corresponds to output fidelity. Since classifier-free guidance incurs no additional computational cost and requires no extra modules, our method can be easily integrated into most diffusion-based frameworks. Experiments show that our approach significantly outperforms the baseline across various applications, including text-to-image synthesis, multi-view image generation, and video generation.
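The idea in the abstract can be illustrated with a small sketch. The abstract does not spell out the merging procedure, so everything below is an assumption made for illustration: importance is taken as the per-token L2 norm of the difference between the conditional and unconditional predictions, the `num_keep` most important tokens are preserved, and each remaining token is averaged into its most cosine-similar preserved token. The function names `guidance_magnitude` and `merge_tokens` are hypothetical and do not come from the paper.

```python
import math

def guidance_magnitude(eps_cond, eps_uncond):
    """Per-token importance (assumed form): L2 norm of cond minus uncond prediction."""
    return [
        math.sqrt(sum((c - u) ** 2 for c, u in zip(tok_c, tok_u)))
        for tok_c, tok_u in zip(eps_cond, eps_uncond)
    ]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def merge_tokens(tokens, importance, num_keep):
    """Preserve the num_keep most important tokens and average every other
    token into its most similar preserved token (illustrative rule only)."""
    order = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    keep = sorted(order[:num_keep])   # indices of preserved tokens
    rest = order[num_keep:]           # indices of tokens to be merged away
    sums = {i: list(tokens[i]) for i in keep}
    counts = {i: 1 for i in keep}
    for j in rest:
        best = max(keep, key=lambda i: cosine(tokens[i], tokens[j]))
        sums[best] = [s + x for s, x in zip(sums[best], tokens[j])]
        counts[best] += 1
    merged = [[s / counts[i] for s in sums[i]] for i in keep]
    return merged, keep

# Toy usage: four 2-D tokens, two of which receive strong guidance.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
eps_cond = [[0.5, 0.0], [0.1, 0.0], [0.0, 0.6], [0.0, 0.1]]
eps_uncond = [[0.0, 0.0] for _ in tokens]
importance = guidance_magnitude(eps_cond, eps_uncond)
merged, keep = merge_tokens(tokens, importance, num_keep=2)
print(keep, merged)
```

In this toy case, tokens 0 and 2 have the largest guidance magnitude and are preserved, and tokens 1 and 3 are folded into them. In a real pipeline both predictions are already computed whenever classifier-free guidance is used, which is why the abstract describes the importance signal as coming at no additional cost.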
Related papers
- Flowing from Words to Pixels: A Framework for Cross-Modality Evolution [14.57591222028278]
We present a general and simple framework, CrossFlow, for cross-modal flow matching.
We show the importance of applying Variational Encoders to the input data, and introduce a method to enable classifier-free guidance.
To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state-of-the-art for various cross-modal / intramodal mapping tasks.
arXiv Detail & Related papers (2024-12-19T18:59:56Z) - Parallelized Autoregressive Visual Generation [65.9579525736345]
We propose a simple yet effective approach for parallelized autoregressive visual generation.
Our method achieves a 3.6x speedup with comparable quality and up to 9.5x speedup with minimal quality degradation across both image and video generation tasks.
arXiv Detail & Related papers (2024-12-19T17:59:54Z) - Efficient Generative Modeling with Residual Vector Quantization-Based Tokens [5.949779668853557]
ResGen is an efficient RVQ-based discrete diffusion model that generates high-fidelity samples without compromising sampling speed.
We validate the efficacy and generalizability of the proposed method on two challenging tasks: conditional image generation on ImageNet 256x256 and zero-shot text-to-speech synthesis.
As we scale the depth of RVQ, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models.
arXiv Detail & Related papers (2024-12-13T15:31:17Z) - Video Token Merging for Long-form Video Understanding [17.59960070514554]
We propose a learnable video token merging algorithm that dynamically merges tokens based on their saliency.
Our approach significantly reduces memory costs by 84% and boosts throughput by approximately 6.89 times compared to baseline algorithms.
arXiv Detail & Related papers (2024-10-31T09:55:32Z) - Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally.
This neglects their different informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z) - Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experimental results show that our method consistently outperforms other baselines.
arXiv Detail & Related papers (2024-03-11T02:18:27Z) - Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference [95.42299246592756]
We study the UNet encoder and empirically analyze the encoder features.
We find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps.
We validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation.
arXiv Detail & Related papers (2023-12-15T08:46:43Z) - Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
However, their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.