Related papers: ToDo: Token Downsampling for Efficient Generation of High-Resolution Images

ToDo: Token Downsampling for Efficient Generation of High-Resolution Images

URL: http://arxiv.org/abs/2402.13573v3
Date: Wed, 8 May 2024 05:09:48 GMT
Title: ToDo: Token Downsampling for Efficient Generation of High-Resolution Images
Authors: Ethan Smith, Nayan Saxena, Aninda Saha,
Abstract summary: This paper investigates the importance of dense attention in generative image models, which often contain redundant features, making them suitable for sparser attention mechanisms. We propose a novel training-free method ToDo that relies on token downsampling of key and value tokens to accelerate Stable Diffusion inference by up to 2x for common sizes and up to 4.5x or more for high resolutions like 2048x2048.
Score: 5.213225264281229
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Attention mechanism has been crucial for image diffusion models, however, their quadratic computational complexity limits the sizes of images we can process within reasonable time and memory constraints. This paper investigates the importance of dense attention in generative image models, which often contain redundant features, making them suitable for sparser attention mechanisms. We propose a novel training-free method ToDo that relies on token downsampling of key and value tokens to accelerate Stable Diffusion inference by up to 2x for common sizes and up to 4.5x or more for high resolutions like 2048x2048. We demonstrate that our approach outperforms previous methods in balancing efficient throughput and fidelity.

Related papers

Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient [52.96232442322824]
Collaborative Decoding (CoDe) is a novel efficient decoding strategy tailored for the Visual Auto-Regressive ( VAR) framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98.
arXiv Detail & Related papers (2024-11-26T15:13:15Z)
Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
We propose an algorithm that enables fast and high-quality generation under arbitrary constraints. During inference, we can interchange between gradient updates computed on the noisy image and updates computed on the final, clean image. Our approach produces results that rival or surpass the state-of-the-art training-free inference approaches.
arXiv Detail & Related papers (2024-10-24T14:52:38Z)
High-Precision Dichotomous Image Segmentation via Probing Diffusion Capacity [69.32473738284374]
We propose DiffDIS, a diffusion-driven segmentation model that taps into the potential of the pre-trained U-Net within diffusion models. By leveraging the robust generalization capabilities and rich, versatile image representation prior to the SD models, we significantly reduce the inference time while preserving high-fidelity, detailed generation. Experiments on the DIS5K dataset demonstrate the superiority of DiffDIS, achieving state-of-the-art results through a streamlined inference process.
arXiv Detail & Related papers (2024-10-14T02:49:23Z)
DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance [11.44012694656102]
Large-scale generative models, such as text-to-image diffusion models, have garnered widespread attention across diverse domains. Existing large-scale diffusion models are confined to generating images of up to 1K resolution. We propose a novel progressive approach that fully utilizes generated low-resolution images to guide the generation of higher-resolution images.
arXiv Detail & Related papers (2024-06-26T16:10:31Z)
Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models [26.926712014346432]
This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization. Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, setting new state-of-the-art FID scores of 1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512.
arXiv Detail & Related papers (2024-06-13T17:59:58Z)
Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention [62.671435607043875]
Research indicates that text-to-image diffusion models replicate images from their training data, raising tremendous concerns about potential copyright infringement and privacy risks. We reveal that during memorization, the cross-attention tends to focus disproportionately on the embeddings of specific tokens. We introduce an innovative approach to detect and mitigate memorization in diffusion models.
arXiv Detail & Related papers (2024-03-17T01:27:00Z)
QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning [52.157939524815866]
In this paper, we empirically unravel three properties in quantized diffusion models that compromise the efficacy of current methods. We identify two critical types of quantized layers: those holding vital temporal information and those sensitive to reduced bit-width. Our method is evaluated over three high-resolution image generation tasks and achieves state-of-the-art performance under various bit-width settings.
arXiv Detail & Related papers (2024-02-06T03:39:44Z)
Diffusion Models Without Attention [110.5623058129782]
Diffusion State Space Model (DiffuSSM) is an architecture that supplants attention mechanisms with a more scalable state space model backbone. Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward.
arXiv Detail & Related papers (2023-11-30T05:15:35Z)
HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models [13.68666823175341]
HiDiffusion is a tuning-free higher-resolution framework for image synthesis. RAU-Net dynamically adjusts the feature map size to resolve object duplication. MSW-MSA engages optimized window attention to reduce computations.
arXiv Detail & Related papers (2023-11-29T11:01:38Z)
ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models [126.35334860896373]
We investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. We propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference.
arXiv Detail & Related papers (2023-10-11T17:52:39Z)
Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models [6.821399706256863]
W"urstchen is a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness. A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation.
arXiv Detail & Related papers (2023-06-01T13:00:53Z)
Projected GANs Converge Faster [50.23237734403834]
Generative Adversarial Networks (GANs) produce high-quality images but are challenging to train. We make significant headway on these issues by projecting generated and real samples into a fixed, pretrained feature space. Our Projected GAN improves image quality, sample efficiency, and convergence speed.
arXiv Detail & Related papers (2021-11-01T15:11:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.