DiTFastAttn: Attention Compression for Diffusion Transformer Models
- URL: http://arxiv.org/abs/2406.08552v2
- Date: Fri, 18 Oct 2024 12:05:21 GMT
- Title: DiTFastAttn: Attention Compression for Diffusion Transformer Models
- Authors: Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang
- Abstract summary: Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to self-attention operators.
We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT.
Our results show that for image generation, our method reduces up to 76% of the attention FLOPs and achieves up to 1.8x end-to-end speedup at high-resolution (2k x 2k) generation.
- Score: 26.095923502799664
- Abstract: Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between the attention outputs of neighboring steps; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. We propose three techniques to reduce these redundancies: (1) Window Attention with Residual Sharing to reduce spatial redundancy; (2) Attention Sharing across Timesteps to exploit the similarity between steps; (3) Attention Sharing across CFG to skip redundant computations during conditional generation. We apply DiTFastAttn to DiT, PixArt-Sigma for image generation tasks, and OpenSora for video generation tasks. Our results show that for image generation, our method reduces up to 76% of the attention FLOPs and achieves up to 1.8x end-to-end speedup at high-resolution (2k x 2k) generation.
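The three techniques can be sketched at the level of a single attention layer. The PyTorch sketch below is illustrative only, not the authors' implementation: the class name, flags, window size, and caching policy are assumptions, and which technique is active at each layer and timestep would in practice be chosen by the post-training compression plan rather than by caller-supplied flags.

```python
import torch
import torch.nn.functional as F


class FastAttnLayer(torch.nn.Module):
    """Illustrative sketch of the three DiTFastAttn techniques at one attention
    layer. Not the official implementation; names, flags, and the caching
    policy are assumptions made for illustration, and the caches assume a
    consistent batch layout across denoising steps."""

    def __init__(self, dim, num_heads, window=512):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)
        self.num_heads = num_heads
        self.window = window
        self.cached_out = None       # previous step's output (temporal redundancy)
        self.cached_residual = None  # full-attn minus window-attn (spatial redundancy)

    def _window_attention(self, q, k, v):
        # Local attention inside non-overlapping windows of `window` tokens
        # (assumes the sequence length is divisible by the window size).
        B, H, N, D = q.shape
        w = self.window
        qw, kw, vw = (t.reshape(B, H, N // w, w, D) for t in (q, k, v))
        out = F.scaled_dot_product_attention(qw, kw, vw)
        return out.reshape(B, H, N, D)

    def forward(self, x, use_wars=False, share_timestep=False, share_cfg=False):
        B, N, C = x.shape
        if share_timestep and self.cached_out is not None:
            # Attention Sharing across Timesteps: reuse the previous step's output.
            return self.cached_out

        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, N, self.num_heads, -1).transpose(1, 2) for t in (q, k, v))

        if share_cfg:
            # Attention Sharing across CFG: assume the batch is [conditional; unconditional],
            # run attention only on the conditional half and copy its result.
            q, k, v = q[: B // 2], k[: B // 2], v[: B // 2]

        if use_wars:
            # Window Attention with Residual Sharing: cheap window attention plus a
            # cached (full - window) residual from an earlier full-attention step.
            out = self._window_attention(q, k, v)
            if self.cached_residual is not None:
                out = out + self.cached_residual
        else:
            out = F.scaled_dot_product_attention(q, k, v)
            self.cached_residual = out - self._window_attention(q, k, v)

        if share_cfg:
            out = torch.cat([out, out], dim=0)  # broadcast to the unconditional half

        out = self.proj(out.transpose(1, 2).reshape(B, N, C))
        self.cached_out = out
        return out
```

The common thread is that window attention, output reuse across neighboring timesteps, and reuse across the conditional/unconditional CFG branches all avoid recomputing the full quadratic attention, which is where the reported FLOP reduction comes from.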
Related papers
- Hadamard Attention Recurrent Transformer: A Strong Baseline for Stereo Matching Transformer [54.97718043685824]
We present the Hadamard Attention Recurrent Stereo Transformer (HART).
For faster inference, we present a Hadamard product paradigm for the attention mechanism, achieving linear computational complexity.
We design a Dense Attention Kernel (DAK) to amplify the differences between relevant and irrelevant feature responses.
We propose MKOI to capture both global and local information through the interleaving of large and small kernel convolutions.
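The linear complexity comes from replacing the N x N softmax score matrix with element-wise (Hadamard) interactions. The snippet below is a generic sketch of that complexity argument only; it is not HART's exact formulation, and the softplus stands in for the learned Dense Attention Kernel.

```python
import torch
import torch.nn.functional as F


def hadamard_attention(q, k, v):
    """Generic element-wise (Hadamard) attention sketch with O(N) cost.

    Illustrates why avoiding the (N, N) score matrix is linear in the token
    count; not claimed to be HART's exact formulation. q, k, v: (B, N, C).
    """
    weights = F.softplus(q * k)  # per-token query-key interaction, no (N, N) matrix
    return weights * v           # cost grows linearly with the number of tokens N
```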
arXiv Detail & Related papers (2025-01-02T02:51:16Z) - Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers [55.87192133758051]
Diffusion Transformers (DiTs) have achieved state-of-the-art (SOTA) image generation quality but suffer from high latency and memory inefficiency.
We propose DiffRatio-MoD, a dynamic DiT inference framework with differentiable compression ratios.
arXiv Detail & Related papers (2024-12-22T02:04:17Z) - Dynamic Diffusion Transformer [67.13876021157887]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs.
We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation.
With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73x, and achieves a competitive FID score of 2.07 on ImageNet.
arXiv Detail & Related papers (2024-10-04T14:14:28Z) - Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task [42.422925759342874]
We propose the Proxy-Tokenized Diffusion Transformer (PT-DiT) to model global visual information efficiently.
Within each transformer block, we compute an averaging token from each spatial-temporal window to serve as a proxy token for that region.
We also introduce window and shift window attention to address the limitations in detail modeling caused by the sparse attention mechanism.
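A minimal sketch of the proxy-token computation (window-mean pooling) is shown below; the tensor layout and window sizes are assumptions for illustration rather than PT-DiT's actual configuration.

```python
import torch


def proxy_tokens(x, t_win, h_win, w_win):
    """Average the tokens inside each spatio-temporal window to obtain one
    proxy token per window. The (B, T, H, W, C) layout and window sizes are
    illustrative assumptions, not PT-DiT's exact configuration."""
    B, T, H, W, C = x.shape
    x = x.view(B, T // t_win, t_win, H // h_win, h_win, W // w_win, w_win, C)
    proxies = x.mean(dim=(2, 4, 6))      # one averaged token per window
    return proxies.reshape(B, -1, C)     # (B, num_windows, C)
```

Presumably, global self-attention then runs over the much smaller set of proxy tokens and its result is propagated back to the full token set, while the window and shift window attention recover local detail.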
arXiv Detail & Related papers (2024-09-06T03:13:45Z) - Mutual Information-driven Triple Interaction Network for Efficient Image Dehazing [54.168567276280505]
We propose a novel Mutual Information-driven Triple interaction Network (MITNet) for image dehazing.
The first stage, named amplitude-guided haze removal, aims to recover the amplitude spectrum of the hazy images for haze removal.
The second stage, named phase-guided structure refinement, learns the transformation and refinement of the phase spectrum.
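The amplitude/phase split that the two stages operate on is a standard Fourier decomposition; the snippet below shows that decomposition generically and is not MITNet's network code.

```python
import torch


def amplitude_phase(img):
    """Split an image into its Fourier amplitude and phase spectra.
    img: (B, C, H, W) real tensor."""
    spec = torch.fft.fft2(img)
    amplitude = spec.abs()    # stage 1 (amplitude-guided haze removal) targets this
    phase = spec.angle()      # stage 2 (phase-guided structure refinement) targets this
    return amplitude, phase


def recombine(amplitude, phase):
    """Rebuild an image from (possibly restored) amplitude and phase spectra."""
    spec = torch.polar(amplitude, phase)  # amplitude * exp(i * phase)
    return torch.fft.ifft2(spec).real
```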
arXiv Detail & Related papers (2023-08-14T08:23:58Z) - Scalable Adaptive Computation for Iterative Generation [13.339848496653465]
Recurrent Interface Networks (RINs) are an attention-based architecture that decouples its core computation from the dimensionality of the data.
RINs focus the bulk of computation on a set of latent tokens, using cross-attention to read and write information between latent and data tokens.
RINs yield state-of-the-art pixel diffusion models for image and video generation, scaling to 1024x1024 images without cascades or guidance.
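The read/compute/write pattern can be sketched with standard cross-attention; the block below is a simplified illustration, with the layer structure, residual connections, and head count assumed rather than taken from the paper.

```python
import torch


class RINBlock(torch.nn.Module):
    """Simplified sketch of one Recurrent Interface Network block: latents
    'read' from data tokens, compute among themselves, then 'write' back.
    The layer structure and dimensions are illustrative assumptions."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.read = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.compute = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.write = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, latents, tokens):
        # Read: the (few) latents attend to the (many) data tokens.
        latents = latents + self.read(latents, tokens, tokens)[0]
        # Compute: the bulk of computation stays on the latents.
        latents = latents + self.compute(latents, latents, latents)[0]
        # Write: data tokens attend to the updated latents.
        tokens = tokens + self.write(tokens, latents, latents)[0]
        return latents, tokens
```

Because the latent set is much smaller than the data token set, most of the compute is spent on the latents rather than on the large image or video grid.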
arXiv Detail & Related papers (2022-12-22T18:55:45Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - RD-Optimized Trit-Plane Coding of Deep Compressed Image Latent Tensors [40.86513649546442]
DPICT is the first learning-based image codec supporting fine granular scalability.
In this paper, we describe how to implement two key components of DPICT efficiently: trit-plane slicing and RD-prioritized transmission.
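Trit-plane slicing itself is a base-3 digit decomposition of the quantized latent magnitudes; the toy sketch below illustrates that decomposition only, not DPICT's actual codec or its RD-prioritized transmission order.

```python
import numpy as np


def trit_planes(latent, num_planes):
    """Decompose non-negative integer latent values into base-3 digit planes,
    most significant trit first. A toy illustration of trit-plane slicing,
    not DPICT's encoder."""
    planes = []
    for p in reversed(range(num_planes)):
        planes.append((latent // 3 ** p) % 3)  # the p-th trit of every element
    return planes


x = np.array([[7, 2], [25, 0]])
print(trit_planes(x, num_planes=3))  # 7 -> (0, 2, 1) in base 3, 25 -> (2, 2, 1)
```

Transmitting planes from most to least significant refines the reconstruction progressively, which is what enables fine granular scalability.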
arXiv Detail & Related papers (2022-03-25T06:33:16Z) - FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire [74.04394069262108]
We propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously.
FastLR achieves a speedup of up to 10.97x compared with the state-of-the-art lipreading model.
arXiv Detail & Related papers (2020-08-06T08:28:56Z)