DiTFastAttn: Attention Compression for Diffusion Transformer Models
- URL: http://arxiv.org/abs/2406.08552v2
- Date: Fri, 18 Oct 2024 12:05:21 GMT
- Title: DiTFastAttn: Attention Compression for Diffusion Transformer Models
- Authors: Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang
- Abstract summary: Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to self-attention operators.
We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT.
Our results show that for image generation, our method reduces up to 76% of the attention FLOPs and achieves up to 1.8x end-to-end speedup at high-resolution (2k x 2k) generation.
- Score: 26.095923502799664
- Abstract: Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between the attention outputs of neighboring steps; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. We propose three techniques to reduce these redundancies: (1) Window Attention with Residual Sharing to reduce spatial redundancy; (2) Attention Sharing across Timesteps to exploit the similarity between steps; (3) Attention Sharing across CFG to skip redundant computations during conditional generation. We apply DiTFastAttn to DiT, PixArt-Sigma for image generation tasks, and OpenSora for video generation tasks. Our results show that for image generation, our method reduces up to 76% of the attention FLOPs and achieves up to 1.8x end-to-end speedup at high-resolution (2k x 2k) generation.
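The three techniques can be sketched at the level of a single attention layer. The PyTorch sketch below is illustrative only, not the authors' implementation: the class name, flags, window size, and caching policy are assumptions, and which technique is active at each layer and timestep would in practice be chosen by the post-training compression plan rather than by caller-supplied flags.

```python
import torch
import torch.nn.functional as F


class FastAttnLayer(torch.nn.Module):
    """Illustrative sketch of the three DiTFastAttn techniques at one attention
    layer. Not the official implementation; names, flags, and the caching
    policy are assumptions made for illustration, and the caches assume a
    consistent batch layout across denoising steps."""

    def __init__(self, dim, num_heads, window=512):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)
        self.num_heads = num_heads
        self.window = window
        self.cached_out = None       # previous step's output (temporal redundancy)
        self.cached_residual = None  # full-attn minus window-attn (spatial redundancy)

    def _window_attention(self, q, k, v):
        # Local attention inside non-overlapping windows of `window` tokens
        # (assumes the sequence length is divisible by the window size).
        B, H, N, D = q.shape
        w = self.window
        qw, kw, vw = (t.reshape(B, H, N // w, w, D) for t in (q, k, v))
        out = F.scaled_dot_product_attention(qw, kw, vw)
        return out.reshape(B, H, N, D)

    def forward(self, x, use_wars=False, share_timestep=False, share_cfg=False):
        B, N, C = x.shape
        if share_timestep and self.cached_out is not None:
            # Attention Sharing across Timesteps: reuse the previous step's output.
            return self.cached_out

        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, N, self.num_heads, -1).transpose(1, 2) for t in (q, k, v))

        if share_cfg:
            # Attention Sharing across CFG: assume the batch is [conditional; unconditional],
            # run attention only on the conditional half and copy its result.
            q, k, v = q[: B // 2], k[: B // 2], v[: B // 2]

        if use_wars:
            # Window Attention with Residual Sharing: cheap window attention plus a
            # cached (full - window) residual from an earlier full-attention step.
            out = self._window_attention(q, k, v)
            if self.cached_residual is not None:
                out = out + self.cached_residual
        else:
            out = F.scaled_dot_product_attention(q, k, v)
            self.cached_residual = out - self._window_attention(q, k, v)

        if share_cfg:
            out = torch.cat([out, out], dim=0)  # broadcast to the unconditional half

        out = self.proj(out.transpose(1, 2).reshape(B, N, C))
        self.cached_out = out
        return out
```

The common thread is that window attention, output reuse across neighboring timesteps, and reuse across the conditional/unconditional CFG branches all avoid recomputing the full quadratic attention, which is where the reported FLOP reduction comes from.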
Related papers
- Hadamard Attention Recurrent Transformer: A Strong Baseline for Stereo Matching Transformer [54.97718043685824]
We present the Hadamard Attention Recurrent Stereo Transformer (HART).
For faster inference, we present a Hadamard product paradigm for the attention mechanism, achieving linear computational complexity.
We design a Dense Attention Kernel (DAK) to amplify the differences between relevant and irrelevant feature responses.
We propose MKOI to capture both global and local information through the interleaving of large and small kernel convolutions.
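The linear complexity comes from replacing the N x N softmax score matrix with element-wise (Hadamard) interactions. The snippet below is a generic sketch of that complexity argument only; it is not HART's exact formulation, and the softplus stands in for the learned Dense Attention Kernel.

```python
import torch
import torch.nn.functional as F


def hadamard_attention(q, k, v):
    """Generic element-wise (Hadamard) attention sketch with O(N) cost.

    Illustrates why avoiding the (N, N) score matrix is linear in the token
    count; not claimed to be HART's exact formulation. q, k, v: (B, N, C).
    """
    weights = F.softplus(q * k)  # per-token query-key interaction, no (N, N) matrix
    return weights * v           # cost grows linearly with the number of tokens N
```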
arXiv Detail & Related papers (2025-01-02T02:51:16Z) - Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers [55.87192133758051]
Diffusion Transformers (DiTs) have achieved state-of-the-art (SOTA) image generation quality but suffer from high latency and memory inefficiency.
We propose DiffRatio-MoD, a dynamic DiT inference framework with differentiable compression ratios.
arXiv Detail & Related papers (2024-12-22T02:04:17Z) - Dynamic Diffusion Transformer [67.13876021157887]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs.
We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation.
With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73x, and achieves a competitive FID score of 2.07 on ImageNet.
arXiv Detail & Related papers (2024-10-04T14:14:28Z) - Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task [42.422925759342874]
We propose the Proxy-Tokenized Diffusion Transformer (PT-DiT) to model global visual information efficiently.
Within each transformer block, we compute an averaging token from each spatial-temporal window to serve as a proxy token for that region.
We also introduce window and shift window attention to address the limitations in detail modeling caused by the sparse attention mechanism.
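A minimal sketch of the proxy-token computation (window-mean pooling) is shown below; the tensor layout and window sizes are assumptions for illustration rather than PT-DiT's actual configuration.

```python
import torch


def proxy_tokens(x, t_win, h_win, w_win):
    """Average the tokens inside each spatio-temporal window to obtain one
    proxy token per window. The (B, T, H, W, C) layout and window sizes are
    illustrative assumptions, not PT-DiT's exact configuration."""
    B, T, H, W, C = x.shape
    x = x.view(B, T // t_win, t_win, H // h_win, h_win, W // w_win, w_win, C)
    proxies = x.mean(dim=(2, 4, 6))      # one averaged token per window
    return proxies.reshape(B, -1, C)     # (B, num_windows, C)
```

Presumably, global self-attention then runs over the much smaller set of proxy tokens and its result is propagated back to the full token set, while the window and shift window attention recover local detail.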
arXiv Detail & Related papers (2024-09-06T03:13:45Z) - Mutual Information-driven Triple Interaction Network for Efficient Image Dehazing [54.168567276280505]
We propose a novel Mutual Information-driven Triple interaction Network (MITNet) for image dehazing.
The first stage, named amplitude-guided haze removal, aims to recover the amplitude spectrum of the hazy images for haze removal.
The second stage, named phase-guided structure refinement, learns the transformation and refinement of the phase spectrum.
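The amplitude/phase split that the two stages operate on is a standard Fourier decomposition; the snippet below shows that decomposition generically and is not MITNet's network code.

```python
import torch


def amplitude_phase(img):
    """Split an image into its Fourier amplitude and phase spectra.
    img: (B, C, H, W) real tensor."""
    spec = torch.fft.fft2(img)
    amplitude = spec.abs()    # stage 1 (amplitude-guided haze removal) targets this
    phase = spec.angle()      # stage 2 (phase-guided structure refinement) targets this
    return amplitude, phase


def recombine(amplitude, phase):
    """Rebuild an image from (possibly restored) amplitude and phase spectra."""
    spec = torch.polar(amplitude, phase)  # amplitude * exp(i * phase)
    return torch.fft.ifft2(spec).real
```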
arXiv Detail & Related papers (2023-08-14T08:23:58Z) - Scalable Adaptive Computation for Iterative Generation [13.339848496653465]
Recurrent Interface Networks (RINs) are an attention-based architecture that decouples its core computation from the dimensionality of the data.
RINs focus the bulk of computation on a set of latent tokens, using cross-attention to read and write information between latent and data tokens.
RINs yield state-of-the-art pixel diffusion models for image and video generation, scaling to 1024x1024 images without cascades or guidance.
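The read/compute/write pattern can be sketched with standard cross-attention; the block below is a simplified illustration, with the layer structure, residual connections, and head count assumed rather than taken from the paper.

```python
import torch


class RINBlock(torch.nn.Module):
    """Simplified sketch of one Recurrent Interface Network block: latents
    'read' from data tokens, compute among themselves, then 'write' back.
    The layer structure and dimensions are illustrative assumptions."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.read = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.compute = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.write = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, latents, tokens):
        # Read: the (few) latents attend to the (many) data tokens.
        latents = latents + self.read(latents, tokens, tokens)[0]
        # Compute: the bulk of computation stays on the latents.
        latents = latents + self.compute(latents, latents, latents)[0]
        # Write: data tokens attend to the updated latents.
        tokens = tokens + self.write(tokens, latents, latents)[0]
        return latents, tokens
```

Because the latent set is much smaller than the data token set, most of the compute is spent on the latents rather than on the large image or video grid.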
arXiv Detail & Related papers (2022-12-22T18:55:45Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - RD-Optimized Trit-Plane Coding of Deep Compressed Image Latent Tensors [40.86513649546442]
DPICT is the first learning-based image codec supporting fine granular scalability.
In this paper, we describe how to implement two key components of DPICT efficiently: trit-plane slicing and RD-prioritized transmission.
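Trit-plane slicing itself is a base-3 digit decomposition of the quantized latent magnitudes; the toy sketch below illustrates that decomposition only, not DPICT's actual codec or its RD-prioritized transmission order.

```python
import numpy as np


def trit_planes(latent, num_planes):
    """Decompose non-negative integer latent values into base-3 digit planes,
    most significant trit first. A toy illustration of trit-plane slicing,
    not DPICT's encoder."""
    planes = []
    for p in reversed(range(num_planes)):
        planes.append((latent // 3 ** p) % 3)  # the p-th trit of every element
    return planes


x = np.array([[7, 2], [25, 0]])
print(trit_planes(x, num_planes=3))  # 7 -> (0, 2, 1) in base 3, 25 -> (2, 2, 1)
```

Transmitting planes from most to least significant refines the reconstruction progressively, which is what enables fine granular scalability.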
arXiv Detail & Related papers (2022-03-25T06:33:16Z) - FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire [74.04394069262108]
We propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously.
FastLR achieves a speedup of up to 10.97x compared with the state-of-the-art lipreading model.
arXiv Detail & Related papers (2020-08-06T08:28:56Z)