Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching
- URL: http://arxiv.org/abs/2406.01733v2
- Date: Sat, 16 Nov 2024 07:43:28 GMT
- Title: Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching
- Authors: Xinyin Ma, Gongfan Fang, Michael Bi Mi, Xinchao Wang
- Abstract summary: We make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer, through a caching mechanism, can be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed.
- Score: 56.286064975443026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Transformers have recently demonstrated unprecedented generative capabilities for various tasks. The encouraging results, however, come with the cost of slow inference, since each denoising step requires inference on a transformer model with a large scale of parameters. In this study, we make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer, through introducing a caching mechanism, can be readily removed even without updating the model parameters. In the case of U-ViT-H/2, for example, we may remove up to 93.68% of the computation in the cache steps (46.84% for all steps), with less than 0.01 drop in FID. To achieve this, we introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers. Specifically, by leveraging the identical structure of layers in transformers and the sequential nature of diffusion, we explore redundant computations between timesteps by treating each layer as the fundamental unit for caching. To address the challenge of the exponential search space in deep models for identifying layers to cache and remove, we propose a novel differentiable optimization objective. An input-invariant yet timestep-variant router is then optimized, which can finally produce a static computation graph. Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed. Code is available at https://github.com/horseee/learning-to-cache
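To make the mechanism described in the abstract concrete, the following is a minimal, hypothetical sketch of the L2C idea in PyTorch: a learnable, input-invariant but timestep-variant router decides, per layer and per denoising step, whether to reuse the cached residual from the previous step or recompute the layer. This is not the released implementation (see the repository linked above); names such as `ToyBlock`, `L2CBackbone`, and `router_logits` are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Stand-in for a single DiT/U-ViT transformer layer."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        return x + self.net(x)


class L2CBackbone(nn.Module):
    """Toy backbone with a timestep-variant, input-invariant caching router."""
    def __init__(self, dim=64, n_layers=8, n_steps=10):
        super().__init__()
        self.blocks = nn.ModuleList(ToyBlock(dim) for _ in range(n_layers))
        # One learnable logit per (timestep, layer); independent of the input.
        self.router_logits = nn.Parameter(torch.zeros(n_steps, n_layers))

    def forward(self, x, t, cache, hard=False):
        """`cache[l]` holds layer l's residual from the previous denoising step."""
        new_cache = []
        for l, block in enumerate(self.blocks):
            beta = torch.sigmoid(self.router_logits[t, l])
            if hard:  # inference: threshold -> a static computation graph per timestep
                beta = (beta > 0.5).float()
            if hard and beta.item() == 0.0 and cache[l] is not None:
                residual = cache[l]            # layer skipped: reuse the cached residual
            else:
                fresh = block(x) - x           # recompute this layer's residual
                residual = fresh if cache[l] is None else beta * fresh + (1.0 - beta) * cache[l]
            x = x + residual
            new_cache.append(residual.detach())
        return x, new_cache


# Usage sketch: a 10-step denoising loop that carries per-layer residuals across steps.
model = L2CBackbone()
x = torch.randn(4, 16, 64)                     # (batch, tokens, dim)
cache = [None] * len(model.blocks)
for t in range(10):
    x, cache = model(x, t, cache, hard=True)
```

During training, the relaxed router weight interpolates between the cached and freshly computed residuals, so the caching decision stays differentiable; at inference, thresholding the router yields a fixed set of skipped layers per timestep, i.e., a static computation graph.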
Related papers
- SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching [75.02865981328509]
Caching reduces computation by reusing previously computed model outputs across timesteps.
We propose Sensitivity-Aware Caching (SenCache), a dynamic caching policy that adaptively selects caching timesteps on a per-sample basis.
SenCache achieves better visual quality than existing caching methods under similar computational budgets.
arXiv Detail & Related papers (2026-02-27T17:36:09Z) - LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference [0.0]
Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-time and large-scale deployment.
We present LLMCache, a novel layer-wise caching framework that accelerates transformer inference by reusing intermediate activations based on semantic similarity of input sequences.
arXiv Detail & Related papers (2025-12-18T18:18:57Z) - InvarDiff: Cross-Scale Invariance Caching for Accelerated Diffusion Models [2.6735992385049663]
InvarDiff is a training-free acceleration method that exploits the relative temporal invariance across timestep-scale and layer-scale.
Experiments show that InvarDiff achieves 2-3× end-to-end speed-ups with minimal impact on standard quality metrics.
arXiv Detail & Related papers (2025-11-29T02:34:23Z) - Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching [11.75972316736487]
We introduce Cluster-Driven Feature Caching (ClusCa) to accelerate diffusion transformers.
ClusCa performs spatial clustering on tokens in each timestep, computes only one token in each cluster, and propagates its information to all the other tokens.
Experiments on DiT, FLUX and HunyuanVideo demonstrate its effectiveness in both text-to-image and text-to-video generation.
arXiv Detail & Related papers (2025-09-12T14:53:45Z) - DiCache: Let Diffusion Model Determine Its Own Cache [62.954717254728166]
DiCache is a training-free adaptive caching strategy for accelerating diffusion models at runtime.
An Online Probe Profiling Scheme leverages a shallow-layer online probe to obtain an on-the-fly indicator of the caching error in real time.
Dynamic Cache Trajectory Alignment approximates the deep-layer feature output from multi-step historical caches.
arXiv Detail & Related papers (2025-08-24T13:30:00Z) - FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation [46.57781555466333]
Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative structure and deep transformer stacks.
FastCache is a hidden-state-level caching and compression framework that accelerates DiT inference.
Empirical evaluations across multiple DiT variants demonstrate substantial reductions in latency and memory usage.
arXiv Detail & Related papers (2025-05-26T05:58:49Z) - Accelerating Diffusion Transformer via Increment-Calibrated Caching with Channel-Aware Singular Value Decomposition [4.0594792247165]
Diffusion transformer (DiT) models have achieved remarkable success in image generation.
We propose increment-calibrated caching, a training-free method for DiT acceleration.
Our method eliminates more than 45% of the computation and improves IS by 12, at the cost of a FID increase of less than 0.06.
arXiv Detail & Related papers (2025-05-09T06:56:17Z) - Exposure Bias Reduction for Enhancing Diffusion Transformer Feature Caching [7.393824353099595]
Diffusion Transformer (DiT) has exhibited impressive generation capabilities but faces great challenges due to its high computational complexity.
We analyze the impact of caching on the SNR of the diffusion process.
We introduce EB-Cache, a joint cache strategy that aligns caching with the non-exposure-bias diffusion process.
arXiv Detail & Related papers (2025-03-10T09:49:18Z) - Accelerating Diffusion Transformer via Gradient-Optimized Cache [18.32157920050325]
Feature caching has emerged as an effective strategy to accelerate diffusion transformer (DiT) sampling through temporal feature reuse.
It is a challenging problem, since progressive error accumulation from cached blocks significantly degrades generation quality.
We propose the Gradient-Optimized Cache (GOC) with two key innovations.
GOC achieves IS 216.28 (26.3% higher) and FID 3.907 (43% lower) compared to baseline DiT, while maintaining identical computational costs.
arXiv Detail & Related papers (2025-03-07T05:31:47Z) - CacheQuant: Comprehensively Accelerated Diffusion Models [3.78219736760145]
CacheQuant is a novel training-free paradigm that comprehensively accelerates diffusion models by jointly optimizing model caching and quantization techniques.
Experimental results show that CacheQuant achieves a 5.18× speedup and 4× compression for Stable Diffusion on MS-COCO, with only a 0.02 loss in CLIP score.
arXiv Detail & Related papers (2025-03-03T09:04:51Z) - LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers [79.07412045476872]
Diffusion Transformers have emerged as the preeminent models for a wide array of generative tasks.
We show that performing the full computation of the model at each diffusion step is unnecessary, as some computations can be skipped by lazily reusing the results of previous steps.
We propose a lazy learning framework that efficiently leverages cached results from earlier steps to skip redundant computations.
arXiv Detail & Related papers (2024-12-17T01:12:35Z) - SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers [4.7170474122879575]
Diffusion Transformers (DiT) have emerged as powerful generative models for various tasks, including image, video, and speech synthesis.
We introduce SmoothCache, a model-agnostic inference acceleration technique for DiT architectures.
Our experiments demonstrate that SmoothCache achieves 8% to 71% speedup while maintaining or even improving generation quality across diverse modalities.
arXiv Detail & Related papers (2024-11-15T16:24:02Z) - Dynamic Diffusion Transformer [67.13876021157887]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs.
We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation.
With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73×, and achieves a competitive FID score of 2.07 on ImageNet.
arXiv Detail & Related papers (2024-10-04T14:14:28Z) - Token Caching for Diffusion Transformer Acceleration [30.437462937127773]
TokenCache is a novel post-training acceleration method for diffusion transformers.
It reduces redundant computations among tokens across inference steps.
We show that TokenCache achieves an effective trade-off between generation quality and inference speed for diffusion transformers.
arXiv Detail & Related papers (2024-09-27T08:05:34Z) - FORA: Fast-Forward Caching in Diffusion Transformer Acceleration [39.51519525071639]
Diffusion transformers (DiT) have become the de facto choice for generating high-quality images and videos.
Fast-FORward CAching (FORA) is designed to accelerate DiT by exploiting the repetitive nature of the diffusion process.
arXiv Detail & Related papers (2024-07-01T16:14:37Z) - Cache Me if You Can: Accelerating Diffusion Models through Block Caching [67.54820800003375]
A large image-to-image network has to be applied many times to iteratively refine an image from random noise.
We investigate the behavior of the layers within the network and find that 1) the layers' output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small.
We propose a technique to automatically determine caching schedules based on each block's changes over timesteps.
arXiv Detail & Related papers (2023-12-06T00:51:38Z) - DeepCache: Accelerating Diffusion Models for Free [65.02607075556742]
DeepCache is a training-free paradigm that accelerates diffusion models from the perspective of model architecture.
DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models; a minimal sketch of this interval-based reuse pattern appears after this list.
Under the same throughput, DeepCache effectively achieves comparable or even marginally improved results with DDIM or PLMS.
arXiv Detail & Related papers (2023-12-01T17:01:06Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Layer Pruning on Demand with Intermediate CTC [50.509073206630994]
We present a training and pruning method for ASR based on the connectionist temporal classification (CTC).
We show that a Transformer-CTC model can be pruned at various depths on demand, improving the real-time factor from 0.005 to 0.002 on GPU.
arXiv Detail & Related papers (2021-06-17T02:40:18Z)
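Several of the cache-based baselines above (e.g., FORA and DeepCache) share a simpler, training-free pattern: recompute deep features only every few denoising steps and reuse the cached result in between. The sketch below illustrates that generic interval-based reuse under assumed, simplified names (`DenoiserWithCache`, `cache_interval`); it is not any specific paper's implementation.

```python
import torch
import torch.nn as nn


class DenoiserWithCache(nn.Module):
    """Toy denoiser that recomputes its deep blocks only every `cache_interval` steps."""
    def __init__(self, dim=64, n_shallow=2, n_deep=6, cache_interval=3):
        super().__init__()
        make = lambda: nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
        self.shallow = nn.ModuleList(make() for _ in range(n_shallow))  # always computed
        self.deep = nn.ModuleList(make() for _ in range(n_deep))        # cached / reused
        self.cache_interval = cache_interval
        self._deep_cache = None

    @torch.no_grad()
    def forward(self, x, step):
        for blk in self.shallow:
            x = x + blk(x)
        # Recompute the deep blocks only on "full" steps; otherwise reuse the cached residual.
        if step % self.cache_interval == 0 or self._deep_cache is None:
            deep_in = x
            for blk in self.deep:
                x = x + blk(x)
            self._deep_cache = x - deep_in   # residual contributed by the deep blocks
        else:
            x = x + self._deep_cache
        return x


# Usage sketch: a 12-step denoising loop where deep blocks run on every third step only.
model = DenoiserWithCache()
x = torch.randn(4, 16, 64)
for step in range(12):
    x = model(x, step)
```

Learning-to-Cache differs from this fixed-interval pattern in that the per-layer reuse decisions are learned and vary across timesteps rather than being set to a uniform schedule.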
This list is automatically generated from the titles and abstracts of the papers on this site.