From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers
- URL: http://arxiv.org/abs/2503.06923v1
- Date: Mon, 10 Mar 2025 05:09:42 GMT
- Title: From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers
- Authors: Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, Linfeng Zhang
- Abstract summary: Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. Feature caching has been proposed to accelerate diffusion models by caching features at previous timesteps and reusing them at subsequent timesteps. We propose TaylorSeer, which first shows that features of diffusion models at future timesteps can be predicted from their values at previous timesteps.
- Score: 14.402483491830138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. To address this, feature caching has been proposed to accelerate diffusion models by caching features at previous timesteps and reusing them at subsequent timesteps. However, when the interval between timesteps is large, feature similarity drops substantially, so the errors introduced by feature caching grow markedly and degrade generation quality. To solve this problem, we propose TaylorSeer, which first shows that features of diffusion models at future timesteps can be predicted from their values at previous timesteps. Based on the observation that features change slowly and continuously across timesteps, TaylorSeer uses finite differences to approximate the higher-order derivatives of features and predicts features at future timesteps with a Taylor series expansion. Extensive experiments demonstrate its effectiveness in both image and video synthesis, especially at high acceleration ratios. For instance, it achieves almost lossless acceleration of 4.99$\times$ on FLUX and 5.00$\times$ on HunyuanVideo without additional training. On DiT, it achieves $3.41$ lower FID than the previous SOTA at 4.53$\times$ acceleration. Our code has been released on GitHub: https://github.com/Shenyi-Z/TaylorSeer
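To make the forecasting idea concrete, here is a minimal sketch, not the authors' released implementation, of how features cached at past timesteps could be extrapolated forward with backward finite differences and a Taylor expansion. The function name `taylor_forecast`, the uniform-spacing assumption, and the tensor shapes are illustrative.

```python
import torch

def taylor_forecast(cached_feats, steps_ahead=1, order=2):
    """Extrapolate a future feature from features cached at uniformly spaced
    past timesteps, via a Taylor expansion with backward finite differences.

    cached_feats: [f(t), f(t - h), f(t - 2h), ...], most recent first,
                  assumed uniformly spaced by step size h.
    steps_ahead:  predict f(t + steps_ahead * h).
    order:        highest derivative term to use (capped by cache length).
    """
    order = min(order, len(cached_feats) - 1)

    # k-th backward difference at t approximates f^{(k)}(t) * h^k.
    diffs = [cached_feats[0]]
    level = list(cached_feats)
    for _ in range(order):
        level = [a - b for a, b in zip(level[:-1], level[1:])]
        diffs.append(level[0])

    # f(t + m*h) ≈ sum_k (m^k / k!) * backward_diff_k; the h^k factors cancel.
    pred = torch.zeros_like(cached_feats[0])
    fact = 1.0
    for k, d in enumerate(diffs):
        if k > 0:
            fact *= k
        pred = pred + (steps_ahead ** k / fact) * d
    return pred

# Illustrative use: predict the feature two steps ahead from three cached ones.
f_t, f_prev, f_prev2 = (torch.randn(1, 16, 64) for _ in range(3))
f_future = taylor_forecast([f_t, f_prev, f_prev2], steps_ahead=2, order=2)
```

In this sketch, plain feature reuse corresponds to `order=0` (the prediction is just the cached feature), while higher orders add derivative terms that track how the feature is drifting across timesteps.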
Related papers
- Forecasting When to Forecast: Accelerating Diffusion Models with Confidence-Gated Taylor [10.899451333703437]
Diffusion Transformers (DiTs) have demonstrated remarkable performance in visual generation tasks. Recent training-free approaches exploit the redundancy of features across timesteps by caching and reusing past representations to accelerate inference. TaylorSeer instead uses cached features to predict future ones via Taylor expansion. We propose a novel approach to better leverage Taylor-based acceleration.
arXiv Detail & Related papers (2025-08-04T09:39:31Z) - Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation [74.34633861289662]
Radial Attention is a scalable sparse attention mechanism with $O(n \log n)$ complexity that translates energy decay into exponentially decaying compute density. It maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention.
arXiv Detail & Related papers (2025-06-24T17:59:59Z) - Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model [55.64316746098431]
Timestep Embedding Aware Cache (TeaCache) is a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps. TeaCache achieves up to 4.41$\times$ acceleration over Open-Sora-Plan with negligible degradation of visual quality.
arXiv Detail & Related papers (2024-11-28T12:50:05Z) - Accelerating Vision Diffusion Transformers with Skip Branches [47.07564477125228]
Diffusion Transformers (DiT) are an emerging image and video generation architecture. DiT's practical deployment is constrained by computational complexity and redundancy in the sequential denoising process. We introduce Skip-DiT, which augments standard DiT with skip branches to enhance feature smoothness. We also introduce Skip-Cache, which uses the skip branches to cache DiT features across timesteps at inference time.
arXiv Detail & Related papers (2024-11-26T17:28:10Z) - Accelerating Diffusion Transformers with Token-wise Feature Caching [19.140800616594294]
Diffusion transformers have shown significant effectiveness in both image and video synthesis at the expense of huge computation costs.
We introduce token-wise feature caching, allowing us to adaptively select the most suitable tokens for caching.
Experiments on PixArt-$\alpha$, OpenSora, and DiT demonstrate the effectiveness of our method in both image and video generation without any need for training.
arXiv Detail & Related papers (2024-10-05T03:47:06Z) - HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration [31.982294870690925]
We propose a novel learning-based caching framework dubbed HarmoniCa.
It incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process.
It also incorporates an Image Error Proxy-Guided Objective (IEPO) to balance image quality against cache utilization.
arXiv Detail & Related papers (2024-10-02T16:34:29Z) - Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: through a caching mechanism, the computation of a large proportion of layers in the diffusion transformer can be removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, as well as prior cache-based methods, at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z) - $\Delta$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers [13.433352602762511]
We propose an overall training-free inference acceleration framework, $\Delta$-DiT.
$\Delta$-DiT uses a designed cache mechanism to accelerate the rear DiT blocks in the early sampling stages and the front DiT blocks in the later stages.
Experiments on PIXART-$\alpha$ and DiT-XL demonstrate that $\Delta$-DiT can achieve a 1.6$\times$ speedup on 20-step generation.
arXiv Detail & Related papers (2024-06-03T09:10:44Z) - DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention [82.24166963631949]
Diffusion Gated Linear Attention Transformers (DiG) is a simple, adoptable solution with minimal parameter overhead. We offer two variants, i.e., a plain and a U-shape architecture, showing superior efficiency and competitive effectiveness.
arXiv Detail & Related papers (2024-05-28T17:59:33Z) - Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference [95.42299246592756]
We study the UNet encoder and empirically analyze the encoder features.
We find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps.
We validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation.
arXiv Detail & Related papers (2023-12-15T08:46:43Z) - DeepCache: Accelerating Diffusion Models for Free [65.02607075556742]
DeepCache is a training-free paradigm that accelerates diffusion models from the perspective of model architecture.
DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models.
Under the same throughput, DeepCache achieves results comparable to, or even marginally better than, DDIM and PLMS.
arXiv Detail & Related papers (2023-12-01T17:01:06Z) - DiffSmooth: Certifiably Robust Learning via Diffusion Models and Local Smoothing [39.962024242809136]
We propose DiffSmooth, which first performs adversarial purification via diffusion models and then maps the purified instances to a common region via a simple yet effective local smoothing strategy.
For instance, DiffSmooth improves the SOTA-certified accuracy from $36.0\%$ to $53.0\%$ under $\ell_2$ radius $1.5$ on ImageNet.
arXiv Detail & Related papers (2023-08-28T06:22:43Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models [166.64847903649598]
We propose Patch Diffusion, a generic patch-wise training framework.
Patch Diffusion significantly reduces the training time costs while improving data efficiency.
We achieve outstanding FID scores in line with state-of-the-art benchmarks.
arXiv Detail & Related papers (2023-04-25T02:35:54Z)