Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching
- URL: http://arxiv.org/abs/2509.10312v1
- Date: Fri, 12 Sep 2025 14:53:45 GMT
- Title: Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching
- Authors: Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, Linfeng Zhang
- Abstract summary: We introduce Cluster-Driven Feature Caching (ClusCa) to accelerate diffusion transformers. ClusCa performs spatial clustering on the tokens at each timestep, computes only one token per cluster, and propagates its information to all the other tokens. Experiments on DiT, FLUX and HunyuanVideo demonstrate its effectiveness in both text-to-image and text-to-video generation.
- Score: 11.75972316736487
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion transformers have gained significant attention in recent years for their ability to generate high-quality images and videos, yet they still suffer from a huge computational cost due to their iterative denoising process. Recently, feature caching has been introduced to accelerate diffusion transformers by caching feature computations from previous timesteps and reusing them in the following timesteps; this leverages the temporal similarity of diffusion models while ignoring similarity in the spatial dimension. In this paper, we introduce Cluster-Driven Feature Caching (ClusCa) as an orthogonal and complementary perspective to previous feature caching. Specifically, ClusCa performs spatial clustering on the tokens at each timestep, computes only one token per cluster, and propagates its information to all the other tokens in that cluster, which reduces the number of computed tokens by over 90%. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate its effectiveness in both text-to-image and text-to-video generation. Moreover, it can be applied directly to any diffusion transformer without training. For instance, ClusCa achieves a 4.96x acceleration on FLUX with an ImageReward of 99.49%, surpassing the original model by 0.51%. The code is available at https://github.com/Shenyi-Z/Cache4Diffusion.
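The abstract's mechanism reduces to three steps per accelerated timestep: cluster the tokens spatially, run the transformer computation on one representative token per cluster, and broadcast each representative's feature update to the rest of its cluster. The sketch below illustrates this pattern under stated assumptions: the naive k-means routine, the nearest-to-centroid choice of representative, and the additive propagation rule are illustrative choices, not the authors' implementation (see the linked repository for that).

```python
# Hedged sketch of cluster-driven token reuse: the k-means routine, the
# representative selection, and the additive propagation rule are
# illustrative assumptions, not ClusCa's actual implementation.
import torch

def kmeans(tokens, k, iters=10):
    """Naive k-means over token features. tokens: (N, D) -> (assignments, centroids)."""
    centroids = tokens[torch.randperm(tokens.size(0))[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centroids).argmin(dim=1)  # nearest centroid per token
        for c in range(k):
            members = tokens[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)             # recenter non-empty clusters
    assign = torch.cdist(tokens, centroids).argmin(dim=1)      # final assignments
    return assign, centroids

def cluster_cached_step(block, tokens, k=16):
    """Compute `block` on one representative token per cluster and broadcast
    each representative's feature update to every token in its cluster."""
    assign, centroids = kmeans(tokens, k)
    rep_idx = torch.cdist(centroids, tokens).argmin(dim=1)     # nearest real token to each centroid
    reps = tokens[rep_idx]                                     # (k, D): the only tokens computed
    delta = block(reps) - reps                                 # per-representative feature update
    return tokens + delta[assign]                              # propagate updates cluster-wise

# Toy usage with a stand-in "transformer block": 16 of 1024 tokens computed.
if __name__ == "__main__":
    block = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 64))
    x = torch.randn(1024, 64)
    with torch.no_grad():
        y = cluster_cached_step(block, x, k=16)
    print(y.shape)  # torch.Size([1024, 64])
```

Setting k=16 mirrors the title's claim of computing only 16 tokens in a timestep; a full pipeline would presumably interleave fully computed timesteps with cluster-cached ones, as in the temporal caching schemes listed below.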
Related papers
- Flow caching for autoregressive video generation [72.10021661412364]
We present FlowCache, the first caching framework specifically designed for autoregressive video generation. Our method achieves remarkable speedups of 2.38 times on MAGI-1 and 6.7 times on SkyReels-V2, with negligible quality degradation.
arXiv Detail & Related papers (2026-02-11T13:11:04Z)
- FreqCa: Accelerating Diffusion Models via Frequency-Aware Caching [13.999620910665612]
We show that different frequency bands in the features of diffusion models exhibit different dynamics across timesteps. We propose Frequency-aware Caching (FreqCa), which directly reuses the features of low-frequency components based on their similarity. We also propose to cache the Cumulative Residual Feature (CRF) instead of the features in all the layers, which reduces the memory footprint of feature caching by 99%.
arXiv Detail & Related papers (2025-10-09T17:22:23Z)
- SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching [17.724549528455317]
Diffusion models have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. We present SpeCa, a novel "Forecast-then-verify" acceleration framework that addresses these limitations. Our approach implements a parameter-free verification mechanism that efficiently evaluates prediction reliability, enabling real-time decisions to accept or reject each prediction.
arXiv Detail & Related papers (2025-09-15T06:46:22Z)
- Improving the Diffusability of Autoencoders [54.920783089085035]
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos. We perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces. We hypothesize that these high-frequency components interfere with the coarse-to-fine nature of the diffusion synthesis process and hinder generation quality.
arXiv Detail & Related papers (2025-02-20T18:45:44Z)
- CAT Pruning: Cluster-Aware Token Pruning For Text-to-Image Diffusion Models [5.406829638216823]
Diffusion models have revolutionized generative tasks, especially in the domain of text-to-image synthesis. However, their iterative denoising process demands substantial computational resources. We present a novel acceleration strategy that integrates token-level pruning with caching techniques to tackle this computational challenge.
arXiv Detail & Related papers (2025-02-01T13:46:02Z)
- Adaptive Caching for Faster Video Generation with Diffusion Transformers [52.73348147077075]
Diffusion Transformers (DiTs) rely on larger models and heavier attention mechanisms, resulting in slower inference speeds.
We introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache).
We also introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, controlling the compute allocation based on motion content.
arXiv Detail & Related papers (2024-11-04T18:59:44Z)
- Accelerating Diffusion Transformers with Token-wise Feature Caching [19.140800616594294]
Diffusion transformers have shown significant effectiveness in both image and video synthesis at the expense of huge computation costs. We introduce token-wise feature caching, allowing us to adaptively select the most suitable tokens for caching. Experiments on PixArt-α, OpenSora, and DiT demonstrate its effectiveness in both image and video generation with no requirements for training.
arXiv Detail & Related papers (2024-10-05T03:47:06Z)
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens. Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer can, through a caching mechanism, be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver++, alongside prior cache-based methods at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
- Cache Me if You Can: Accelerating Diffusion Models through Block Caching [67.54820800003375]
A large image-to-image network has to be applied many times to iteratively refine an image from random noise.
We investigate the behavior of the layers within the network and find that 1) the layers' output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small.
We propose a technique to automatically determine caching schedules based on each block's changes over timesteps.
arXiv Detail & Related papers (2023-12-06T00:51:38Z)
- DeepCache: Accelerating Diffusion Models for Free [65.02607075556742]
DeepCache is a training-free paradigm that accelerates diffusion models from the perspective of model architecture.
DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models.
Under the same throughput, DeepCache achieves comparable or even marginally improved results with DDIM or PLMS (a minimal sketch of this timestep-level reuse pattern follows the list).
arXiv Detail & Related papers (2023-12-01T17:01:06Z)
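Common to DeepCache and the other cache-based methods listed above is a single temporal reuse pattern: compute the expensive features fully on a few fresh timesteps and reuse the cached result on the steps in between. A minimal sketch of that pattern follows; the shallow/deep split, the fixed cache interval, and the Euler-style update rule are illustrative assumptions rather than any specific paper's schedule.

```python
# Hedged sketch of timestep-level feature caching: the shallow/deep split,
# fixed interval, and the update rule are illustrative, not any paper's code.
import torch

def denoise_with_feature_cache(shallow, deep, x, timesteps, cache_interval=3):
    """Run `deep` fully on every cache_interval-th step; otherwise reuse its
    cached output, exploiting temporal similarity between adjacent steps."""
    cached = None
    for i, t in enumerate(timesteps):
        h = shallow(x, t)                        # cheap early layers always run
        if cached is None or i % cache_interval == 0:
            cached = deep(h, t)                  # full compute on "fresh" steps only
        eps = h + cached                         # stand-in for merging skip and deep features
        x = x - 0.1 * eps                        # illustrative Euler-style denoising update
    return x

# Toy usage with stand-in networks; real methods cache blocks inside a DiT or U-Net.
if __name__ == "__main__":
    shallow = lambda x, t: torch.tanh(x)
    deep = lambda h, t: torch.nn.functional.gelu(h)
    x = torch.randn(4, 16)
    out = denoise_with_feature_cache(shallow, deep, x, timesteps=range(20))
    print(out.shape)  # torch.Size([4, 16])
```

ClusCa's spatial clustering is orthogonal to this temporal schedule: the two can in principle be combined, reusing features across timesteps while computing only one token per cluster within a timestep.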