Fugu-MT Paper Translation (Abstract): Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching

Paper summary: Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching

arxiv url: http://arxiv.org/abs/2509.10312v1
Date: Fri, 12 Sep 2025 14:53:45 GMT
Status: Translation complete
Last updated in system: 2025-09-15 16:03:08.132949
Title: Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching
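Title (reference translation): Accelerating Diffusion Transformers with Cluster-Driven Feature Caching
Authors: Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, Linfeng Zhang
Abstract summary: This paper proposes Cluster-Driven Feature Caching (ClusCa) to accelerate diffusion transformers. ClusCa performs spatial clustering on the tokens in each timestep, computes only one token in each cluster, and propagates its information to all the other tokens. Experiments on DiT, FLUX, and HunyuanVideo demonstrate its effectiveness in text-to-image and text-to-video generation.
Reference score (independently computed attention score): 11.75972316736487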
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion transformers have gained significant attention in recent years for their ability to generate high-quality images and videos, yet they still suffer from a huge computational cost due to their iterative denoising process. Recently, feature caching has been introduced to accelerate diffusion transformers by caching the feature computation in previous timesteps and reusing it in the following timesteps, which leverages the temporal similarity of diffusion models while ignoring the similarity in the spatial dimension. In this paper, we introduce Cluster-Driven Feature Caching (ClusCa) as an orthogonal and complementary perspective to previous feature caching. Specifically, ClusCa performs spatial clustering on tokens in each timestep, computes only one token in each cluster, and propagates its information to all the other tokens, which reduces the number of computed tokens by over 90%. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate its effectiveness in both text-to-image and text-to-video generation. Moreover, it can be applied directly to any diffusion transformer without any training. For instance, ClusCa achieves 4.96x acceleration on FLUX with an ImageReward of 99.49%, surpassing the original model by 0.51%. The code is available at https://github.com/Shenyi-Z/Cache4Diffusion.
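Abstract (reference translation): Diffusion transformers have attracted attention in recent years for their ability to generate high-quality images and videos. Recently, feature caching has been introduced to accelerate diffusion transformers by caching the feature computation of previous timesteps and reusing it in subsequent timesteps, exploiting the temporal similarity of diffusion models while ignoring similarity in the spatial dimension. This paper introduces Cluster-Driven Feature Caching (ClusCa) as an orthogonal and complementary perspective to previous feature caching. Specifically, ClusCa performs spatial clustering on the tokens in each timestep, computes only one token in each cluster, and propagates its information to the other tokens, which can reduce the number of tokens by over 90%. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate its effectiveness in text-to-image and text-to-video generation. It can also be applied directly to any diffusion transformer without any training. For example, ClusCa achieves 4.96x acceleration on FLUX with an ImageReward of 99.49%, surpassing the original model by 0.51%. The code is available at https://github.com/Shenyi-Z/Cache4Diffusion.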
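
The clustering idea described in the abstract (cluster the tokens, compute one token per cluster, propagate the result to the rest) can be illustrated with a small NumPy sketch. This is not the authors' implementation; see the linked repository for that. Everything here is an assumption made for illustration: the hypothetical helpers `kmeans` and `cluster_cached_step`, the use of plain k-means on cached features, the random choice of representative, the blending weight `gamma`, and the toy per-token `layer_fn` standing in for a transformer layer. The sizes (256 tokens, 16 clusters) are also illustrative; with them, at most 16 of 256 tokens are recomputed, which matches the "over 90%" reduction in spirit.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain k-means (assumed clustering method); returns a label per row of x."""
    rng = np.random.default_rng(seed)
    # Fancy indexing copies, so updating centers does not modify x.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, k)
        labels = dist.argmin(1)
        for c in range(k):
            members = x[labels == c]
            if len(members):  # leave the center of an empty cluster unchanged
                centers[c] = members.mean(0)
    return labels

def cluster_cached_step(x, cached_out, labels, layer_fn, gamma=0.5, seed=0):
    """Compute layer_fn on one token per cluster and propagate to the others.

    x          : (N, D) token inputs at the current timestep
    cached_out : (N, D) layer outputs cached from an earlier timestep
    gamma      : hypothetical weight blending fresh and cached features
    """
    rng = np.random.default_rng(seed)
    clusters = np.unique(labels)  # sorted ids of non-empty clusters
    # One randomly chosen representative token per non-empty cluster.
    reps = np.array([rng.choice(np.flatnonzero(labels == c)) for c in clusters])
    fresh = layer_fn(x[reps])  # the only "expensive" computation in this step
    # Row of `fresh` that belongs to each token's cluster.
    row = np.searchsorted(clusters, labels)
    # Non-computed tokens: blend their cached output with the fresh output
    # of their cluster's representative.
    out = gamma * fresh[row] + (1.0 - gamma) * cached_out
    out[reps] = fresh  # representatives keep their exactly computed output
    return out, reps

rng = np.random.default_rng(0)
N, D, K = 256, 8, 16
x = rng.normal(size=(N, D))
W = rng.normal(size=(D, D))
layer_fn = lambda t: np.tanh(t @ W)  # toy per-token layer, not a real DiT block
# Stand-in for the previous timestep's output: the same layer on slightly different inputs.
cached_out = layer_fn(x + 0.01 * rng.normal(size=(N, D)))
labels = kmeans(cached_out, K)
out, reps = cluster_cached_step(x, cached_out, labels, layer_fn)
print(len(reps), "of", N, "tokens computed")
```

A real transformer layer mixes tokens through attention, so computing a single token still needs keys and values from the others; the per-token `layer_fn` above sidesteps that on purpose to keep the sketch short.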

Related papers