Fugu-MT Paper Translation (Abstract): MixCache: Mixture-of-Cache for Video Diffusion Transformer Acceleration

Paper abstract: MixCache: Mixture-of-Cache for Video Diffusion Transformer Acceleration

arxiv url: http://arxiv.org/abs/2508.12691v1
Date: Mon, 18 Aug 2025 07:49:33 GMT
Status: Translation complete
Last updated in system: 2025-08-19 14:49:11.067529
Title: MixCache: Mixture-of-Cache for Video Diffusion Transformer Acceleration
Authors: Yuanxin Wei, Lansong Diao, Bujiao Chen, Shenggan Cheng, Zhengping Qian, Wenyuan Yu, Nong Xiao, Wei Lin, Jiangsu Du
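Abstract summary: Caching is a widely adopted optimization method in DiT models. We propose MixCache, a training-free caching-based framework for efficient video DiT inference.
Reference score (independently computed attention score): 15.22288174114487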
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Leveraging the Transformer architecture and the diffusion process, video DiT models have emerged as a dominant approach for high-quality video generation. However, their multi-step iterative denoising process incurs high computational cost and inference latency. Caching, a widely adopted optimization method in DiT models, leverages the redundancy in the diffusion process to skip computations at different granularities (e.g., step, cfg, block). Nevertheless, existing caching methods are limited to single-granularity strategies and struggle to balance generation quality and inference speed in a flexible manner. In this work, we propose MixCache, a training-free caching-based framework for efficient video DiT inference. It first distinguishes the interference and boundary between different caching strategies, and then introduces a context-aware cache triggering strategy to determine when caching should be enabled, along with an adaptive hybrid cache decision strategy for dynamically selecting the optimal caching granularity. Extensive experiments on diverse models demonstrate that MixCache can significantly accelerate video generation (e.g., 1.94$\times$ speedup on Wan 14B, 1.97$\times$ speedup on HunyuanVideo) while delivering both superior generation quality and inference efficiency compared to baseline methods.
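The step-granularity caching that the abstract mentions can be illustrated with a minimal sketch. This is not MixCache's context-aware triggering or hybrid decision strategy, whose details the abstract does not give; it is a generic single-granularity cache that reuses the previous model output when the input has changed little. The names `StepCache`, `toy_model`, `denoise`, and the `threshold` value are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Optional

Vector = List[float]


def relative_change(a: Vector, b: Vector) -> float:
    """Relative L1 difference of a with respect to b."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) or 1.0
    return num / den


class StepCache:
    """Hypothetical step-level cache (illustrative, not the paper's method).

    Reuses the last model output whenever the current input is within
    `threshold` relative change of the input at the last real computation.
    """

    def __init__(self, model: Callable[[Vector], Vector], threshold: float):
        self.model = model
        self.threshold = threshold
        self.last_input: Optional[Vector] = None
        self.last_output: Optional[Vector] = None
        self.computed = 0
        self.skipped = 0

    def __call__(self, x: Vector) -> Vector:
        if (
            self.last_input is not None
            and relative_change(x, self.last_input) < self.threshold
        ):
            # Input barely moved: skip the model call and reuse the cache.
            self.skipped += 1
            return self.last_output
        # Otherwise run the model and refresh the cache.
        self.last_output = self.model(x)
        self.last_input = list(x)
        self.computed += 1
        return self.last_output


def toy_model(x: Vector) -> Vector:
    """Stand-in for an expensive DiT forward pass (assumption)."""
    return [0.5 * v for v in x]


def denoise(x: Vector, steps: int, predictor, step_size: float = 0.1) -> Vector:
    """Toy iterative update loop that calls the predictor once per step."""
    for _ in range(steps):
        eps = predictor(x)
        x = [v - step_size * e for v, e in zip(x, eps)]
    return x


cached = StepCache(toy_model, threshold=0.08)
result = denoise([1.0, -2.0, 3.0], steps=10, predictor=cached)
print(f"computed={cached.computed} skipped={cached.skipped}")
```

A larger `threshold` skips more steps at the cost of a less accurate result, which is the quality and speed trade-off the abstract describes; a threshold of 0 disables skipping entirely.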

Related papers