Fugu-MT paper translation (abstract): LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

Paper abstract: LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

arxiv url: http://arxiv.org/abs/2602.20497v1
Date: Tue, 24 Feb 2026 02:53:28 GMT
Status: Translation complete
Last updated in system: 2026-02-25 17:34:53.584952
Title: LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration
Title (reference translation): LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration
Authors: Peiliang Cai, Jiacheng Liu, Haowen Xu, Xinyu Wang, Chang Zou, Linfeng Zhang
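Abstract summary: Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers pose a significant challenge to practical deployment. The authors propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training.
Reference score (attention metric computed by this site): 12.183601881545039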
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-fidelity generation. Experiments demonstrate 5.00x acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is included in the supplementary materials and will be released on GitHub.
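Abstract (reference translation): Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. Feature caching is a promising acceleration strategy, but existing methods based on simple reuse or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process. They often degrade quality and fail to stay consistent with the standard denoising process. To address this, the authors propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. The approach uses a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. It further introduces a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Experiments show that the method achieves significant acceleration while maintaining high-fidelity generation: 5.00x acceleration on FLUX.1-dev with a 1.0% quality drop, 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of the training-based framework across different models. The code is included in the supplementary materials and will be released on GitHub.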
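
The stage-aware caching idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the equal-width stage split, the refresh interval, the toy linear experts (standing in for the learned KAN predictors), and every function name below are assumptions made only for illustration. The sketch runs an expensive model on some steps, forecasts features on the steps in between, and picks the forecasting expert by noise-level stage.

```python
from typing import Callable, List, Sequence, Tuple

Feature = List[float]
# A predictor maps (cached feature, steps since last full computation) to a forecast.
Predictor = Callable[[Feature, int], Feature]


def stage_of(step: int, num_steps: int, num_stages: int) -> int:
    """Map a denoising step to a stage index by equal-width binning (an assumption)."""
    return min(step * num_stages // num_steps, num_stages - 1)


def make_linear_expert(scale: float) -> Predictor:
    """Toy stand-in for a learned predictor; the paper's KAN is not reproduced here."""
    def predict(cached: Feature, gap: int) -> Feature:
        return [x * (1.0 + scale * gap) for x in cached]
    return predict


def run_with_cache(
    full_model: Callable[[int], Feature],
    experts: Sequence[Predictor],
    num_steps: int,
    interval: int,
) -> Tuple[List[Feature], int]:
    """Call the full model every `interval` steps; forecast features on the other steps."""
    features: List[Feature] = []
    full_calls = 0
    cached: Feature = []
    last_full = -1
    for step in range(num_steps):
        if step % interval == 0:
            # Refresh the cache with a real (expensive) forward pass.
            cached, last_full = full_model(step), step
            full_calls += 1
            features.append(cached)
        else:
            # Route to the expert responsible for this noise-level stage.
            expert = experts[stage_of(step, num_steps, len(experts))]
            features.append(expert(cached, step - last_full))
    return features, full_calls


# Hypothetical usage: 50 steps, 3 stages, full computation every 5th step.
experts = [make_linear_expert(s) for s in (0.10, 0.05, 0.01)]
features, full_calls = run_with_cache(
    lambda t: [float(t), 1.0], experts, num_steps=50, interval=5
)
print(full_calls, len(features))
```

In this toy setup the ratio of total steps to full-model calls only bounds the achievable speedup from above, since the cost of the predictors themselves is ignored. It should not be read as a reproduction of the figures reported in the abstract.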

Related papers