Fugu-MT Paper Translation (Abstract): RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling

Paper overview: RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling

arxiv url: http://arxiv.org/abs/2606.06309v1
Date: Thu, 04 Jun 2026 15:49:37 GMT
Status: Translation complete
Updated in system: 2026-06-05 22:39:44.916317
Title: RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling
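Title (reference translation): RhymeFlow: Training-Free Acceleration for Video Generation via Asynchronous Denoising Flow Scheduling
Authors: Chensheng Dai, Shengjun Zhang, Yifan Li, Zhang Zhang, Zheng Zhu, Yueqi Duan
Abstract summary: Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis. DiTs suffer from high inference latency and computational cost because of the quadratic complexity of 3D attention. We introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames.
Reference score (independently computed attention score): 51.279397568734424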
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational costs due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce computational complexity within each individual denoising step through techniques such as sparse attention and KV-caching. However, they rigidly adhere to the inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must be subjected to a complete, dense denoising process across all diffusion timesteps. We observe that due to the corresponding contents and motions among adjacent frames, when keyframes with critical semantic transitions are anchored, the intermediate states of others often follow more predictable trajectories, which indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Since skipped intermediate states of non-keyframes break the temporal coherence in keyframe denoising steps, leading to visual degradation, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate that our method outperforms existing baselines with higher inference speed and better visual quality.
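Abstract (reference translation): Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, but they suffer from high inference latency and computational cost because of the quadratic complexity of 3D attention. Existing acceleration methods reduce the computational complexity of individual denoising steps through techniques such as sparse attention and KV caching. However, they hold to an inherent constraint of the standard diffusion pipeline: every frame of the target video sequence must go through a fully dense denoising process across all diffusion timesteps. Because of the corresponding content and motion among adjacent frames, when keyframes with critical semantic transitions are anchored, the intermediate states of the other frames often follow more predictable trajectories, which indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of keyframes that dominate the latent semantic evolution. Then only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Because skipping the intermediate states of non-keyframes breaks temporal coherence in the keyframe denoising steps and leads to visual degradation, we also introduce a latent trajectory projection module that lets keyframes interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models show that the proposed method outperforms existing baselines in both inference speed and visual quality.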
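Illustrative sketch (not from the paper): the abstract describes keyframes being denoised at every step while non-keyframes skip steps. The toy Python below builds such a per-frame schedule and counts frame updates. The function names, the fixed `skip_stride` rule, and the choice to always update every frame on the final step are assumptions made for illustration only; the abstract does not specify how RhymeFlow selects keyframes or which steps non-keyframes skip, and the latent trajectory projection module is not modeled here.

```python
from typing import List, Set


def build_async_schedule(num_frames: int, num_steps: int,
                         keyframes: Set[int],
                         skip_stride: int = 2) -> List[List[bool]]:
    """Return schedule[t][f] == True when frame f is denoised at step t.

    Hypothetical rule (an assumption, not the paper's):
      - keyframes are updated at every step;
      - non-keyframes are updated on every `skip_stride`-th step,
        and always on the final step.
    """
    if skip_stride < 1:
        raise ValueError("skip_stride must be >= 1")
    schedule = []
    for t in range(num_steps):
        is_last = t == num_steps - 1
        row = [
            f in keyframes or t % skip_stride == 0 or is_last
            for f in range(num_frames)
        ]
        schedule.append(row)
    return schedule


def count_updates(schedule: List[List[bool]]) -> int:
    """Total number of (step, frame) pairs that are actually denoised."""
    return sum(sum(row) for row in schedule)


if __name__ == "__main__":
    frames, steps = 8, 10
    sched = build_async_schedule(frames, steps, keyframes={0, 4, 7})
    done = count_updates(sched)
    print(f"{done} of {frames * steps} frame updates performed")
```

With 8 frames, 10 steps, and keyframes {0, 4, 7}, this toy schedule performs 60 of the 80 frame updates that dense denoising would need. Frame-update counts are only a rough proxy: the actual saving depends on how attention cost scales with the number of active tokens per step, which the abstract does not quantify.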


List of related papers