Fugu-MT Paper Translation (Summary): FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity

Paper summary: FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity

arxiv url: http://arxiv.org/abs/2605.11869v1
Date: Tue, 12 May 2026 09:49:36 GMT
Status: translation complete
Last updated in system: 2026-05-13 21:48:56.771633
Title: FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
Title (reference translation): FIS-DiT: Breaking the Few-Step Video Inference Barrier with Training-Free Frame Interleaved Sparsity
Authors: Jian Tang, Jiawei Fan, Qingbin Liu, Zheng Wei,
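Abstract summary: Model distillation can substantially reduce the overall inference latency of video Diffusion Transformers (DiTs), but per-step inference latency remains a critical bottleneck. This work proposes Frame Interleaved Sparsity DiT (FIS-DiT), a training-free, operator-agnostic framework that shifts the optimization focus from the temporal trajectory to the latent frame dimension. FIS-DiT consistently achieves a 2.11--2.41$\times$ speedup with negligible degradation across VBench-Q and CLIP metrics.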
Reference score (independently computed attention metric): 23.184639887235218
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While the overall inference latency of Video Diffusion Transformers (DiTs) can be substantially reduced through model distillation, per-step inference latency remains a critical bottleneck. Existing acceleration paradigms primarily exploit redundancy across the denoising trajectory; however, we identify a limitation where these step-wise strategies encounter diminishing returns in few-step regimes. In such scenarios, the scarcity of temporal states prevents effective feature reuse or predictive modeling, creating a formidable barrier to further acceleration. To overcome this, we propose Frame Interleaved Sparsity DiT (FIS-DiT), a training-free and operator-agnostic framework that shifts the optimization focus from the temporal trajectory to the latent frame dimension. Our approach is motivated by an intrinsic duality within this dimension: the existence of frame-wise sparsity that permits reduced computation, coupled with a structural consistency where each frame position remains equally vital to the global spatiotemporal context. Leveraging this insight, we implement Frame Interleaved Sparsity (FIS) as an execution strategy that manipulates frame subsets across the model hierarchy, refreshing all latent positions without requiring full-scale block computation. Empirical evaluations on Wan 2.2 and HunyuanVideo 1.5 demonstrate that FIS-DiT consistently achieves 2.11--2.41$\times$ speedup with negligible degradation across VBench-Q and CLIP metrics, providing a scalable and robust pathway toward real-time high-definition video generation.
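Abstract (reference translation): Model distillation substantially reduces the overall inference latency of video Diffusion Transformers (DiTs), but per-step inference latency is still a critical bottleneck. Existing acceleration paradigms mainly exploit redundancy along the denoising trajectory, and the authors identify a limitation: these step-wise strategies run into diminishing returns in few-step regimes. With so few temporal states available, effective feature reuse and predictive modeling are no longer possible, which creates a formidable barrier to further acceleration. To address this, the authors propose Frame Interleaved Sparsity DiT (FIS-DiT), a training-free, operator-agnostic framework that shifts the optimization focus from the temporal trajectory to the latent frame dimension. The approach is motivated by an intrinsic duality in that dimension: frame-wise sparsity makes reduced computation possible, while structural consistency means every frame position is equally essential to the global spatiotemporal context. Building on this insight, Frame Interleaved Sparsity (FIS) is implemented as an execution strategy that manipulates frame subsets across the model hierarchy and refreshes all latent positions without requiring full-scale block computation. Empirical evaluations on Wan 2.2 and HunyuanVideo 1.5 show that FIS-DiT consistently achieves a 2.11--2.41$\times$ speedup with negligible degradation across VBench-Q and CLIP metrics, offering a scalable and robust path toward real-time high-definition video generation.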
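The abstract describes FIS only at a high level (frame subsets handled per block, with every latent position still refreshed). The toy sketch below illustrates one way such interleaving could look; it is not the authors' implementation. The function names, the `stride` parameter, the offset rotation rule, and the stand-in blocks are all assumptions made for illustration, and the sketch ignores how skipped frames would contribute context to attention.

```python
import numpy as np

def interleaved_frame_indices(num_frames, block_idx, stride=2):
    """Frame indices a block recomputes (hypothetical rotation rule).

    The offset rotates with the block index, so consecutive blocks
    cover complementary frame subsets.
    """
    offset = block_idx % stride
    return np.arange(offset, num_frames, stride)

def run_blocks_interleaved(latents, blocks, stride=2):
    """Apply each block to only its interleaved frame subset.

    latents: array of shape (frames, tokens, channels).
    blocks:  callables standing in for transformer blocks (assumption).
    """
    x = latents.copy()
    for i, block in enumerate(blocks):
        idx = interleaved_frame_indices(x.shape[0], i, stride)
        # Only 1/stride of the frames pass through this block.
        x[idx] = block(x[idx])
    return x

# Toy stand-in blocks: each adds 1 to whatever frames it processes.
blocks = [lambda h: h + 1.0 for _ in range(4)]
latents = np.zeros((8, 3, 2))
out = run_blocks_interleaved(latents, blocks, stride=2)

# Count how many times each frame position was refreshed.
updates_per_frame = out[:, 0, 0]
print(updates_per_frame)
```

In this toy setup each block touches half of the frames, yet every frame position is updated by some block, which mirrors the "refresh all latent positions without full-scale block computation" idea in spirit only.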

Related papers