Fugu-MT Paper Translation (Abstract): Streaming Autoregressive Video Generation via Diagonal Distillation

Paper overview: Streaming Autoregressive Video Generation via Diagonal Distillation

arxiv url: http://arxiv.org/abs/2603.09488v1
Date: Tue, 10 Mar 2026 10:45:24 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:24.238052
Title: Streaming Autoregressive Video Generation via Diagonal Distillation
Title (reference translation): Streaming Autoregressive Video via Diagonal Distillation
Authors: Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, Weiyang Liu
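Abstract summary: Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. We propose Diagonal Distillation to exploit temporal information across both video chunks and denoising steps. Our method generates a 5-second video in 2.61 seconds (up to 31 FPS), achieving a 277.3x speedup over the undistilled model.
Reference score (independently computed attention score): 50.13573884115673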
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency-quality trade-off. We identify two factors that result in these limitations: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (i.e., exposure bias). To address these issues, we propose Diagonal Distillation, which operates orthogonally to existing approaches and better exploits temporal information across both video chunks and denoising steps. Central to our approach is an asymmetric generation strategy: more steps early, fewer steps later. This design allows later chunks to inherit rich appearance information from thoroughly processed early chunks, while using partially denoised chunks as conditional inputs for subsequent synthesis. By aligning the implicit prediction of subsequent noise levels during chunk generation with the actual inference conditions, our approach mitigates error propagation and reduces oversaturation in long-range sequences. We further incorporate implicit optical flow modeling to preserve motion quality under strict step constraints. Our method generates a 5-second video in 2.61 seconds (up to 31 FPS), achieving a 277.3x speedup over the undistilled model.
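The asymmetric strategy described in the abstract (more denoising steps for early chunks, fewer for later ones) can be illustrated with a minimal Python sketch. The function name `diagonal_step_schedule`, the step counts, and the one-step-per-chunk decay are hypothetical choices made for illustration; the abstract does not specify the actual schedule. The final lines back out the undistilled runtime implied by the reported figures, assuming the 277.3x speedup refers to the same 5-second video.

```python
def diagonal_step_schedule(num_chunks, first_steps=4, min_steps=1):
    """Return a per-chunk denoising step count that decreases over chunks.

    Hypothetical illustration: the step counts and the one-step-per-chunk
    decay are assumptions, not values taken from the paper.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    # Early chunks get more steps; later chunks bottom out at min_steps.
    return [max(min_steps, first_steps - i) for i in range(num_chunks)]


schedule = diagonal_step_schedule(num_chunks=6)
uniform_total = 4 * 6  # baseline: first_steps applied to every chunk
print(schedule, sum(schedule), uniform_total)

# Runtime implied by the reported numbers (2.61 s, 277.3x speedup),
# assuming both measurements cover the same 5-second video.
distilled_seconds = 2.61
speedup = 277.3
undistilled_seconds = distilled_seconds * speedup
print(round(undistilled_seconds / 60, 1), "minutes")
```

Under these assumed values, six chunks use 12 denoising steps in total, compared with 24 for a uniform four-step schedule. The reported figures imply an undistilled runtime of roughly 12 minutes for the same clip.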

Related papers