Fugu-MT Paper Translation (Abstract): Towards Error-Free Long Video Generation

Paper overview: Towards Error-Free Long Video Generation

arxiv url: http://arxiv.org/abs/2606.22370v1
Date: Sun, 21 Jun 2026 07:39:37 GMT
Status: Translation complete
Last updated in system: 2026-06-26 21:40:04.581929
Title: Towards Error-Free Long Video Generation
Title (reference translation): Towards Error-Free Long Video Generation
Authors: Shuning Chang, Weihua Chen, Jiasheng Tang, Hao Xu, Zeyu Zhang, Hangjie Yuan, Yu Lu, Ruigang Niu, Fan Wang, Bohan Zhuang, Yi Yang
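Abstract summary: This paper proposes an infinite-length video generation framework that produces high-quality, dynamic, and identity-consistent single-shot long videos. The authors first finetune a diffusion model as a video extension model on large-scale short video data to autoregressively generate temporally coherent clips. The framework establishes a new benchmark for realistic and coherent minute-level video synthesis.
Reference score (independently computed attention metric): 56.86952045212838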
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in video generation have made minute-level synthesis possible; however, generating long videos remains challenging due to error accumulation, attribute drift, and the limited availability of long video data. In this paper, we introduce an infinite-length video generation framework that addresses these issues and produces high-quality, dynamic, and identity-consistent single-shot long videos. We first finetune a diffusion model as a video extension model on large-scale short video data to autoregressively generate temporally coherent clips. Inspired by the success of large language models (LLMs), we adopt causal attention computation between clips to further finetune this model on long video data. In this way, the tokens in one clip (short video) are computed by bidirectional attention while tokens among clips are computed by unidirectional attention. This design leverages the strengths of modern diffusion models while preserving long-term context information, effectively mitigating error accumulation and attribute drift. To achieve memory efficiency during inference, we adopt a key-value (KV) caching mechanism to maintain a constant KV memory. Furthermore, we introduce a truncation-rectified flow (T-RFlow) technique to further suppress error accumulation. Experimental results demonstrate the effectiveness of our method. Our framework establishes a new benchmark for realistic and coherent minute-level video synthesis.
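Abstract (reference translation): Recent advances in video generation have made minute-level synthesis possible, but generating long videos remains difficult because of error accumulation, attribute drift, and the limited availability of long video data. This paper proposes an infinite-length video generation framework that focuses on these issues and generates high-quality, dynamic, and identity-consistent single-shot long videos. First, a diffusion model is finetuned as a video extension model on large-scale short video data so that it autoregressively generates temporally coherent clips. Inspired by the success of large language models (LLMs), causal attention computation between clips is adopted and the model is further finetuned on long video data. In this way, tokens within one clip (short video) are computed with bidirectional attention, while tokens across clips are computed with unidirectional attention. This design leverages the strengths of modern diffusion models while preserving long-term context information, effectively mitigating error accumulation and attribute drift. To achieve memory efficiency during inference, a key-value (KV) caching mechanism is adopted to maintain a constant KV memory. In addition, a truncation-rectified flow (T-RFlow) technique is introduced to further suppress error accumulation. Experimental results demonstrate the effectiveness of the method. The framework establishes a new benchmark for realistic and coherent minute-level video synthesis.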
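
Illustrative sketch (not from the paper): the abstract describes attention that is bidirectional within a clip and unidirectional across clips, plus a KV cache whose memory stays constant. The Python below is a minimal sketch of what such a block-causal mask could look like, assuming every clip has the same number of tokens. The names block_causal_mask and RollingClipCache are hypothetical. Evicting the oldest clips is only one assumed way to bound the cache; the abstract does not say how the authors keep KV memory constant.

```python
from collections import deque


def block_causal_mask(num_clips, tokens_per_clip):
    """Return mask[q][k] == True where query token q may attend to key token k.

    Tokens in the same clip see each other (bidirectional); a token also sees
    every token of earlier clips, but never tokens of later clips (causal).
    """
    n = num_clips * tokens_per_clip
    clip_of = [i // tokens_per_clip for i in range(n)]
    return [[clip_of[k] <= clip_of[q] for k in range(n)] for q in range(n)]


class RollingClipCache:
    """Hypothetical bounded cache: keeps entries for the newest max_clips clips.

    Assumption for illustration only; the eviction policy is not taken from
    the paper.
    """

    def __init__(self, max_clips):
        self.entries = deque(maxlen=max_clips)  # oldest entry is dropped when full

    def append(self, clip_kv):
        self.entries.append(clip_kv)

    def __len__(self):
        return len(self.entries)


if __name__ == "__main__":
    # 3 clips of 2 tokens each; prints one row per query token.
    for row in block_causal_mask(num_clips=3, tokens_per_clip=2):
        print("".join("1" if allowed else "0" for allowed in row))
    # 110000
    # 110000
    # 111100
    # 111100
    # 111111
    # 111111
```

In an actual model, a mask of this shape would be passed to the attention layers as a boolean attention mask, and the cache entries would hold per-clip key and value tensors rather than placeholder objects.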

Related papers