Fugu-MT Paper Translation (Abstract): Relax Forcing: Relaxed KV-Memory for Consistent Long Video Generation

Paper summary: Relax Forcing: Relaxed KV-Memory for Consistent Long Video Generation

arxiv url: http://arxiv.org/abs/2603.21366v1
Date: Sun, 22 Mar 2026 18:59:24 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.381209
Title: Relax Forcing: Relaxed KV-Memory for Consistent Long Video Generation
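Title (reference translation): RelaxForcing: Relaxing KV Memory for Consistent Long Video Generation
Authors: Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, Ioannis Patras
Abstract summary: Autoregressive (AR) video diffusion has recently emerged as a promising paradigm for long video generation. Extending generation to minute-scale horizons remains challenging because of progressive temporal degradation. This paper introduces Relax Forcing, a structured temporal memory mechanism for AR diffusion.
Reference score (independently computed attention metric): 73.84423888025171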
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoregressive (AR) video diffusion has recently emerged as a promising paradigm for long video generation, enabling causal synthesis beyond the limits of bidirectional models. To address training-inference mismatch, a series of self-forcing strategies have been proposed to improve rollout stability by conditioning the model on its own predictions during training. While these approaches substantially mitigate exposure bias, extending generation to minute-scale horizons remains challenging due to progressive temporal degradation. In this work, we show that this limitation is not primarily caused by insufficient memory, but by how temporal memory is utilised during inference. Through empirical analysis, we find that increasing memory does not consistently improve long-horizon generation, and that the temporal placement of historical context significantly influences motion dynamics while leaving visual quality largely unchanged. These findings suggest that temporal memory should not be treated as a homogeneous buffer. Motivated by this insight, we introduce Relax Forcing, a structured temporal memory mechanism for AR diffusion. Instead of attending to the dense generated history, Relax Forcing decomposes temporal context into three functional roles: Sink for global stability, Tail for short-term continuity, and dynamically selected History for structural motion guidance, and selectively incorporates only the most relevant past information. This design mitigates error accumulation during extrapolation while preserving motion evolution. Experiments on VBench-Long demonstrate that Relax Forcing improves motion dynamics and overall temporal consistency while reducing attention overhead. Our results suggest that structured temporal memory is essential for scalable long video generation, complementing existing forcing-based training strategies.
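The three-role decomposition described in the abstract (Sink, Tail, and dynamically selected History) can be illustrated with a minimal sketch of frame-index selection. This is not the authors' implementation: the abstract does not say how History frames are chosen or how many frames each role keeps, so the function name `select_memory_frames`, the per-frame relevance `scores`, and the default sizes below are all hypothetical placeholders.

```python
from typing import List, Sequence


def select_memory_frames(
    scores: Sequence[float],
    n_sink: int = 1,
    n_tail: int = 3,
    n_history: int = 2,
) -> List[int]:
    """Pick which past frames stay in the KV memory (illustrative only).

    scores[i] is an assumed relevance score for past frame i; the paper's
    actual selection criterion is not given in the abstract.
    """
    num_past = len(scores)

    # Sink: the earliest frames, kept as a global anchor.
    sink = list(range(min(n_sink, num_past)))

    # Tail: the most recent frames, kept for short-term continuity.
    tail_start = max(num_past - n_tail, len(sink))
    tail = list(range(tail_start, num_past))

    # History: the highest-scoring frames strictly between Sink and Tail.
    middle = range(len(sink), tail_start)
    history = sorted(middle, key=lambda i: scores[i], reverse=True)[:n_history]

    return sorted(sink + tail + history)


if __name__ == "__main__":
    example_scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.4, 0.5, 0.6, 0.7, 0.0]
    print(select_memory_frames(example_scores))  # [0, 1, 3, 7, 8, 9]
```

With ten past frames, the sketch keeps six of them rather than attending to the full dense history, which is the kind of reduction in attention overhead the abstract refers to. In a real model the returned indices would be used to gather the corresponding key/value entries before attention.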
