Fugu-MT Paper Translation (Abstract): Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

Paper summary: Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

arxiv url: http://arxiv.org/abs/2605.16003v1
Date: Fri, 15 May 2026 14:33:09 GMT
Status: Translation complete
System update date: 2026-05-18 21:22:26.320403
Title: Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation
Title (reference translation): Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation
Authors: Mingqiang Wu, Weilun Feng, Zhefeng Zhang, Haotong Qin, Yuqi Li, Guoxin Fan, Xiaokun Liu, Zhulin An, Libo Huang, Yongjun Xu, Chuanguang Yang
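Abstract summary: Echo-Forcing is a training-free scene memory framework for interactive long video generation. It supports smooth transitions, hard cuts, and long-range scene recall under a bounded cache budget.
Reference score (independently computed attention score): 48.476317015122625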
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods mainly focus on stable extension under a single prompt, which makes it difficult for them to handle interactive scenarios involving prompt switching, old scene forgetting, and historical scene recall. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, leading to outdated background contamination, delayed response to new prompts, and loss of long-range memory. To address this issue, we propose Echo-Forcing, a training-free scene memory framework specifically designed for interactive long video generation with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; (2) Scene Recall Frames, which compress historical scenes into spatially structured KV representations to support long-term recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. Based on these designs, Echo-Forcing uniformly supports smooth transitions, hard cuts, and long-range scene recall under a bounded cache budget. Extensive evaluations on VBench-Long further demonstrate that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings. Our code is released at https://github.com/mingqiangWu/Echo-Forcing
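Abstract (reference translation): Autoregressive video diffusion models make open-ended generation possible through local attention and KV caching. However, existing training-free optimization methods for long video mainly focus on stable extension under a single prompt and struggle with interactive scenarios such as prompt switching, forgetting of old scenes, and recall of past scenes. The core bottleneck is the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, which leads to contamination by outdated backgrounds, delayed response to new prompts, and loss of long-range memory. To address this, the authors propose Echo-Forcing, a training-free scene memory framework specialized for interactive long video generation, built on three mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; (2) Scene Recall Frames, which compress historical scenes into spatially structured KV representations to support long-term recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. With these designs, Echo-Forcing uniformly supports smooth transitions, hard cuts, and long-range scene recall under a bounded cache budget. Extensive evaluations on VBench-Long show that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings. The code is released at https://github.com/mingqiangWu/Echo-Forcing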
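Illustrative sketch (not the authors' implementation): the abstract names three memory tiers (stable anchors, compressed history, recent window), a bounded cache budget, and forgetting driven by the discrepancy between old and new scenes. The toy Python below shows one way such a cache could be organized. Everything in it is an assumption made for illustration: the names `TieredCache`, `agreement`, and `forget`, the pairwise-mean compression, the cosine-based score, and the threshold value are hypothetical and do not come from the paper or its repository. Frames are plain float vectors standing in for KV states, and relative RoPE is not modeled.

```python
import math
from dataclasses import dataclass, field


def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def agreement(old, new):
    """Cosine similarity rescaled from [-1, 1] to [0, 1]; 0 means full conflict."""
    dot = sum(a * b for a, b in zip(old, new))
    norm = math.sqrt(sum(a * a for a in old)) * math.sqrt(sum(b * b for b in new))
    return 0.5 * (1.0 + dot / norm) if norm else 0.0


@dataclass
class TieredCache:
    """Hypothetical three-tier cache; assumes n_history >= 1."""
    n_anchor: int = 1
    n_history: int = 2
    n_recent: int = 2
    anchors: list = field(default_factory=list)
    history: list = field(default_factory=list)
    recent: list = field(default_factory=list)

    def append(self, frame):
        # The first frames become anchors and are never evicted.
        if len(self.anchors) < self.n_anchor:
            self.anchors.append(frame)
            return
        # New frames enter the recent window; the oldest one spills into history.
        self.recent.append(frame)
        if len(self.recent) > self.n_recent:
            self.history.append(self.recent.pop(0))
        # Keep history bounded by averaging its two oldest entries (assumed scheme).
        if len(self.history) > self.n_history:
            self.history[:2] = [mean(self.history[:2])]

    def forget(self, new_scene, threshold=0.6):
        # Drop history entries that conflict with the new scene (assumed rule).
        self.history = [h for h in self.history if agreement(h, new_scene) >= threshold]

    def size(self):
        return len(self.anchors) + len(self.history) + len(self.recent)
```

In this sketch, `size()` never exceeds `n_anchor + n_history + n_recent` no matter how many frames are appended, which is the sense of "bounded cache budget" being illustrated. Calling `forget` with a vector representing the new scene removes only compressed history entries, leaving anchors and the recent window untouched; a real system would choose what to decay per token rather than per frame.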


Related papers