Fugu-MT Paper Translation (Abstract): Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

Paper overview: Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

arxiv url: http://arxiv.org/abs/2605.18346v1
Date: Mon, 18 May 2026 12:58:13 GMT
Status: Translation complete
System update date: 2026-05-19 17:57:49.616638
Title: Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion
Title (reference translation): Focused Forcing: Content-Aware Per-Frame KV Selection for Autoregressive Video Diffusion
Authors: Peiliang Cai, Evelyn Zhang, Jiacheng Liu, Hao Lin, Ruiqi Zhang, Weile Mo, Yue Ma, Shikang Zheng, Jiehang Huang, Dongrui Liu, Linfeng Zhang
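Abstract summary: Focused Forcing is a training-free KV selection method that focuses cached history along both the generated-frame and head dimensions. For each generated frame, Focused Forcing preserves the most relevant and distinctive historical frames. Across multiple autoregressive generation paradigms, Focused Forcing achieves up to $\textbf{1.48}\times$ end-to-end acceleration without training.
Reference score (independently computed attention metric): 25.555611454522126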
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in autoregressive video diffusion have enabled sequential and streaming video generation. However, long-horizon generation requires increasingly large KV caches, making efficient compression without sacrificing quality challenging. Existing methods mostly select historical frames based on attention scores, but their context decisions remain coarse. When multiple frames are generated in the same chunk, these methods often apply a shared history selection to the whole chunk, score historical frames solely by attention, and assign head-wise budgets either uniformly or by attention-pattern heuristics rather than explicit head-importance estimation. We show that frames within the same generated chunk can depend on distinct historical frames, that the same historical frame can receive different attention scores as its relative temporal distance to the current frames changes, and that masking different heads induces unequal generation degradation. Motivated by these findings, we propose \textbf{Focused Forcing}, a training-free KV selection method that focuses cached history along both generated-frame and head dimensions. For each generated frame, Focused Forcing preserves the most relevant and distinctive historical frames by combining attention scores with diversity scores of historical frames, while assigning larger budgets to heads with higher estimated importance. Across multiple autoregressive generation paradigms, Focused Forcing achieves up to $\textbf{1.48}\times$ end-to-end acceleration without training, while \textbf{improving visual quality and text alignment}. \textit{Our code will be released on GitHub.}
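Abstract (reference translation): Recent advances in autoregressive video diffusion have enabled sequential and streaming video generation. However, long-horizon generation requires ever-larger KV caches, which makes efficient compression without sacrificing quality difficult. Existing methods mostly select historical frames based on attention scores, but their context decisions remain coarse. When multiple frames are generated in the same chunk, these methods apply a shared history selection to the whole chunk, score historical frames by attention alone, and assign per-head budgets uniformly or by attention-pattern heuristics rather than by explicit head-importance estimation. We show that frames within the same chunk can depend on different historical frames, that the same historical frame can receive different attention scores as its relative temporal distance to the current frames changes, and that masking different heads causes unequal generation degradation. Motivated by these findings, we propose \textbf{Focused Forcing}, a training-free KV selection method that focuses cached history along both the generated-frame and head dimensions. For each generated frame, Focused Forcing preserves the most relevant and distinctive historical frames by combining attention scores with diversity scores of the historical frames. Across multiple autoregressive generation paradigms, Focused Forcing achieves up to $\textbf{1.48}\times$ end-to-end acceleration without training. \textit{Our code will be released on GitHub.}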
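
The abstract names two ingredients: per-frame selection of historical frames that combines attention scores with diversity scores, and larger KV budgets for heads with higher estimated importance. It does not give the actual formulas, so the following is only an illustrative sketch under stated assumptions and not the authors' implementation. The greedy, MMR-style trade-off between attention and cosine-similarity redundancy, the weight `lam`, the proportional budget split, and every function name here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_history(attn, feats, budget, lam=0.5):
    """Pick up to `budget` historical frames for ONE generated frame.

    attn[i]  : attention score the generated frame gives historical frame i.
    feats[i] : feature vector of historical frame i (used only for diversity).
    Assumption: gain = lam * attention - (1 - lam) * max similarity to the
    frames already chosen. The paper's real scoring rule may differ.
    """
    chosen = []
    remaining = list(range(len(attn)))
    while remaining and len(chosen) < budget:
        def gain(i):
            redundancy = max(
                (cosine(feats[i], feats[j]) for j in chosen), default=0.0
            )
            return lam * attn[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=gain)
        chosen.append(best)
        remaining.remove(best)
    return sorted(chosen)


def select_per_frame(attn_rows, feats, budget, lam=0.5):
    """Run the selection separately for each generated frame in a chunk."""
    return [select_history(row, feats, budget, lam) for row in attn_rows]


def head_budgets(importance, total, minimum=1):
    """Split `total` cached frames across heads in proportion to importance.

    Assumption: every head keeps at least `minimum` frames, and rounding
    leftovers go to the heads with the largest fractional share.
    """
    n = len(importance)
    if total < n * minimum:
        raise ValueError("total budget is too small for the per-head minimum")
    spare = total - n * minimum
    weight_sum = sum(importance)
    if weight_sum > 0:
        raw = [spare * w / weight_sum for w in importance]
    else:
        raw = [spare / n] * n
    extra = [math.floor(r) for r in raw]
    leftover = spare - sum(extra)
    by_fraction = sorted(range(n), key=lambda h: raw[h] - extra[h], reverse=True)
    for h in by_fraction[:leftover]:
        extra[h] += 1
    return [minimum + e for e in extra]
```

In this toy setting, historical frames 0 and 1 are near-duplicates. With `lam=0.5` the selection keeps frame 0 and the dissimilar frame 2 rather than both duplicates, while `lam=1.0` falls back to attention-only ranking. Calling `select_per_frame` with one attention row per generated frame lets frames in the same chunk keep different histories.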
