Fugu-MT Paper Translation (Summary): When to Lock Attention: Training-Free KV Control in Video Diffusion

Paper Overview: When to Lock Attention: Training-Free KV Control in Video Diffusion

arxiv url: http://arxiv.org/abs/2603.09657v1
Date: Tue, 10 Mar 2026 13:31:38 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:24.338463
Title: When to Lock Attention: Training-Free KV Control in Video Diffusion
Title (reference translation): Training-Free KV Control in Video Diffusion
Authors: Tianyi Zeng, Jincheng Gao, Tianyi Wang, Zijie Meng, Miao Zhang, Jun Yin, Haoyuan Sun, Junfeng Jiao, Christian Claudel, Junbo Tan, Xueqian Wang
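Abstract summary: This paper proposes KV-Lock, a training-free framework tailored for DiT-based video diffusion models. KV-Lock uses hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based model.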
Reference score (site-computed attention score): 28.00662653127216
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often leads to background artifacts, whereas rigid background locking severely constrains the model's capacity for foreground generation. To address this issue, we propose KV-Lock, a training-free framework tailored for DiT-based video diffusion models. Our core insight is that the hallucination metric (variance of denoising prediction) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building upon this, KV-Lock leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, thereby mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based model. Extensive experiments across various video editing tasks validate that our method outperforms existing approaches, improving foreground quality while maintaining high background fidelity.
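The abstract describes a detect-then-schedule loop but gives no formulas. The sketch below illustrates the idea under stated assumptions and is not the authors' implementation: the hallucination score is taken as the mean per-element variance across several denoising predictions for the same latent and timestep (how those samples are drawn is left open), and a hypothetical piecewise-linear rule raises both the background KV lock ratio and the CFG scale once the score crosses a threshold. The names `hallucination_score` and `schedule`, as well as `threshold`, `base_ratio`, `max_ratio`, `base_cfg` and `max_cfg` and their values, are all illustrative.

```python
import numpy as np


def hallucination_score(eps_samples: np.ndarray) -> float:
    """Mean per-element variance of several denoising predictions.

    eps_samples has shape (n_samples, ...): n_samples noise predictions for the
    same latent and timestep. How the samples are produced is an assumption
    (e.g. small input perturbations); the abstract does not specify it.
    """
    # Variance across the sample axis, then averaged over all remaining elements.
    return float(np.var(eps_samples, axis=0).mean())


def schedule(score, threshold=0.05, base_ratio=0.5, max_ratio=1.0,
             base_cfg=5.0, max_cfg=7.5):
    """Map a hallucination score to (kv_lock_ratio, cfg_scale).

    Hypothetical rule: at or below `threshold` use the base values; above it,
    interpolate linearly toward the maxima, saturating at 2 * threshold.
    A higher ratio means stronger locking to the cached background KVs.
    """
    # risk is 0 at or below the threshold and 1 at twice the threshold or more
    risk = np.clip((score - threshold) / threshold, 0.0, 1.0)
    ratio = base_ratio + risk * (max_ratio - base_ratio)
    cfg_scale = base_cfg + risk * (max_cfg - base_cfg)
    return float(ratio), float(cfg_scale)


# Usage: identical samples (no spread) vs. strongly disagreeing samples.
agree = np.zeros((4, 16))
disagree = np.stack([np.full(16, float(i)) for i in range(4)])
print(schedule(hallucination_score(agree)))     # -> (0.5, 5.0)
print(schedule(hallucination_score(disagree)))  # -> (1.0, 7.5)
```

Raising both outputs together mirrors the abstract's description: when hallucination risk is detected, background locking is strengthened and conditional guidance is amplified at the same time.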
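The two scheduled quantities act at different points in a denoising step: the fusion ratio inside attention, and the CFG scale when conditional and unconditional predictions are combined. The sketch below is a single-head toy version under assumptions not confirmed by the abstract. It fuses KVs by linear blending, applied only to background tokens selected by a mask, with foreground tokens keeping their newly generated K/V. The actual KV-Lock fusion inside a DiT may differ (for example, concatenating cached KVs instead of blending them). `fuse_kv` and `bg_mask` are hypothetical names. The `cfg` function is the standard classifier-free guidance combination.

```python
import numpy as np


def fuse_kv(k_new, v_new, k_cache, v_cache, bg_mask, ratio):
    """Blend cached background K/V into the current K/V (assumed linear rule).

    k_*, v_*: (tokens, dim); bg_mask: (tokens,) bool, True marks background.
    ratio: weight on the cached values for background tokens (1.0 = fully locked).
    Foreground tokens keep their newly generated K/V.
    """
    w = np.where(bg_mask, ratio, 0.0)[:, None]  # per-token blend weight
    k = w * k_cache + (1.0 - w) * k_new
    v = w * v_cache + (1.0 - w) * v_new
    return k, v


def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def cfg(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: push the prediction toward the condition."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


# Usage: 6 tokens, the first 3 of which are background.
rng = np.random.default_rng(0)
T, D = 6, 8  # tokens, channels
q, k_new, v_new, k_cache, v_cache = (rng.normal(size=(T, D)) for _ in range(5))
bg_mask = np.array([True, True, True, False, False, False])
k, v = fuse_kv(k_new, v_new, k_cache, v_cache, bg_mask, ratio=0.9)
h = attention(q, k, v)
print(h.shape)  # -> (6, 8)
eps = cfg(np.zeros(D), np.ones(D), scale=7.5)
print(eps[:2])  # -> [7.5 7.5]
```

In a full pipeline, `ratio` and `scale` would come from a per-step hallucination-driven schedule rather than the fixed values used here.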
