Fugu-MT Paper Translation (Summary): Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

Paper summary: Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

arxiv url: http://arxiv.org/abs/2605.09681v1
Date: Sun, 10 May 2026 17:59:21 GMT
Status: Translation complete
System update date: 2026-05-12 23:28:50.369277
Title: Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
Title (reference translation): Forced KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
Authors: Yicheng Ji, Zhizhou Zhong, Jun Zhang, Qin Yang, XiTai Jin, Ying Qin, Wenhan Luo, Shuiyang Mao, Wei Liu, Huan Li
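Abstract summary: This paper introduces KV cache compression into autoregressive video diffusion. It proposes Forcing-KV, a hybrid KV cache compression strategy that performs structured static pruning for static heads and dynamic pruning based on segment-wise similarity for dynamic heads. The method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU along with a 30% cache memory reduction, delivering up to 1.35x and 1.50x speedups on LongLive and Self Forcing and scaling further to a 2.82x speedup at 1080P.
Reference score (independently computed attention score): 32.39747481484621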
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoregressive (AR) video diffusion models adopt a streaming generation framework, enabling long-horizon video generation with real-time responsiveness, as exemplified by the Self Forcing training paradigm. However, existing AR video diffusion models still suffer from significant attention complexity and severe memory overhead due to the redundant key-value (KV) caches across historical frames, which limits scalability. In this paper, we tackle this challenge by introducing KV cache compression into autoregressive video diffusion. We observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. Building on our empirical study of head-wise functional specialization, we divide the attention heads into two categories: static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. We then propose Forcing-KV, a hybrid KV cache compression strategy that performs structured static pruning for static heads and dynamic pruning based on segment-wise similarity for dynamic heads. While maintaining output quality, our method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU along with 30% cache memory reduction, delivering up to 1.35x and 1.50x speedups on LongLive and Self Forcing at 480P resolution, and further scaling to 2.82x speedup at 1080P resolution. Code and demo videos are provided at https://zju-jiyicheng.github.io/Forcing-KV-Page.
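Abstract (reference translation): Autoregressive (AR) video diffusion models adopt a streaming generation framework and, as exemplified by the Self Forcing training paradigm, enable long-horizon video generation with real-time responsiveness. However, existing AR video diffusion models still suffer from significant attention complexity and severe memory overhead because of redundant key-value (KV) caches across historical frames, which limits scalability. This paper addresses the problem by introducing KV cache compression into autoregressive video diffusion. Attention heads in mainstream AR diffusion models show markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. Based on an empirical study of head-wise functional specialization, the attention heads are divided into two categories: static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. The paper then proposes Forcing-KV, a hybrid KV cache compression strategy that performs structured static pruning for static heads and dynamic pruning based on segment-wise similarity for dynamic heads. While maintaining output quality, the method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU along with a 30% cache memory reduction, delivering up to 1.35x and 1.50x speedups on LongLive and Self Forcing at 480P resolution and scaling further to a 2.82x speedup at 1080P resolution. Code and demo videos are available at https://zju-jiyicheng.github.io/Forcing-KV-Page.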
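
Illustrative sketch (not from the paper): the abstract names two pruning modes but does not give their rules, so the Python below is only one possible reading. Every name and parameter here (`static_keep`, `dynamic_keep`, `compress`, `sink`, `window`, `threshold`) is hypothetical, as is the choice to summarise each cached segment by a single key vector. Static heads keep a fixed structural pattern of segments; dynamic heads drop segments whose summary is nearly parallel to the last kept one.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def static_keep(num_segments, sink=1, window=2):
    """Assumed structured rule: keep the first `sink` and the last `window` segments."""
    idx = set(range(min(sink, num_segments)))
    idx.update(range(max(0, num_segments - window), num_segments))
    return sorted(idx)


def dynamic_keep(segment_keys, threshold=0.95):
    """Assumed similarity rule: skip a segment whose summary key has cosine
    similarity >= threshold with the last kept segment. The newest segment
    is always kept."""
    if not segment_keys:
        return []
    kept = [0]
    for i in range(1, len(segment_keys)):
        if cosine(segment_keys[i], segment_keys[kept[-1]]) < threshold:
            kept.append(i)
    last = len(segment_keys) - 1
    if kept[-1] != last:
        kept.append(last)
    return kept


def compress(cache, head_types):
    """Return, per head, the indices of cached segments to retain.

    cache:      {head_name: [segment_summary_vector, ...]}
    head_types: {head_name: "static" or "dynamic"}  (assumed to be known offline)
    """
    plan = {}
    for head, segments in cache.items():
        if head_types[head] == "static":
            plan[head] = static_keep(len(segments))
        else:
            plan[head] = dynamic_keep(segments)
    return plan
```

In this sketch the head labels are supplied as input, which matches the abstract's observation that head roles remain stable across samples and denoising steps; how the paper actually assigns them is not stated here.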


Related papers