Fugu-MT Paper Translation (Abstract): KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study

Paper overview: KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study

arxiv url: http://arxiv.org/abs/2603.27469v1
Date: Sun, 29 Mar 2026 01:35:16 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:44.97543
Title: KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study
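Title (reference translation): KV Cache Quantization for Self-Forcing Video Generation: An Empirical Study of 33 Methods
Authors: Suraj Ranganath, Vaishak Menon, Anish Patnaik
Abstract summary: This paper presents a comprehensive study of KV-cache compression for self-forcing video generation on a Wan2.1-based Self-Forcing stack. The study covers 33 quantization and cache-policy variants, 610 prompt-level observations, and 63 benchmark-level summaries. Peak VRAM, runtime, realized compression ratio, VBench imaging quality, BF16-referenced fidelity (SSIM, LPIPS, PSNR), and terminal drift are evaluated jointly.
Reference score (independently computed attention score): 0.0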
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-forcing video generation extends a short-horizon video model to longer rollouts by repeatedly feeding generated content back in as context. This scaling path immediately exposes a systems bottleneck: the key-value (KV) cache grows with rollout length, so longer videos require not only better generation quality but also substantially better memory behavior. We present a comprehensive empirical study of KV-cache compression for self-forcing video generation on a Wan2.1-based Self-Forcing stack. Our study covers 33 quantization and cache-policy variants, 610 prompt-level observations, and 63 benchmark-level summaries across two evaluation settings: MovieGen for single-shot 10-second generation and StoryEval for longer narrative-style stability. We jointly evaluate peak VRAM, runtime, realized compression ratio, VBench imaging quality, BF16-referenced fidelity (SSIM, LPIPS, PSNR), and terminal drift. Three findings are robust. First, the strongest practical operating region is a FlowCache-inspired soft-prune INT4 adaptation, which reaches 5.42-5.49x compression while reducing peak VRAM from 19.28 GB to about 11.7 GB with only modest runtime overhead. Second, the highest-fidelity compressed methods, especially PRQ_INT4 and QUAROT_KV_INT4, are not the best deployment choices because they preserve quality at severe runtime or memory cost. Third, nominal compression alone is not sufficient: several methods shrink KV storage but still exceed BF16 peak VRAM because the current integration reconstructs or retains large BF16 buffers during attention and refresh stages. The result is a benchmark harness, analysis workflow, and empirical map of which KV-cache ideas are practical today and which are promising research directions for better memory integration. Code, data products, and the presentation dashboard are available at https://github.com/suraj-ranganath/kv-quant-longhorizon/.
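Abstract (reference translation): Self-forcing video generation extends a short-horizon video model to longer rollouts by repeatedly feeding generated content back in as context. Because the key-value (KV) cache grows with rollout length, longer videos require not only better generation quality but also substantially better memory behavior. We present a comprehensive study of KV-cache compression for self-forcing video generation on a Wan2.1-based Self-Forcing stack. The study compares 33 quantization and cache-policy variants, 610 prompt-level observations, and 63 benchmark-level summaries across two evaluation settings. We jointly evaluate peak VRAM, runtime, realized compression ratio, VBench imaging quality, BF16-referenced fidelity (SSIM, LPIPS, PSNR), and terminal drift. Three findings are robust. First, a FlowCache-inspired soft-prune INT4 adaptation achieves 5.42-5.49x compression and reduces peak VRAM from 19.28 GB to about 11.7 GB. Second, the highest-fidelity compressed methods, notably PRQ_INT4 and QUAROT_KV_INT4, are not the best deployment choices because they preserve quality only at severe runtime or memory cost. Third, several methods shrink KV storage yet still exceed BF16 peak VRAM, because the current integration reconstructs or retains large BF16 buffers during the attention and refresh stages. The result is a benchmark harness, an analysis workflow, and an empirical map of which KV-cache ideas are practical today and which are promising research directions for better memory integration. Code, data products, and the presentation dashboard are available at https://github.com/suraj-ranganath/kv-quant-longhorizon/.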
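
The abstract's first and third findings can be made concrete with some back-of-the-envelope arithmetic. The sketch below is not taken from the paper or its repository: the function names, the group size of 64, the 16-bit per-group scale, the keep fraction, and the gigabyte figures in the usage note are all illustrative assumptions. It shows how pruning combined with INT4 storage could push the realized compression ratio above the nominal 4x, and how a retained BF16 scratch buffer could push peak memory above the uncompressed baseline.

```python
def effective_bits_per_element(bits=4, group_size=64, scale_bits=16, zero_point_bits=0):
    """Average stored bits per cached element, including per-group metadata.

    All defaults are assumptions for illustration, not the paper's settings.
    """
    return bits + (scale_bits + zero_point_bits) / group_size


def realized_compression_ratio(keep_fraction=1.0, baseline_bits=16, **quant_kwargs):
    """Ratio of BF16 cache size to compressed cache size.

    keep_fraction is the share of cache entries still stored after pruning
    (1.0 means nothing is pruned).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    stored_bits = keep_fraction * effective_bits_per_element(**quant_kwargs)
    return baseline_bits / stored_bits


def peak_vram_gb(other_gb, bf16_cache_gb, ratio, bf16_scratch_gb=0.0):
    """Toy peak-memory model: non-cache memory + compressed cache + BF16 scratch.

    bf16_scratch_gb stands for any BF16 buffer that is reconstructed or kept
    alive alongside the compressed cache.
    """
    return other_gb + bf16_cache_gb / ratio + bf16_scratch_gb
```

Under these assumptions, `realized_compression_ratio()` gives about 3.76x for INT4 alone, and `realized_compression_ratio(keep_fraction=0.7)` gives about 5.38x. With hypothetical values of 8 GB of non-cache memory and an 8 GB BF16 cache (a 16 GB baseline), `peak_vram_gb(8.0, 8.0, 4.0)` returns 10.0, while `peak_vram_gb(8.0, 8.0, 4.0, bf16_scratch_gb=8.0)` returns 18.0, which is above the baseline even though the stored cache shrank fourfold. For reference, the reduction reported in the abstract, from 19.28 GB to about 11.7 GB, corresponds to roughly a 39% drop in peak VRAM.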
