Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization
- URL: http://arxiv.org/abs/2602.02958v1
- Date: Tue, 03 Feb 2026 00:54:32 GMT
- Title: Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization
- Authors: Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, Zhiying Xu, Jun Wu, Chenfeng Xu, Ion Stoica, Song Han, Kurt Keutzer
- Abstract summary: Quant VideoGen (QVG) is a training-free KV-cache quantization framework for autoregressive video diffusion models. It reduces KV memory by up to 7.0× with less than 4% end-to-end latency overhead. It consistently outperforms existing baselines in generation quality.
- Score: 83.406036390582
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite rapid progress in autoregressive video diffusion, an emerging system-algorithm bottleneck limits both deployability and generation capability: KV-cache memory. In autoregressive video generation models, the KV cache grows with generation history and quickly dominates GPU memory, often exceeding 30 GB, preventing deployment on widely available hardware. More critically, constrained KV-cache budgets restrict the effective working memory, directly degrading long-horizon consistency in identity, layout, and motion. To address this challenge, we present Quant VideoGen (QVG), a training-free KV-cache quantization framework for autoregressive video diffusion models. QVG leverages video spatiotemporal redundancy through Semantic-Aware Smoothing, producing low-magnitude, quantization-friendly residuals. It further introduces Progressive Residual Quantization, a coarse-to-fine multi-stage scheme that reduces quantization error while enabling a smooth quality-memory trade-off. Across LongCat-Video, HY-WorldPlay, and Self-Forcing benchmarks, QVG establishes a new Pareto frontier between quality and memory efficiency, reducing KV-cache memory by up to 7.0× with less than 4% end-to-end latency overhead while consistently outperforming existing baselines in generation quality.
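The abstract names two mechanisms: Semantic-Aware Smoothing, which exploits spatiotemporal redundancy so that only low-magnitude residuals need to be stored, and Progressive Residual Quantization, a coarse-to-fine multi-stage quantizer. The paper's exact algorithms are not reproduced here, so the sketch below is a hedged illustration only: the running-mean reference, the per-tensor symmetric quantizer, and the (2, 2)-bit stage schedule are all assumptions, not QVG's actual design.

```python
# Illustrative sketch of residual KV quantization in the spirit of QVG.
# Every function here is a simplification: `smooth_reference` stands in for
# Semantic-Aware Smoothing, and the stage loop for Progressive Residual
# Quantization; neither matches the paper's exact formulation.
import numpy as np

def quantize_uniform(x, bits):
    """Symmetric per-tensor uniform quantization; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1              # 1 for 2-bit, 7 for 4-bit
    scale = np.abs(x).max() / qmax + 1e-8
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def smooth_reference(kv_history):
    """Stand-in smoothing: predict the new frame's KV as the running mean of
    recent frames, so the stored residual is low-magnitude."""
    return kv_history.mean(axis=0)

def progressive_residual_quantize(kv_frame, kv_history, stage_bits=(2, 2)):
    """Coarse-to-fine multi-stage quantization of the smoothing residual.
    Each stage quantizes what the previous stages failed to capture, and
    later stages can be dropped to trade quality for memory."""
    reference = smooth_reference(kv_history)
    residual = kv_frame - reference
    reconstruction = reference.copy()
    for bits in stage_bits:
        stage = quantize_uniform(residual, bits)
        reconstruction += stage
        residual -= stage                   # pass remaining error onward
    return reconstruction

# Toy usage: 4 past frames of (tokens, head_dim) keys plus a new frame.
history = np.random.randn(4, 16, 64).astype(np.float32)
frame = history[-1] + 0.1 * np.random.randn(16, 64).astype(np.float32)
recon = progressive_residual_quantize(frame, history)
print("reconstruction MSE:", float(((recon - frame) ** 2).mean()))
```

Because the residual after smoothing is small, even 2-bit stages capture most of it, and dropping a stage frees memory at a graceful quality cost, which is the smooth quality-memory trade-off the abstract describes.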
Related papers
- XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression [20.18561757219652]
XStreamVGGT is a tuning-free approach that seamlessly integrates pruning and quantization to systematically compress the KV cache. It achieves largely negligible performance degradation while substantially reducing memory usage, by 4.42×.
arXiv Detail & Related papers (2026-02-25T11:02:02Z)
- Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion [53.14908419375226]
Existing approaches generally rely on KV-cache policies that ignore differences in token importance in long-term video generation. We propose a novel Past- and Future-Informed KV Cache Policy (PaFu-KV). Specifically, PaFu-KV introduces a lightweight Salience Estimation Head, distilled from a bidirectional cache teacher, to estimate salience scores. This policy yields a better quality-efficiency trade-off by shrinking KV-cache capacity and reducing memory footprint at inference time.
arXiv Detail & Related papers (2026-01-29T15:55:29Z)
- PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache [61.57938553036056]
We introduce PackCache, a training-free KV-cache management method that compacts the KV cache through three coordinated mechanisms. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2× on 48-frame long sequences.
arXiv Detail & Related papers (2026-01-07T19:51:06Z)
- XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression [54.28208936996186]
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV-cache quantization.
arXiv Detail & Related papers (2025-10-13T10:17:21Z)
- QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency. We propose QuantSpec, a novel self-speculative decoding framework where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
arXiv Detail & Related papers (2025-02-05T20:43:48Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV-cache memory consumption during inference. We propose ThinK, a novel query-dependent KV-cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels (a toy sketch of this channel-pruning idea appears after this list).
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
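As referenced in the ThinK summary above, query-driven channel pruning can be illustrated with a small sketch. The scoring rule below (mean absolute query activation times mean absolute key activation per channel) is a plausible simplification of minimizing attention-weight loss, not ThinK's exact criterion; `prune_key_channels` and `keep_ratio` are hypothetical names.

```python
# Toy illustration of query-driven key-channel pruning (ThinK-style).
import numpy as np

def prune_key_channels(keys, queries, keep_ratio=0.5):
    """Keep the key channels that contribute most to q.k attention logits.

    keys:    (num_tokens, head_dim) cached keys
    queries: (num_queries, head_dim) recent queries used for scoring
    """
    # Per-channel proxy for contribution to the attention logits.
    scores = np.abs(queries).mean(axis=0) * np.abs(keys).mean(axis=0)
    keep = max(1, int(keys.shape[1] * keep_ratio))
    idx = np.sort(np.argsort(scores)[-keep:])   # most significant channels
    return keys[:, idx], idx                    # pruned cache + channel index

keys = np.random.randn(128, 64).astype(np.float32)
queries = np.random.randn(8, 64).astype(np.float32)
pruned, idx = prune_key_channels(keys, queries, keep_ratio=0.4)
print(pruned.shape)  # (128, 25): ~60% of key-cache memory per head saved
```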