Related papers: PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache

PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache

URL: http://arxiv.org/abs/2601.04359v1
Date: Wed, 07 Jan 2026 19:51:06 GMT
Title: PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache
Authors: Kunyang Li, Mubarak Shah, Yuzhang Shang,
Abstract summary: We introduce PackCache, a training-free KV-cache management method that compacts the KV cache through three coordinated mechanisms.<n>In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences.
Score: 61.57938553036056
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A unified autoregressive model is a Transformer-based framework that addresses diverse multimodal tasks (e.g., text, image, video) as a single sequence modeling problem under a shared token space. Such models rely on the KV-cache mechanism to reduce attention computation from O(T^2) to O(T); however, KV-cache size grows linearly with the number of generated tokens, and it rapidly becomes the dominant bottleneck limiting inference efficiency and generative length. Unified autoregressive video generation inherits this limitation. Our analysis reveals that KV-cache tokens exhibit distinct spatiotemporal properties: (i) text and conditioning-image tokens act as persistent semantic anchors that consistently receive high attention, and (ii) attention to previous frames naturally decays with temporal distance. Leveraging these observations, we introduce PackCache, a training-free KV-cache management method that dynamically compacts the KV cache through three coordinated mechanisms: condition anchoring that preserves semantic references, cross-frame decay modeling that allocates cache budget according to temporal distance, and spatially preserving position embedding that maintains coherent 3D structure under cache removal. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences, showcasing its strong potential for enabling longer-sequence video generation. Notably, the final four frames - the portion most impacted by the progressively expanding KV-cache and thus the most expensive segment of the clip - PackCache delivers a 2.6x and 3.7x acceleration on A40 and H200, respectively, for 48-frame videos.

Related papers

Flow caching for autoregressive video generation [72.10021661412364]
We present FlowCache, the first caching framework specifically designed for autoregressive video generation.<n>Our method achieves remarkable speedups of 2.38 times on MAGI-1 and 6.7 times on SkyReels-V2, with negligible quality degradation.
arXiv Detail & Related papers (2026-02-11T13:11:04Z)
Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization [83.406036390582]
Quant VideoGen (QVG) is a training free KV cache quantization framework for autoregressive video diffusion models.<n>It reduces KV memory by up to 7.0 times with less than 4% end to end latency overhead.<n>It consistently outperforms existing baselines in generation quality.
arXiv Detail & Related papers (2026-02-03T00:54:32Z)
Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention [37.91838955436801]
Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines.<n>As generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency.<n>We propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor matching; and AnnSA sparsifies self-attention by restricting each query
arXiv Detail & Related papers (2026-02-02T08:31:21Z)
EpiCache: Episodic KV Cache Management for Long Conversational Question Answering [15.288494370436469]
We introduce EpiCache, a training-free KV cache management framework for long conversational question answering.<n>EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression.<n>Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40%, maintains near-full KV accuracy under 4-6x compression, and reduces latency/memory by up to 2.4x/3.5x.
arXiv Detail & Related papers (2025-09-22T06:56:35Z)
dKV-Cache: The Cache for Diffusion Language Models [53.85291644298835]
Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models.<n>We propose a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs.<n>Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process.
arXiv Detail & Related papers (2025-05-21T17:32:10Z)
CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences [36.05521425453999]
Large language models (LLMs) excel at processing long sequences, boosting demand for key-value ( KV) caching.<n>We introduce Cascading and Adaptive KV cache Eviction (CAKE), a novel approach that frames KV cache eviction as a "cake-slicing problem"<n>CAKE assesses layer-specific preferences by considering attention dynamics in both spatial and temporal dimensions, allocates rational cache size for layers accordingly, and manages memory constraints in a cascading manner.
arXiv Detail & Related papers (2025-03-16T12:49:44Z)
Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing [66.66090399385304]
Ca2-VDM is an efficient autoregressive VDM with Causal generation and Cache sharing.<n>For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps.<n>For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost.
arXiv Detail & Related papers (2024-11-25T13:33:41Z)
KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called textit KVSharer, which shares the KV cache between layers to achieve layer-wise compression. Experiments show that textit KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption. We verify that textit KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
arXiv Detail & Related papers (2024-10-24T08:06:41Z)
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models [48.03117580340151]
Key-Value ( KV) cache stores key-value states of previously generated tokens. The size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation. We present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective.
arXiv Detail & Related papers (2024-05-23T09:43:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.