AMS-KV: Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers
- URL: http://arxiv.org/abs/2511.16047v1
- Date: Thu, 20 Nov 2025 05:10:12 GMT
- Title: AMS-KV: Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers
- Authors: Boxun Xu, Yu Wang, Zihu Wang, Peng Li
- Abstract summary: Key and Value (KV) caching in large language models (LLMs) has been extensively studied, but next-scale prediction presents unique challenges. We introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. Compared to the vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%.
- Score: 6.1675897118034975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual autoregressive modeling (VAR) via next-scale prediction has emerged as a scalable image generation paradigm. While Key and Value (KV) caching in large language models (LLMs) has been extensively studied, next-scale prediction presents unique challenges, and KV caching design for next-scale based VAR transformers remains largely unexplored. A major bottleneck is the excessive growth of KV memory as the number of scales increases, which severely limits scalability. Our systematic investigation reveals that: (1) attending to tokens from local scales contributes significantly to generation quality; (2) allocating a small amount of memory to the coarsest scales, termed condensed scales, stabilizes multi-scale image generation; and (3) strong KV similarity across finer scales is predominantly observed in cache-efficient layers, whereas cache-demanding layers exhibit weaker inter-scale similarity. Based on these observations, we introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. AMS-KV prioritizes storing KVs from condensed and local scales, preserving the most relevant tokens to maintain generation quality. It further optimizes KV cache utilization and computational efficiency by identifying cache-demanding layers through inter-scale similarity analysis. Compared to the vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%. Moreover, when the baseline VAR-d30 model encounters out-of-memory failures at a batch size of 128, AMS-KV enables stable scaling to a batch size of 256 with improved throughput.
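The retention idea in the abstract (keep KVs from the coarsest "condensed" scales and from the most recent "local" scales) can be illustrated with a rough token-count sketch. This is not the authors' implementation: the function names, the parameters `n_condensed` and `n_local`, and the scale schedule of side lengths are all assumptions made for illustration, and the sketch ignores the per-layer, similarity-based part of the method. It therefore does not reproduce the 84.83% figure reported above.

```python
from typing import List, Optional, Sequence


def tokens_per_scale(side_lengths: Sequence[int]) -> List[int]:
    """Each scale is a square token map, so it holds side * side tokens."""
    return [s * s for s in side_lengths]


def retained_scales(current: int, n_condensed: int, n_local: int) -> List[int]:
    """Hypothetical policy: indices of earlier scales whose KVs are kept
    while generating scale `current` (0-based).

    Keeps the first `n_condensed` scales (coarsest) and the `n_local`
    scales immediately preceding `current`.
    """
    condensed = {i for i in range(current) if i < n_condensed}
    local = {i for i in range(current) if i >= current - n_local}
    return sorted(condensed | local)


def peak_cached_tokens(
    side_lengths: Sequence[int],
    n_condensed: Optional[int] = None,
    n_local: Optional[int] = None,
) -> int:
    """Largest number of cached tokens (per layer) over all generation steps.

    With no policy arguments, every earlier scale is kept (full cache).
    """
    tokens = tokens_per_scale(side_lengths)
    peak = 0
    for current in range(1, len(tokens)):
        if n_condensed is None or n_local is None:
            kept = list(range(current))  # vanilla: cache all earlier scales
        else:
            kept = retained_scales(current, n_condensed, n_local)
        peak = max(peak, sum(tokens[i] for i in kept))
    return peak


if __name__ == "__main__":
    # Assumed 10-scale schedule of token-map side lengths (illustrative only).
    sides = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)
    print(peak_cached_tokens(sides))        # full cache: 424 tokens
    print(peak_cached_tokens(sides, 2, 2))  # 2 condensed + 2 local: 274 tokens
```

Under these assumed settings the cache holds 274 tokens at its peak instead of 424, because the mid-resolution scales are dropped once they are no longer local to the scale being generated.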
Related papers
- ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution [84.41751286055909]
We develop a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generation. We formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens.
arXiv Detail & Related papers (2026-02-03T07:16:51Z)
- PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache [61.57938553036056]
We introduce PackCache, a training-free KV-cache management method that compacts the KV cache through three coordinated mechanisms. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences.
arXiv Detail & Related papers (2026-01-07T19:51:06Z)
- KV-CAR: KV Cache Compression using Autoencoders and KV Reuse in Large Language Models [3.5171501100868876]
The KV cache grows with sequence length and embedding dimension, often exceeding the memory footprint of the model itself. We present KV-CAR, a unified and architecture-agnostic framework that significantly reduces KV cache storage while maintaining model fidelity. Evaluations on GPT-2 and TinyLLaMA models across Wikitext, C4, PIQA, and Winogrande datasets demonstrate that KV-CAR achieves up to 47.85% KV cache memory reduction.
arXiv Detail & Related papers (2025-12-07T08:40:52Z)
- KV-Efficient VLA: A Method of Speed up Vision Language Model with RNN-Gated Chunked KV Cache [0.9238700679836854]
Vision-Language-Action (VLA) models promise unified robotic perception and control, yet their scalability is constrained by the quadratic cost of attention and the unbounded growth of key-value (KV) memory during long-horizon inference. We present KV-Efficient VLA, a model-agnostic memory compression framework that addresses these limitations by introducing a lightweight, training-friendly mechanism to selectively retain high-utility context. Our method integrates seamlessly into existing autoregressive and hybrid VLA stacks, enabling scalable inference without modifying training pipelines or downstream control logic.
arXiv Detail & Related papers (2025-09-20T02:04:24Z)
- KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache [7.019967158501771]
We present KVComp, a generic and efficient KV cache management framework optimized for long-text generation. KVComp employs novel lossy compression techniques specifically designed for KV cache data characteristics. We show that KVComp achieves on average 47% and up to 83% higher memory reduction rate compared to existing methods.
arXiv Detail & Related papers (2025-08-30T18:25:19Z)
- Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression [21.840636839249026]
We introduce ScaleKV, a novel KV cache compression framework tailored for Visual Autoregressive (VAR) architectures. Based on two critical observations, ScaleKV categorizes transformer layers into two functional groups: drafters and refiners. Our approach effectively reduces the required KV cache memory to 10% while preserving pixel-level fidelity.
arXiv Detail & Related papers (2025-05-26T07:11:42Z)
- KVCrush: Key value cache size-reduction using similarity in head-behaviour [40.792661186062396]
Key-value (KV) caching has emerged as a crucial optimization technique for accelerating inference in large language models (LLMs). However, the memory footprint of the KV cache is a major bottleneck for model deployment, directly impacting the model's batch size. We propose KVCrush, which can be combined with many KV compression technologies to improve model accuracy with a much smaller memory footprint.
arXiv Detail & Related papers (2025-02-24T02:57:51Z)
- PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [97.41972925670508]
Large vision-language models (LVLMs) incur significant computational and memory overhead during inference. We present PrefixKV, where "Prefix" means the top-ranked KV based on importance rather than position in the original sequence. Our method achieves state-of-the-art performance compared with other methods.
arXiv Detail & Related papers (2024-12-04T15:48:59Z)
- KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression.
Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption.
We verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
arXiv Detail & Related papers (2024-10-24T08:06:41Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [38.732413451399]
PyramidKV is a novel and effective KV cache compression method. We show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache. In the Needle-in-a-Haystack experiment, PyramidKV outperforms competing methods in maintaining long-context comprehension.
arXiv Detail & Related papers (2024-06-04T07:51:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.