KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning
- URL: http://arxiv.org/abs/2602.23592v1
- Date: Fri, 27 Feb 2026 01:48:07 GMT
- Title: KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning
- Authors: Zebin Yang, Tong Xie, Baotong Lu, Shaoshan Liu, Bo Yu, Meng Li,
- Abstract summary: We propose KEEP, a KV-cache-centric memory management system for efficient embodied planning.<n>KEEP features 3 key innovations: (1) a Static-Dynamic Memory Construction algorithm that reduces KV cache recomputation by mixed-granularity memory group; (2) a Multi-hop Memory Re-computation algorithm that dynamically identifies important cross-attention among different memory groups; and (3) a Layer-balanced Memory Loading that eliminates unbalanced KV cache loading and cross-attention across different layers.
- Score: 8.216400469571084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Memory-augmented Large Language Models (LLMs) have demonstrated remarkable capability for complex and long-horizon embodied planning. By keeping track of past experiences and environmental states, memory enables LLMs to maintain a global view, thereby avoiding repetitive exploration. However, existing approaches often store the memory as raw text, leading to excessively long prompts and high prefill latency. While it is possible to store and reuse the KV caches, the efficiency benefits are greatly undermined due to frequent KV cache updates. In this paper, we propose KEEP, a KV-cache-centric memory management system for efficient embodied planning. KEEP features 3 key innovations: (1) a Static-Dynamic Memory Construction algorithm that reduces KV cache recomputation by mixed-granularity memory group; (2) a Multi-hop Memory Re-computation algorithm that dynamically identifies important cross-attention among different memory groups and reconstructs memory interactions iteratively; (3) a Layer-balanced Memory Loading that eliminates unbalanced KV cache loading and cross-attention computation across different layers. Extensive experimental results have demonstrated that KEEP achieves 2.68x speedup with negligible accuracy loss compared with text-based memory methods on ALFRED dataset. Compared with the KV re-computation method CacheBlend (EuroSys'25), KEEP shows 4.13% success rate improvement and 1.90x time-to-first-token (TTFT) reduction. Our code is available on https://github.com/PKU-SEC-Lab/KEEP_Embodied_Memory.
Related papers
- DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity [50.52392445266824]
We propose a residual-based KV cache compression framework motivated by long-range inter-token similarity and highly shared latent components in KV representations.<n>Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage.<n>Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME.
arXiv Detail & Related papers (2026-02-08T15:14:36Z) - CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving [5.216774377033164]
Large Language Models (LLMs) have revolutionized natural language processing tasks.<n>LLMs face challenges due to the massive memory requirements of key-value ( KV) caches.<n>We propose textbfCXL-SpecKV, a novel disaggregated KV-cache architecture.
arXiv Detail & Related papers (2025-12-11T15:40:36Z) - MemShare: Memory Efficient Inference for Large Reasoning Models through KV Cache Reuse [14.695547830142516]
Large Reasoning Models (LRMs) have achieved significant advances in mathematical reasoning and formal logic tasks.<n>Their tendency to generate lengthy chain-of-thought sequences leads to substantial memory overhead during inference.<n>We propose MemShare, a novel KV cache management approach that effectively reduces memory overhead.
arXiv Detail & Related papers (2025-07-29T02:05:51Z) - ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs [15.76582272387931]
We propose ZSMerge, a dynamic KV cache compression framework for efficient cache management.<n>ZSMerge significantly enhances memory efficiency and inference speed with negligible performance degradation.
arXiv Detail & Related papers (2025-03-13T03:36:03Z) - CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs [45.77132019859689]
CalibQuant is a visual quantization strategy that drastically reduces both memory and computational overhead.<n>We achieve a 10x throughput increase on InternVL models.
arXiv Detail & Related papers (2025-02-15T05:08:01Z) - ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression [10.003118268356017]
Long context poses significant challenges for inference efficiency.<n>We introduce ClusterKV, which recalls tokens at the granularity of semantic clusters.<n>Experiment results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths.
arXiv Detail & Related papers (2024-12-04T10:58:27Z) - KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called textit KVSharer, which shares the KV cache between layers to achieve layer-wise compression.
Experiments show that textit KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption.
We verify that textit KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
arXiv Detail & Related papers (2024-10-24T08:06:41Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.<n>This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.<n>We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value ( KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z) - KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $mathbf2.6times$ less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.