You Need an Encoder for Native Position-Independent Caching
- URL: http://arxiv.org/abs/2602.01519v1
- Date: Mon, 02 Feb 2026 01:23:13 GMT
- Title: You Need an Encoder for Native Position-Independent Caching
- Authors: Shiju Zhao, Junhao Hu, Jiaqi Zheng, Guihai Chen
- Abstract summary: The Key-Value cache of Large Language Models (LLMs) is prefix-based. Position-Independent Caching (PIC) has been proposed to enable KV reuse without positional constraints. We propose native PIC by reintroducing the encoder to prevalent decoder-only LLMs and explicitly training it to support PIC. We further develop COMB, a PIC-aware caching system that integrates seamlessly with existing inference frameworks.
- Score: 28.778240400537175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Key-Value (KV) cache of Large Language Models (LLMs) is prefix-based, making it highly inefficient for processing contexts retrieved in arbitrary order. Position-Independent Caching (PIC) has been proposed to enable KV reuse without positional constraints; however, existing approaches often incur substantial accuracy degradation, limiting their practical adoption. To address this issue, we propose native PIC by reintroducing the encoder to prevalent decoder-only LLMs and explicitly training it to support PIC. We further develop COMB, a PIC-aware caching system that integrates seamlessly with existing inference frameworks. Experimental results show that COMB reduces Time-to-First-Token (TTFT) by 51-94% and increases throughput by 3$\times$ with comparable accuracy. Furthermore, the quality improvement when using DeepSeek-V2-Lite-Chat demonstrates the applicability of COMB to other types of decoder-only LLMs. Our code is available at https://github.com/shijuzhao/Comb.
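To make the idea concrete, the sketch below shows the general shape of position-independent KV caching: cached entries are keyed by chunk content rather than by the prefix they appeared in, so chunks retrieved in any order can be reused. This is a minimal illustration of the PIC concept, not COMB's actual implementation; all names (`PositionIndependentKVCache`, `chunk_key`, `encode_fn`) are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional

def chunk_key(token_ids: List[int]) -> str:
    """Content hash of a chunk; the same chunk hits the cache
    regardless of where it lands in the final prompt."""
    return hashlib.sha256(bytes(str(token_ids), "utf-8")).hexdigest()

class PositionIndependentKVCache:
    def __init__(self) -> None:
        # Maps content hash -> per-layer (K, V) tensors (opaque here).
        self._store: Dict[str, object] = {}

    def get(self, token_ids: List[int]) -> Optional[object]:
        return self._store.get(chunk_key(token_ids))

    def put(self, token_ids: List[int], kv: object) -> None:
        self._store[chunk_key(token_ids)] = kv

def assemble_context(cache, chunks, encode_fn):
    """Reuse cached KV for known chunks; encode only the misses.
    A prefix-based cache would instead invalidate every chunk that
    appears after the first reordering."""
    kvs = []
    for chunk in chunks:
        kv = cache.get(chunk)
        if kv is None:
            kv = encode_fn(chunk)  # e.g. the trained encoder in native PIC
            cache.put(chunk, kv)
        kvs.append(kv)
    return kvs
```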
Related papers
- Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking [8.189266513060621]
Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter.
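The "lightweight attention-based adapter" suggests compressing precomputed vision tokens with learned query vectors via cross-attention. The following is a hedged sketch of that general idea; the module name and dimensions are assumptions, not EDJE's published architecture.

```python
import torch
import torch.nn as nn

class TokenCompressionAdapter(nn.Module):
    """Hypothetical sketch: compress N precomputed vision tokens into
    M << N tokens using cross-attention from learned queries."""
    def __init__(self, dim: int = 768, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, N, dim), precomputed offline and stored.
        q = self.queries.unsqueeze(0).expand(vision_tokens.size(0), -1, -1)
        compressed, _ = self.attn(q, vision_tokens, vision_tokens)
        return compressed  # (batch, num_queries, dim)
```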
arXiv Detail & Related papers (2025-10-08T09:46:09Z)
- CommVQ: Commutative Vector Quantization for KV Cache Compression [50.37946553931796]
We propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook.
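Additive (residual) quantization, which the abstract pairs with a lightweight encoder and codebook, can be sketched as below: each codebook stage quantizes the residual left by the previous one. This is a generic residual-VQ illustration, not CommVQ's RoPE-commutative construction.

```python
import torch

def residual_quantize(x: torch.Tensor, codebooks: list):
    """Generic additive quantization sketch: each stage picks the nearest
    codeword for the remaining residual; the sum of codewords approximates x.
    x: (n, d); each codebook: (k, d)."""
    residual = x.clone()
    codes = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # (n, k) pairwise L2 distances
        idx = dists.argmin(dim=1)           # nearest codeword per row
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, residual

def residual_dequantize(codes, codebooks):
    return sum(cb[idx] for idx, cb in zip(codes, codebooks))

# Usage: quantize a toy "key cache" with two 4-bit codebooks (16 entries each).
keys = torch.randn(32, 64)
codebooks = [torch.randn(16, 64), torch.randn(16, 64)]
codes, err = residual_quantize(keys, codebooks)
approx = residual_dequantize(codes, codebooks)
```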
arXiv Detail & Related papers (2025-06-23T17:50:11Z)
- QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency. We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
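The self-speculative loop described here can be sketched as: the same model drafts tokens cheaply with a quantized KV cache, then verifies them with the full-precision path. A minimal, model-agnostic sketch; `draft_step` and `verify` are hypothetical callables, not QuantSpec's API.

```python
from typing import Callable, List, Tuple

def self_speculative_decode(
    draft_step: Callable[[List[int]], int],                      # cheap pass: 4-bit KV cache
    verify: Callable[[List[int], List[int]], Tuple[int, int]],   # full-precision pass
    prompt: List[int],
    gamma: int = 4,
    max_new: int = 64,
) -> List[int]:
    """Hypothetical sketch: the *same* model drafts gamma tokens with a
    quantized KV cache; the full-precision pass accepts the longest
    matching prefix and supplies one corrected (or bonus) token."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        draft = []
        for _ in range(gamma):
            draft.append(draft_step(out + draft))
        accepted, next_tok = verify(out, draft)  # accepted in [0, gamma]
        out.extend(draft[:accepted])
        out.append(next_tok)
    return out
```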
arXiv Detail & Related papers (2025-02-05T20:43:48Z)
- MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving [28.024240207609854]
This paper proposes position-independent caching as a more effective approach for multimodal information management. We have designed and implemented a caching system, named MPIC, to address both system-level and algorithm-level challenges.
arXiv Detail & Related papers (2025-02-04T03:13:09Z)
- EPIC: Efficient Position-Independent Caching for Serving Large Language Models [19.510078997414606]
Caching improves serving performance by reusing Key-Value vectors across requests. Existing context caching requires exact prefixes across requests. We introduce Position-Independent Caching (PIC), which enables modular reuse of KV vectors regardless of prefixes. We also introduce EPIC, a serving system incorporating our new LegoLink algorithm, which mitigates the inappropriate "attention sink" effect at every document beginning.
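The "attention sink" the abstract mentions arises because each independently encoded document begins as if it were a sequence start. Below is a heavily hedged sketch of the general mitigation the abstract suggests: when stitching cached chunks, recompute the first few tokens of every non-leading chunk so they stop acting as sinks. Function names and the recompute budget are assumptions, not LegoLink's exact algorithm.

```python
from typing import Callable, List, Tuple

def link_cached_chunks(
    chunks: List[Tuple[object, List[int]]],            # (cached KV, token ids) per chunk
    recompute_kv: Callable[[List[int], int], object],  # fresh KV for the last k tokens of a context
    k: int = 2,
) -> List[object]:
    """Hypothetical sketch: concatenate independently cached chunks, but
    recompute the first k tokens of each non-leading chunk, whose cached
    KV was computed as if they began a sequence (the "attention sink")."""
    linked: List[object] = []
    context: List[int] = []
    for i, (kv, tokens) in enumerate(chunks):
        if i > 0:
            # Replace the stale head of this chunk with KV computed in context.
            head_kv = recompute_kv(context + tokens[:k], k)
            kv = (head_kv, kv)  # head overrides the chunk's first-k entries
        linked.append(kv)
        context += tokens
    return linked
```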
arXiv Detail & Related papers (2024-10-20T08:42:29Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
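The query-driven pruning idea can be sketched directly: score each key-cache channel by a query-weighted magnitude and keep only the top channels. A minimal illustration of query-dependent channel pruning; the scoring rule is an assumption, not ThinK's exact criterion.

```python
import torch

def prune_key_channels(K: torch.Tensor, Q: torch.Tensor, keep_ratio: float = 0.6):
    """K: (seq, d) cached keys; Q: (q_len, d) recent queries.
    Score channel c by sum_t |Q[t, c]| * sum_s |K[s, c]|, a crude proxy
    for its contribution to attention logits, then keep the top channels."""
    score = Q.abs().sum(dim=0) * K.abs().sum(dim=0)   # (d,) per-channel score
    keep = max(1, int(K.size(1) * keep_ratio))
    idx = score.topk(keep).indices.sort().values      # channel indices kept
    return K[:, idx], idx  # pruned keys + indices to gather Q at decode time

# Usage: at decode time, compute logits with Q[:, idx] @ K_pruned.T.
K, Q = torch.randn(128, 64), torch.randn(8, 64)
K_pruned, idx = prune_key_channels(K, Q)
```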
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- Let the Code LLM Edit Itself When You Edit the Code [50.46536185784169]
This paper introduces Positional Integrity Encoding (PIE), which lets the model update its context efficiently when the code is edited. Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach.
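Assuming rotary position embeddings, a cached key can be moved from its old position to a new one with a single extra rotation rather than a full recomputation, since rotary rotations compose additively. The sketch below illustrates that correction step; all names are hypothetical, and this is a sketch of the general RoPE property rather than PIE's exact implementation.

```python
import torch

def rope_rotate(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0):
    """Apply a rotary rotation of angle pos * theta_i to each channel pair.
    x: (seq, d) with d even; pos: (seq,) positions (may be negative)."""
    d = x.size(1)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = pos[:, None].float() * inv_freq[None, :]   # (seq, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# After an edit inserts tokens earlier in the file, cached keys that carry
# RoPE at their old positions can be shifted by delta = new_pos - old_pos:
# rotating by delta is equivalent to recomputing RoPE at the new positions.
keys = torch.randn(10, 64)           # cached keys (RoPE applied at old positions)
delta = torch.full((10,), 3)         # e.g. 3 tokens inserted before them
keys_corrected = rope_rotate(keys, delta)
```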
arXiv Detail & Related papers (2024-07-03T14:34:03Z)
- Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition [20.052245837954175]
We propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture.
We introduce an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference.
We use a hybrid CTC/RNNT architecture with a shared encoder and both CTC and RNNT decoders to boost accuracy and save computation.
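Activation caching for a streaming encoder can be sketched as: each layer keeps the activations of its last few input frames and prepends them as left context for the next chunk, so a non-autoregressive encoder processes audio chunk by chunk. A generic sketch, not the FastConformer implementation; the layer interface and cache size are assumptions.

```python
from typing import List, Optional
import torch
import torch.nn as nn

class CachedStreamingEncoder(nn.Module):
    """Hypothetical sketch of activation caching: each layer sees the
    current chunk plus `cache_len` cached frames of left context, then
    its cache is refreshed with the newest inputs."""
    def __init__(self, layers: nn.ModuleList, cache_len: int = 16):
        super().__init__()
        self.layers, self.cache_len = layers, cache_len

    def forward(self, chunk: torch.Tensor,
                caches: Optional[List[torch.Tensor]] = None):
        # chunk: (batch, frames, dim); one cache tensor per layer.
        if caches is None:
            caches = [chunk.new_zeros(chunk.size(0), self.cache_len, chunk.size(2))
                      for _ in self.layers]
        new_caches, x = [], chunk
        for layer, cache in zip(self.layers, caches):
            x_ctx = torch.cat([cache, x], dim=1)   # left context + new frames
            y = layer(x_ctx)[:, self.cache_len:]   # keep outputs for new frames only
            new_caches.append(x_ctx[:, -self.cache_len:].detach())
            x = y
        return x, new_caches
```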
arXiv Detail & Related papers (2023-12-27T21:04:26Z)
- Key Frame Mechanism For Efficient Conformer Based End-to-end Speech Recognition [9.803556181225193]
Conformer, as a backbone network for end-to-end automatic speech recognition, has achieved state-of-the-art performance.
However, the Conformer-based model encounters an issue with the self-attention mechanism.
We introduce a key frame-based self-attention (KFSA) mechanism, a novel method to reduce the computation of self-attention using key frames.
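Key-frame self-attention can be sketched as: pick a subset of "key" frames and let every frame attend only to those, shrinking the attention matrix from T×T to T×K. The selection rule below (frame energy) is a stand-in assumption, not KFSA's actual criterion.

```python
import torch

def key_frame_attention(x: torch.Tensor, num_key: int = 32):
    """x: (T, d) encoder frames. Hypothetical sketch: select the num_key
    highest-energy frames as keys/values, so attention costs O(T*K)
    instead of O(T^2)."""
    energy = x.norm(dim=1)                        # stand-in selection score
    idx = energy.topk(min(num_key, x.size(0))).indices
    k = v = x[idx]                                # (K, d) selected key frames
    attn = torch.softmax(x @ k.T / x.size(1) ** 0.5, dim=-1)  # (T, K)
    return attn @ v                               # (T, d)
```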
arXiv Detail & Related papers (2023-10-23T13:55:49Z)
- Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR [67.63332492134332]
We design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs.
Our proposed encoder can double as a strong standalone encoder on device, and as the first part of a high-performance ASR pipeline.
arXiv Detail & Related papers (2023-03-31T23:30:48Z)
- Quick Dense Retrievers Consume KALE: Post Training Kullback Leibler Alignment of Embeddings for Asymmetrical dual encoders [89.29256833403169]
We introduce Kullback Leibler Alignment of Embeddings (KALE), an efficient and accurate method for increasing the inference efficiency of dense retrieval methods.
KALE extends traditional Knowledge Distillation after bi-encoder training, allowing for effective query encoder compression without full retraining or index generation.
Using KALE and asymmetric training, we can generate models which exceed the performance of DistilBERT while having 3x faster inference.
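The KL alignment step can be sketched as post-hoc distillation: freeze the trained teacher query encoder and train a smaller student so its query-document score distribution matches the teacher's under a KL loss, leaving the document embeddings and index untouched. A generic sketch of that objective, not necessarily KALE's exact formulation.

```python
import torch
import torch.nn.functional as F

def kale_style_loss(student_q, teacher_q, doc_emb, tau: float = 1.0):
    """student_q, teacher_q: (batch, d) query embeddings;
    doc_emb: (n_docs, d) fixed document embeddings (index unchanged).
    KL divergence between teacher and student score distributions."""
    s_logits = student_q @ doc_emb.T / tau
    t_logits = teacher_q @ doc_emb.T / tau
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean")

# Usage: optimize only the student query encoder with this loss, so the
# document embeddings and vector index never need regeneration.
```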
arXiv Detail & Related papers (2023-03-31T15:44:13Z)