Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2602.12235v2
- Date: Fri, 13 Feb 2026 11:58:27 GMT
- Title: Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation
- Authors: Julia Belikova, Danila Rozhevskii, Dennis Svirin, Konstantin Polev, Alexander Panchenko
- Abstract summary: We define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query. In this paper, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
- Score: 49.48204107529758
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility -- and when compression begins to erase task-relevant content -- remain underexplored. In this paper, we define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
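As an illustration of the query-aware detection idea, here is a minimal probe sketch in Python. The feature construction (concatenating query and compressed-context embeddings) and the logistic-regression classifier are assumptions for illustration, not the paper's exact probe architecture:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical inputs, one embedding per example:
# q_emb: query representations, ctx_emb: compressed (xRAG-style) context
# representations, overflow: 1 if the compressed context no longer suffices.
rng = np.random.default_rng(0)
n, d_q, d_c = 1000, 256, 256
q_emb = rng.normal(size=(n, d_q))
ctx_emb = rng.normal(size=(n, d_c))
overflow = rng.integers(0, 2, size=n)

# Query-aware probe: concatenate query and context features so the
# classifier can condition overflow detection on what is being asked.
X = np.concatenate([q_emb, ctx_emb], axis=1)

probe = LogisticRegression(max_iter=1000)
probe.fit(X[:800], overflow[:800])
scores = probe.predict_proba(X[800:])[:, 1]
# Near 0.5 on this synthetic data; the paper reports 0.72 on real QA sets.
print("AUC-ROC:", roc_auc_score(overflow[800:], scores))
```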
Related papers
- Multi-Vector Index Compression in Any Modality [73.7330345057813]
Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation.
arXiv Detail & Related papers (2026-02-24T18:57:33Z)
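A rough sketch of attention-guided clustering in this spirit; the saliency scoring (top-k attention) and nearest-centroid assignment below are assumptions, not AGC's exact algorithm:

```python
import numpy as np

def attention_guided_clusters(token_emb, attn_scores, k):
    """Compress (n, d) token vectors into k vectors: the k most-attended
    tokens act as centroids, and aggregation is attention-weighted.
    Assumes positive attention scores (e.g. softmax outputs)."""
    centroid_idx = np.argsort(attn_scores)[-k:]
    centroids = token_emb[centroid_idx]
    # Assign each token to its nearest centroid by cosine similarity.
    t = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    assign = (t @ c.T).argmax(axis=1)
    compressed = np.empty_like(centroids)
    for j in range(k):
        members = assign == j
        if not members.any():          # degenerate tie: fall back to centroid
            compressed[j] = centroids[j]
            continue
        w = attn_scores[members][:, None]
        compressed[j] = (w * token_emb[members]).sum(axis=0) / w.sum()
    return compressed                  # (k, d) compressed multi-vector index
```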
- Arbitrary Ratio Feature Compression via Next Token Prediction [52.10426317889982]
The Arbitrary Ratio Feature Compression (ARFC) framework supports any compression ratio with a single model. ARFC is an auto-regressive model that performs compression via next-token prediction. The MoS module refines the compressed tokens by utilizing multiple compression results. ERGC is integrated into the training process to preserve semantic and structural relationships during compression.
arXiv Detail & Related papers (2026-02-12T02:38:57Z)
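A toy auto-regressive compressor illustrating the "any ratio with one model" idea; the architecture below (a single cross-attending decoder layer) is an assumption for illustration, not ARFC:

```python
import torch
import torch.nn as nn

class ARCompressor(nn.Module):
    """Toy auto-regressive compressor: emits compressed tokens one by one,
    so a single model supports any target length k (any compression ratio)."""
    def __init__(self, d=64, n_heads=4):
        super().__init__()
        self.start = nn.Parameter(torch.zeros(1, 1, d))
        self.layer = nn.TransformerDecoderLayer(d, n_heads, batch_first=True)

    def forward(self, feats, k):
        # feats: (B, n_tokens, d) uncompressed features; k: output length.
        out = self.start.expand(feats.size(0), 1, -1)
        for _ in range(k):
            nxt = self.layer(out, feats)[:, -1:]  # predict next compressed token
            out = torch.cat([out, nxt], dim=1)
        return out[:, 1:]                         # (B, k, d) compressed tokens

feats = torch.randn(2, 128, 64)
print(ARCompressor()(feats, k=16).shape)  # torch.Size([2, 16, 64])
```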
- Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective [21.41673002861847]
Retrieval-Augmented Generation (RAG) effectively grounds Large Language Models (LLMs) with external knowledge. Recent research on soft context compression aims to address this by encoding long documents into compact embeddings. We introduce SeleCom, a selector-based soft compression framework for RAG that redefines the encoder's role as a query-conditioned information selector.
arXiv Detail & Related papers (2026-01-25T09:06:24Z)
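The selector view can be pictured as query-conditioned pruning over document token embeddings; a minimal sketch in which cosine scoring and top-k selection stand in for SeleCom's learned selector:

```python
import numpy as np

def query_conditioned_select(doc_emb, query_emb, k):
    """Keep the k document token embeddings most similar to the query,
    in original order -- compression conditioned on what is being asked."""
    sims = doc_emb @ query_emb / (
        np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    keep = np.sort(np.argsort(sims)[-k:])  # top-k, original order preserved
    return doc_emb[keep]                   # (k, d) query-aware compressed context
```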
- Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings [52.49524240846879]
We propose Hierarchical Token Prepending (HTP) to mitigate attention-level compression and readout-level over-squashing. HTP partitions the input into blocks and prepends block-level summary tokens to subsequent blocks, creating pathways for backward information flow. As a simple, architecture-agnostic method, HTP enhances both zero-shot and finetuned models, offering a scalable route to superior long-document embeddings.
arXiv Detail & Related papers (2025-11-18T19:37:40Z)
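A schematic of block partitioning with prepended summary tokens; using the block mean as the summary is an assumption for illustration:

```python
import numpy as np

def hierarchical_token_prepend(tokens, block_size):
    """Partition a (n, d) token sequence into blocks and prepend summaries
    of all earlier blocks to each block, opening backward information
    pathways for later tokens."""
    blocks = [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
    summaries, augmented = [], []
    for block in blocks:
        augmented.append(np.vstack(summaries + [block]))
        summaries.append(block.mean(axis=0, keepdims=True))  # summary token
    return augmented  # block b is preceded by summaries of blocks 0..b-1

seq = np.random.default_rng(0).normal(size=(10, 4))
parts = hierarchical_token_prepend(seq, block_size=4)
print([p.shape for p in parts])  # [(4, 4), (5, 4), (4, 4)]
```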
- Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods [54.4711434793961]
We show that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks. Motivated by these findings, we introduce VTC-Bench, an evaluation framework that incorporates a data filtering mechanism to denoise existing benchmarks.
arXiv Detail & Related papers (2025-10-08T15:44:28Z)
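The downsampling baseline the paper highlights can be as simple as average pooling; a sketch, not the paper's evaluation code:

```python
import numpy as np

def downsample_baseline(image, factor):
    """Average-pool an (H, W, C) image by `factor` -- the simple
    downsampling baseline reported as surprisingly strong."""
    h, w = image.shape[:2]
    h, w = h - h % factor, w - w % factor        # crop to a multiple of factor
    img = image[:h, :w]
    return img.reshape(h // factor, factor, w // factor, factor, -1).mean((1, 3))

img = np.random.default_rng(0).random((9, 9, 3))
print(downsample_baseline(img, 2).shape)  # (4, 4, 3)
```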
- AttnComp: Attention-Guided Adaptive Context Compression for Retrieval-Augmented Generation [27.480791258325066]
We introduce AttnComp, an adaptive, efficient and context-aware compression framework. AttnComp employs a Top-P compression algorithm to retain the minimal set of documents. In addition to compression, AttnComp estimates response confidence by assessing the overall relevance of the retrieved content.
arXiv Detail & Related papers (2025-09-22T08:18:50Z)
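A nucleus-style Top-P selection over retrieved documents could look like the following; normalizing attention-derived relevance scores into a distribution is an assumption here:

```python
import numpy as np

def top_p_select(relevance, p=0.9):
    """Keep the smallest set of documents whose normalized relevance mass
    reaches p (nucleus-style cutoff; the exact criterion is an assumption)."""
    order = np.argsort(relevance)[::-1]              # most relevant first
    probs = relevance[order] / relevance[order].sum()
    cutoff = np.searchsorted(np.cumsum(probs), p) + 1
    return np.sort(order[:cutoff])                   # indices of kept documents

rel = np.array([0.55, 0.05, 0.25, 0.10, 0.05])
print(top_p_select(rel, p=0.8))  # [0 2] -- two docs cover 80% of the mass
```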
- UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression [86.33995240043936]
UniGist is a sequence-level long-context compression framework for large language models. It efficiently preserves context information by replacing raw tokens with special compression tokens (gists) in a fine-grained manner. Our scheme also supports flexible inference by allowing the actual removal of compressed tokens, resulting in real-time memory savings.
arXiv Detail & Related papers (2025-09-19T08:47:37Z)
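The fine-grained gist layout can be pictured as interleaving a compression token after each chunk of raw tokens; a schematic, not UniGist's training or kernel design:

```python
def insert_gists(token_ids, chunk_size, gist_id):
    """Interleave one gist token after every chunk of raw tokens; once the
    model has encoded the sequence, raw tokens of finished chunks can be
    dropped and only the gists kept, saving memory."""
    out = []
    for i in range(0, len(token_ids), chunk_size):
        out.extend(token_ids[i:i + chunk_size])
        out.append(gist_id)  # gist summarizes the chunk that precedes it
    return out

print(insert_gists([1, 2, 3, 4, 5, 6, 7], chunk_size=3, gist_id=0))
# [1, 2, 3, 0, 4, 5, 6, 0, 7, 0]
```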
- Compressed Feature Quality Assessment: Dataset and Baselines [89.62929964441962]
We propose the first benchmark dataset for evaluating semantic fidelity of compressed features. We systematically assess three widely used metrics -- MSE, cosine similarity, and Centered Kernel Alignment (CKA) -- in terms of their ability to capture semantic degradation. This work advances the field by establishing a foundational benchmark and providing a critical resource for the community to explore CFQA.
arXiv Detail & Related papers (2025-06-09T04:16:39Z)
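Of the three metrics assessed, CKA is the least standard; a minimal implementation of linear CKA between original and compressed features:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices
    (rows = examples); one of the three metrics the paper assesses."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

feats = np.random.default_rng(0).normal(size=(100, 32))
compressed = feats[:, :16]            # crude stand-in for a compressed feature
print(linear_cka(feats, feats))       # 1.0 -- identical features
print(linear_cka(feats, compressed))  # < 1.0 after information loss
```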
- EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation [8.757777529568383]
Current RAG systems often struggle when retrieval models fail to rank the most relevant documents. We introduce EXIT, an extractive context compression framework. Our evaluations show that EXIT consistently surpasses existing compression methods.
arXiv Detail & Related papers (2024-12-17T05:38:27Z)
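Extractive compression reduces to keeping query-relevant sentences; a simplified sketch with a similarity threshold standing in for EXIT's trained relevance classifier:

```python
import numpy as np

def extractive_compress(sentences, sent_emb, query_emb, threshold=0.5):
    """Keep sentences whose cosine similarity to the query clears a
    threshold, preserving document order."""
    sims = sent_emb @ query_emb / (
        np.linalg.norm(sent_emb, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    return " ".join(s for s, sim in zip(sentences, sims) if sim >= threshold)
```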
- Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models [21.025001473355996]
We formalize the problem of prompt compression for large language models (LLMs). We present a framework to unify token-level prompt compression methods which create hard prompts for black-box models. We show that there is a large gap between the performance of current prompt compression methods and the optimal strategy.
arXiv Detail & Related papers (2024-07-22T09:40:13Z)
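The rate-distortion view can be made concrete on a toy prompt: for each rate (number of tokens kept), find the kept subset with minimal distortion. The brute-force search and toy distortion function below are assumptions for illustration:

```python
from itertools import combinations

def rd_curve(tokens, distortion, max_rate):
    """Trace the optimal rate-distortion frontier for hard prompt
    compression by brute force: at each rate r (tokens kept), report
    the best subset's distortion."""
    return {r: min(distortion(s) for s in combinations(tokens, r))
            for r in range(1, max_rate + 1)}

# Toy distortion: how many "important" tokens were dropped.
important = {"paris", "capital"}
prompt = ["the", "capital", "of", "france", "is", "paris"]
d = lambda kept: len(important - set(kept))
print(rd_curve(prompt, d, max_rate=3))  # {1: 1, 2: 0, 3: 0}
```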
This list is automatically generated from the titles and abstracts of the papers on this site.