Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2602.12235v2
- Date: Fri, 13 Feb 2026 11:58:27 GMT
- Title: Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation
- Authors: Julia Belikova, Danila Rozhevskii, Dennis Svirin, Konstantin Polev, Alexander Panchenko
- Abstract summary: We define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query. In this paper, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
- Score: 49.48204107529758
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility -- and when compression begins to erase task-relevant content -- remain underexplored. In this paper, we define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
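As an illustration of the query-aware detection idea, here is a minimal probe sketch in Python. The feature construction (concatenating query and compressed-context embeddings) and the logistic-regression classifier are assumptions for illustration, not the paper's exact probe architecture:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical inputs, one embedding per example:
# q_emb: query representations, ctx_emb: compressed (xRAG-style) context
# representations, overflow: 1 if the compressed context no longer suffices.
rng = np.random.default_rng(0)
n, d_q, d_c = 1000, 256, 256
q_emb = rng.normal(size=(n, d_q))
ctx_emb = rng.normal(size=(n, d_c))
overflow = rng.integers(0, 2, size=n)

# Query-aware probe: concatenate query and context features so the
# classifier can condition overflow detection on what is being asked.
X = np.concatenate([q_emb, ctx_emb], axis=1)

probe = LogisticRegression(max_iter=1000)
probe.fit(X[:800], overflow[:800])
scores = probe.predict_proba(X[800:])[:, 1]
# Near 0.5 on this synthetic data; the paper reports 0.72 on real QA sets.
print("AUC-ROC:", roc_auc_score(overflow[800:], scores))
```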
Related papers
- Multi-Vector Index Compression in Any Modality [73.7330345057813]
Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation.
arXiv Detail & Related papers (2026-02-24T18:57:33Z)
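A rough sketch of attention-guided clustering in this spirit; the saliency scoring (top-k attention) and nearest-centroid assignment below are assumptions, not AGC's exact algorithm:

```python
import numpy as np

def attention_guided_clusters(token_emb, attn_scores, k):
    """Compress (n, d) token vectors into k vectors: the k most-attended
    tokens act as centroids, and aggregation is attention-weighted.
    Assumes positive attention scores (e.g. softmax outputs)."""
    centroid_idx = np.argsort(attn_scores)[-k:]
    centroids = token_emb[centroid_idx]
    # Assign each token to its nearest centroid by cosine similarity.
    t = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    assign = (t @ c.T).argmax(axis=1)
    compressed = np.empty_like(centroids)
    for j in range(k):
        members = assign == j
        if not members.any():          # degenerate tie: fall back to centroid
            compressed[j] = centroids[j]
            continue
        w = attn_scores[members][:, None]
        compressed[j] = (w * token_emb[members]).sum(axis=0) / w.sum()
    return compressed                  # (k, d) compressed multi-vector index
```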
- Arbitrary Ratio Feature Compression via Next Token Prediction [52.10426317889982]
The Arbitrary Ratio Feature Compression (ARFC) framework supports any compression ratio with a single model. ARFC is an auto-regressive model that performs compression via next-token prediction. The MoS module refines the compressed tokens by utilizing multiple compression results. ERGC is integrated into the training process to preserve semantic and structural relationships during compression.
arXiv Detail & Related papers (2026-02-12T02:38:57Z)
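A toy auto-regressive compressor illustrating the "any ratio with one model" idea; the architecture below (a single cross-attending decoder layer) is an assumption for illustration, not ARFC:

```python
import torch
import torch.nn as nn

class ARCompressor(nn.Module):
    """Toy auto-regressive compressor: emits compressed tokens one by one,
    so a single model supports any target length k (any compression ratio)."""
    def __init__(self, d=64, n_heads=4):
        super().__init__()
        self.start = nn.Parameter(torch.zeros(1, 1, d))
        self.layer = nn.TransformerDecoderLayer(d, n_heads, batch_first=True)

    def forward(self, feats, k):
        # feats: (B, n_tokens, d) uncompressed features; k: output length.
        out = self.start.expand(feats.size(0), 1, -1)
        for _ in range(k):
            nxt = self.layer(out, feats)[:, -1:]  # predict next compressed token
            out = torch.cat([out, nxt], dim=1)
        return out[:, 1:]                         # (B, k, d) compressed tokens

feats = torch.randn(2, 128, 64)
print(ARCompressor()(feats, k=16).shape)  # torch.Size([2, 16, 64])
```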
- Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective [21.41673002861847]
Retrieval-Augmented Generation (RAG) effectively grounds Large Language Models (LLMs) with external knowledge. Recent research on soft context compression aims to address this by encoding long documents into compact embeddings. We introduce SeleCom, a selector-based soft compression framework for RAG that redefines the encoder's role as a query-conditioned information selector.
arXiv Detail & Related papers (2026-01-25T09:06:24Z)
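The selector view can be pictured as query-conditioned pruning over document token embeddings; a minimal sketch in which cosine scoring and top-k selection stand in for SeleCom's learned selector:

```python
import numpy as np

def query_conditioned_select(doc_emb, query_emb, k):
    """Keep the k document token embeddings most similar to the query,
    in original order -- compression conditioned on what is being asked."""
    sims = doc_emb @ query_emb / (
        np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    keep = np.sort(np.argsort(sims)[-k:])  # top-k, original order preserved
    return doc_emb[keep]                   # (k, d) query-aware compressed context
```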
- Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings [52.49524240846879]
We propose Hierarchical Token Prepending (HTP) to mitigate attention-level compression and readout-level over-squashing. HTP partitions the input into blocks and prepends block-level summary tokens to subsequent blocks, creating pathways for backward information flow. As a simple, architecture-agnostic method, HTP enhances both zero-shot and finetuned models, offering a scalable route to superior long-document embeddings.
arXiv Detail & Related papers (2025-11-18T19:37:40Z)
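A schematic of block partitioning with prepended summary tokens; using the block mean as the summary is an assumption for illustration:

```python
import numpy as np

def hierarchical_token_prepend(tokens, block_size):
    """Partition a (n, d) token sequence into blocks and prepend summaries
    of all earlier blocks to each block, opening backward information
    pathways for later tokens."""
    blocks = [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
    summaries, augmented = [], []
    for block in blocks:
        augmented.append(np.vstack(summaries + [block]))
        summaries.append(block.mean(axis=0, keepdims=True))  # summary token
    return augmented  # block b is preceded by summaries of blocks 0..b-1

seq = np.random.default_rng(0).normal(size=(10, 4))
parts = hierarchical_token_prepend(seq, block_size=4)
print([p.shape for p in parts])  # [(4, 4), (5, 4), (4, 4)]
```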
- Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods [54.4711434793961]
We show that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks. Motivated by these findings, we introduce VTC-Bench, an evaluation framework that incorporates a data filtering mechanism to denoise existing benchmarks.
arXiv Detail & Related papers (2025-10-08T15:44:28Z)
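The downsampling baseline the paper highlights can be as simple as average pooling; a sketch, not the paper's evaluation code:

```python
import numpy as np

def downsample_baseline(image, factor):
    """Average-pool an (H, W, C) image by `factor` -- the simple
    downsampling baseline reported as surprisingly strong."""
    h, w = image.shape[:2]
    h, w = h - h % factor, w - w % factor        # crop to a multiple of factor
    img = image[:h, :w]
    return img.reshape(h // factor, factor, w // factor, factor, -1).mean((1, 3))

img = np.random.default_rng(0).random((9, 9, 3))
print(downsample_baseline(img, 2).shape)  # (4, 4, 3)
```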
- AttnComp: Attention-Guided Adaptive Context Compression for Retrieval-Augmented Generation [27.480791258325066]
We introduce AttnComp, an adaptive, efficient and context-aware compression framework. AttnComp employs a Top-P compression algorithm to retain the minimal set of documents. In addition to compression, AttnComp estimates response confidence by assessing the overall relevance of the retrieved content.
arXiv Detail & Related papers (2025-09-22T08:18:50Z)
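A nucleus-style Top-P selection over retrieved documents could look like the following; normalizing attention-derived relevance scores into a distribution is an assumption here:

```python
import numpy as np

def top_p_select(relevance, p=0.9):
    """Keep the smallest set of documents whose normalized relevance mass
    reaches p (nucleus-style cutoff; the exact criterion is an assumption)."""
    order = np.argsort(relevance)[::-1]              # most relevant first
    probs = relevance[order] / relevance[order].sum()
    cutoff = np.searchsorted(np.cumsum(probs), p) + 1
    return np.sort(order[:cutoff])                   # indices of kept documents

rel = np.array([0.55, 0.05, 0.25, 0.10, 0.05])
print(top_p_select(rel, p=0.8))  # [0 2] -- two docs cover 80% of the mass
```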
- UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression [86.33995240043936]
UniGist is a sequence-level long-context compression framework for large language models. It efficiently preserves context information by replacing raw tokens with special compression tokens (gists) in a fine-grained manner. Our scheme also supports flexible inference by allowing the actual removal of compressed tokens, resulting in real-time memory savings.
arXiv Detail & Related papers (2025-09-19T08:47:37Z)
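The fine-grained gist layout can be pictured as interleaving a compression token after each chunk of raw tokens; a schematic, not UniGist's training or kernel design:

```python
def insert_gists(token_ids, chunk_size, gist_id):
    """Interleave one gist token after every chunk of raw tokens; once the
    model has encoded the sequence, raw tokens of finished chunks can be
    dropped and only the gists kept, saving memory."""
    out = []
    for i in range(0, len(token_ids), chunk_size):
        out.extend(token_ids[i:i + chunk_size])
        out.append(gist_id)  # gist summarizes the chunk that precedes it
    return out

print(insert_gists([1, 2, 3, 4, 5, 6, 7], chunk_size=3, gist_id=0))
# [1, 2, 3, 0, 4, 5, 6, 0, 7, 0]
```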
- Compressed Feature Quality Assessment: Dataset and Baselines [89.62929964441962]
We propose the first benchmark dataset for evaluating semantic fidelity of compressed features. We systematically assess three widely used metrics -- MSE, cosine similarity, and Centered Kernel Alignment (CKA) -- in terms of their ability to capture semantic degradation. This work advances the field by establishing a foundational benchmark and providing a critical resource for the community to explore CFQA.
arXiv Detail & Related papers (2025-06-09T04:16:39Z)
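Of the three metrics assessed, CKA is the least standard; a minimal implementation of linear CKA between original and compressed features:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices
    (rows = examples); one of the three metrics the paper assesses."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

feats = np.random.default_rng(0).normal(size=(100, 32))
compressed = feats[:, :16]            # crude stand-in for a compressed feature
print(linear_cka(feats, feats))       # 1.0 -- identical features
print(linear_cka(feats, compressed))  # < 1.0 after information loss
```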
- EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation [8.757777529568383]
Current RAG systems often struggle when retrieval models fail to rank the most relevant documents. We introduce EXIT, an extractive context compression framework. Our evaluations show that EXIT consistently surpasses existing compression methods.
arXiv Detail & Related papers (2024-12-17T05:38:27Z)
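Extractive compression reduces to keeping query-relevant sentences; a simplified sketch with a similarity threshold standing in for EXIT's trained relevance classifier:

```python
import numpy as np

def extractive_compress(sentences, sent_emb, query_emb, threshold=0.5):
    """Keep sentences whose cosine similarity to the query clears a
    threshold, preserving document order."""
    sims = sent_emb @ query_emb / (
        np.linalg.norm(sent_emb, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    return " ".join(s for s, sim in zip(sentences, sims) if sim >= threshold)
```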
- Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models [21.025001473355996]
We formalize the problem of prompt compression for large language models (LLMs). We present a framework to unify token-level prompt compression methods which create hard prompts for black-box models. We show that there is a large gap between the performance of current prompt compression methods and the optimal strategy.
arXiv Detail & Related papers (2024-07-22T09:40:13Z)
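The rate-distortion view can be made concrete on a toy prompt: for each rate (number of tokens kept), find the kept subset with minimal distortion. The brute-force search and toy distortion function below are assumptions for illustration:

```python
from itertools import combinations

def rd_curve(tokens, distortion, max_rate):
    """Trace the optimal rate-distortion frontier for hard prompt
    compression by brute force: at each rate r (tokens kept), report
    the best subset's distortion."""
    return {r: min(distortion(s) for s in combinations(tokens, r))
            for r in range(1, max_rate + 1)}

# Toy distortion: how many "important" tokens were dropped.
important = {"paris", "capital"}
prompt = ["the", "capital", "of", "france", "is", "paris"]
d = lambda kept: len(important - set(kept))
print(rd_curve(prompt, d, max_rate=3))  # {1: 1, 2: 0, 3: 0}
```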
This list is automatically generated from the titles and abstracts of the papers on this site.