Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors
- URL: http://arxiv.org/abs/2510.08907v3
- Date: Sat, 18 Oct 2025 02:48:35 GMT
- Title: Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors
- Authors: Xin Liu, Runsong Zhao, Pengcheng Huang, Xinyu Liu, Junyi Xiao, Chunyang Xiao, Tong Xiao, Shengxiang Gao, Zhengtao Yu, Jingbo Zhu,
- Abstract summary: Current context compression methods rely on autoencoding tasks to train context-agnostic compression tokens to compress contextual semantics.<n>We propose Semantic-Anchor Compression (SAC), a novel method that shifts from autoencoding task based compression to an architecture that is equipped with this compression capability.<n>SAC consistently outperforms existing context compression methods across various compression ratios.
- Score: 43.02557489472655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Context compression presents a promising approach for accelerating large language model (LLM) inference by compressing long contexts into compact representations. Current context compression methods predominantly rely on autoencoding tasks to train context-agnostic compression tokens to compress contextual semantics. While autoencoding tasks enable compression tokens to acquire compression capabilities, compression via autoencoding tasks creates a fundamental mismatch: the models are optimized for reconstruction that diverge from actual downstream tasks, thereby weakening the features more beneficial for real-world usage. We propose Semantic-Anchor Compression (SAC), a novel method that shifts from autoencoding task based compression to an architecture that is equipped with this compression capability \textit{a priori}. Instead of training models to compress contexts through autoencoding tasks, SAC directly selects so-called anchor tokens from the original context and aggregates contextual information into their key-value (KV) representations. By deriving representations directly from the contextual tokens, SAC eliminates the need for autoencoding training. To ensure compression performance while directly leveraging anchor tokens, SAC incorporates two key designs: (1) anchor embeddings that enable the compressor to identify critical tokens, and (2) bidirectional attention modification that allows anchor tokens to capture information from the entire context. Experimental results demonstrate that SAC consistently outperforms existing context compression methods across various compression ratios. On out-of-distribution evaluation using MRQA, SAC achieves 1 EM improvement at 5x compression over strong baselines, with increasing advantages at higher compression ratios.
Related papers
- Cognitive Chunking for Soft Prompts: Accelerating Compressor Learning via Block-wise Causal Masking [28.492055407384495]
Long contexts increase inference latency, as the computational cost of self-attention grows quadratically with sequence length.<n>Existing methods typically compress the entire context indiscriminately into a set of memory tokens.<n>We propose Parallelized Iterative Compression (PIC), which restricts the receptive field of memory tokens to sequential local chunks.
arXiv Detail & Related papers (2026-02-15T03:58:13Z) - Seq2Seq2Seq: Lossless Data Compression via Discrete Latent Transformers and Reinforcement Learning [3.2641459166493405]
We propose a novel compression method based on Reinforcement Learning applied to a T5 language model architecture.<n>This approach enables the compression of data into sequences of tokens rather than traditional vector representations.<n>By leveraging the latent information within language models, our system effectively compresses data without requiring explicit content understanding.
arXiv Detail & Related papers (2026-02-12T16:30:55Z) - Arbitrary Ratio Feature Compression via Next Token Prediction [52.10426317889982]
Arbitrary Ratio Feature Compression (ARFC) framework supports any compression ratio with a single model.<n>ARC is an auto-regressive model that performs compression via next-gressive prediction.<n>MoS module refines the compressed tokens by utilizing multiple compression results.<n>ERGC is integrated into the training process to preserve semantic and structural relationships during compression.
arXiv Detail & Related papers (2026-02-12T02:38:57Z) - Context Compression via Explicit Information Transmission [25.078241611630585]
Long-context inference with Large Language Models (LLMs) is costly due to quadratic attention and growing key-value caches.<n>We propose ComprExIT, a lightweight framework that formulates soft compression into a new paradigm.
arXiv Detail & Related papers (2026-02-03T17:44:12Z) - Compressing Many-Shots in In-Context Learning [61.231471139896506]
We study an approach to improve the memory and computational efficiency of ICL inference by compressing the many-shot prompts.<n>We first show that existing prompt compression methods are ineffective for many-shot compression.<n>We propose MemCom, a layer-wise compression method.
arXiv Detail & Related papers (2025-10-17T16:57:42Z) - CompLLM: Compression for Long Context Q&A [47.90063873976842]
We introduce CompLLM, a soft compression technique designed for practical deployment.<n>Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently.<n>Our experiments show that with a 2x compression rate, at high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%.
arXiv Detail & Related papers (2025-09-23T16:49:43Z) - UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression [86.33995240043936]
UniGist is a sequence-level long-context compression framework for large language models.<n>It efficiently preserves context information by replacing raw tokens with special compression tokens (gists) in a fine-grained manner.<n>Our scheme also supports flexible inference by allowing the actual removal of compressed tokens, resulting in real-time memory savings.
arXiv Detail & Related papers (2025-09-19T08:47:37Z) - GMSA: Enhancing Context Compression via Group Merging and Layer Semantic Alignment [18.256369876037883]
This paper introduces GMSA, a context compression framework based on the encoder-decoder architecture.<n>GMSA reduces input sequence length and redundant information.<n>It can achieve approximately a 2x speedup in end-to-end inference.
arXiv Detail & Related papers (2025-05-18T03:21:30Z) - CODEPROMPTZIP: Code-specific Prompt Compression for Retrieval-Augmented Generation in Coding Tasks with LMs [6.936336826531964]
Retrieval-Augmented Generation (RAG) enhances coding tasks by incorporating retrieved code examples into prompts.<n>Existing prompt compression techniques focus on natural language, lacking tailored solutions for code.<n>We propose CodePromptZip, a framework that compresses code examples before integrating into RAG.
arXiv Detail & Related papers (2025-02-19T23:15:23Z) - ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference [61.412894960600205]
Large Language Models (LLMs) require significant GPU memory when processing long texts.<n>ChunkKV reimagines KV cache compression by treating semantic chunks as basic compression units.<n>Result: ChunkKV outperforms state-of-the-art methods by up to 8.7% in precision.
arXiv Detail & Related papers (2025-02-01T03:49:47Z) - Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles [49.65811277223873]
Style-Compress is a lightweight framework that adapts a smaller language model to compress prompts for a larger model on a new task without additional training.
Our approach iteratively generates and selects effective compressed prompts as task-specific demonstrations through style variation and in-context learning.
Style-Compress outperforms two baseline compression models in four tasks: original prompt reconstruction, text summarization, multi-hop QA, and CoT reasoning.
arXiv Detail & Related papers (2024-10-17T21:35:49Z) - Concise and Precise Context Compression for Tool-Using Language Models [60.606281074373136]
We propose two strategies for compressing tool documentation into concise and precise summary sequences for tool-using language models.
Results on API-Bank and APIBench show that our approach reaches a performance comparable to the upper-bound baseline under up to 16x compression ratio.
arXiv Detail & Related papers (2024-07-02T08:17:00Z) - LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression [43.048684907893104]
This paper focuses on task-agnostic prompt compression for better generalizability and efficiency.
We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one.
Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT.
arXiv Detail & Related papers (2024-03-19T17:59:56Z) - Long Context Compression with Activation Beacon [22.054232261437186]
Activation Beacon is a plug-in module for transformer-based LLMs.
It targets effective, efficient, and flexible compression of long contexts.
It achieves a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache.
arXiv Detail & Related papers (2024-01-07T11:57:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.