Cognitive Chunking for Soft Prompts: Accelerating Compressor Learning via Block-wise Causal Masking
- URL: http://arxiv.org/abs/2602.13980v1
- Date: Sun, 15 Feb 2026 03:58:13 GMT
- Title: Cognitive Chunking for Soft Prompts: Accelerating Compressor Learning via Block-wise Causal Masking
- Authors: Guojie Liu, Yiqi Wang, Yanfeng Yang, Wenqi Fan, Songlei Jian, Jianfeng Zhang, Jie Yu
- Abstract summary: Long contexts increase inference latency, as the computational cost of self-attention grows quadratically with sequence length.
Existing methods typically compress the entire context indiscriminately into a set of memory tokens.
We propose Parallelized Iterative Compression (PIC), which restricts the receptive field of memory tokens to sequential local chunks.
- Score: 28.492055407384495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Providing extensive context via prompting is vital for leveraging the capabilities of Large Language Models (LLMs). However, lengthy contexts significantly increase inference latency, as the computational cost of self-attention grows quadratically with sequence length. To mitigate this issue, context compression, particularly soft prompt compression, has emerged as a widely studied solution: it converts long contexts into shorter memory embeddings via a trained compressor. Existing methods typically compress the entire context indiscriminately into a set of memory tokens, requiring the compressor to capture global dependencies and necessitating extensive pre-training data to learn effective patterns. Inspired by the chunking mechanism in human working memory and by empirical observations of the spatial specialization of memory embeddings relative to the original tokens, we propose Parallelized Iterative Compression (PIC). By simply modifying the Transformer's attention mask, PIC explicitly restricts the receptive field of memory tokens to sequential local chunks, thereby lowering the difficulty of compressor training. Experiments across multiple downstream tasks demonstrate that PIC consistently outperforms competitive baselines, with its superiority particularly pronounced in high-compression scenarios (e.g., relative improvements of 29.8% in F1 score and 40.7% in EM score on QA tasks at the 64x compression ratio). Furthermore, PIC significantly expedites training: when training the 16x compressor, it surpasses the peak performance of the competitive baseline while reducing training time by approximately 40%.
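The paper's core change is purely an attention-mask constraint, which is worth seeing concretely. Below is a minimal sketch of one way such a block-wise causal mask could be built (PyTorch); the interleaved chunk/memory layout, the chunk size, and the decision to let memory tokens also see earlier memory groups are all assumptions on my part, since the abstract does not specify these details.

```python
import torch

def block_wise_causal_mask(num_chunks: int, chunk_len: int, mem_len: int) -> torch.Tensor:
    """Boolean mask (True = may attend) for an assumed interleaved layout:
    [chunk_1][mem_1][chunk_2][mem_2]...  Memory tokens see only their own
    local chunk (plus earlier memory groups), never the full context."""
    block = chunk_len + mem_len
    total = num_chunks * block
    mask = torch.zeros(total, total, dtype=torch.bool)
    for i in range(num_chunks):
        c0 = i * block                # start of chunk i
        m0 = c0 + chunk_len           # start of memory group i
        # Context tokens: ordinary causal attention within their own chunk.
        for t in range(chunk_len):
            mask[c0 + t, c0 : c0 + t + 1] = True
        # Memory tokens: full view of their own chunk only...
        mask[m0 : m0 + mem_len, c0 : c0 + chunk_len] = True
        # ...causal among themselves, plus earlier memory groups
        # (a guess at how the "iterative" carry-over might be wired).
        for t in range(mem_len):
            mask[m0 + t, m0 : m0 + t + 1] = True
        for j in range(i):
            pm0 = j * block + chunk_len
            mask[m0 : m0 + mem_len, pm0 : pm0 + mem_len] = True
    return mask

# Example: 3 chunks of 8 tokens, each compressed into 2 memory tokens (4x).
mask = block_wise_causal_mask(num_chunks=3, chunk_len=8, mem_len=2)
print(mask.shape)  # torch.Size([30, 30])
```

A boolean mask of this orientation (True = may attend) can be passed directly as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`, which is consistent with the paper's claim that PIC requires only a masking change to the Transformer.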
Related papers
- Arbitrary Ratio Feature Compression via Next Token Prediction [52.10426317889982]
Arbitrary Ratio Feature Compression (ARFC) is a framework that supports any compression ratio with a single model.
ARFC is an auto-regressive model that performs compression via next-token prediction.
The MoS module refines the compressed tokens by utilizing multiple compression results.
ERGC is integrated into the training process to preserve semantic and structural relationships during compression.
arXiv Detail & Related papers (2026-02-12T02:38:57Z)
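The entry above does not define ARFC's internals, but "compression via next-token prediction at an arbitrary ratio" suggests an autoregressive head that emits compressed vectors one at a time until ceil(n / ratio) are produced. The sketch below is only that hypothetical reading; the module choices, shapes, and GRU-based state update are all my assumptions.

```python
import math
import torch
import torch.nn as nn

class AutoregressiveCompressor(nn.Module):
    """Hypothetical sketch: emit compressed feature vectors one by one, so a
    single model can serve any compression ratio chosen at inference time."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.start = nn.Parameter(torch.zeros(1, 1, dim))  # learned start vector
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.step = nn.GRUCell(dim, dim)

    def forward(self, feats: torch.Tensor, ratio: int) -> torch.Tensor:
        # feats: (batch, n, dim) -> (batch, ceil(n / ratio), dim)
        b, n, d = feats.shape
        state = self.start.expand(b, 1, d).reshape(b, d)
        outs = []
        for _ in range(math.ceil(n / ratio)):
            # Cross-attend from the running state into the source features, then
            # advance the state: "next-token" prediction over compressed slots.
            ctx, _ = self.attn(state.unsqueeze(1), feats, feats)
            state = self.step(ctx.squeeze(1), state)
            outs.append(state)
        return torch.stack(outs, dim=1)

feats = torch.randn(2, 64, 256)
print(AutoregressiveCompressor()(feats, ratio=16).shape)  # torch.Size([2, 4, 256])
```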
- Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning [47.87361916374891]
We propose a framework for efficient long-context inference based on chunk-wise compression and selective memory recall.
The framework segments long inputs into chunks and encodes each chunk into compressed memory representations using a learned compressor.
It achieves up to a 2x reduction in peak GPU memory usage and a 6x inference speedup over MemAgent.
arXiv Detail & Related papers (2026-02-09T08:33:11Z)
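The chunk-then-recall idea above can be pictured with a small sketch. Everything below is assumed for illustration: mean-pooling stands in for the learned compressor, cosine similarity for the recall policy, and the paper's end-to-end RL training is omitted entirely.

```python
import torch

def compress_chunks(token_embs: torch.Tensor, chunk_len: int) -> torch.Tensor:
    """Stand-in compressor: mean-pool each chunk into one memory vector."""
    n, d = token_embs.shape
    n_trim = (n // chunk_len) * chunk_len
    return token_embs[:n_trim].reshape(-1, chunk_len, d).mean(dim=1)

def recall(memory: torch.Tensor, query: torch.Tensor, k: int) -> torch.Tensor:
    """Select the k memory slots most similar to the query."""
    scores = torch.nn.functional.cosine_similarity(memory, query.unsqueeze(0), dim=-1)
    idx = scores.topk(k).indices.sort().values  # keep original chunk order
    return memory[idx]

embs = torch.randn(512, 128)                  # embeddings of a long input
memory = compress_chunks(embs, chunk_len=32)  # (16, 128) compressed memory
selected = recall(memory, torch.randn(128), k=4)  # re-read only 4 of 16 chunks
print(memory.shape, selected.shape)
```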
- Simple Context Compression: Mean-Pooling and Multi-Ratio Training [12.049015994907629]
We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture.
We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios.
Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios.
arXiv Detail & Related papers (2025-10-23T17:57:23Z)
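Mean-pooling compression is simple enough to show in full. The sketch below assumes the most direct reading (average each consecutive group of `ratio` hidden states into one soft-prompt vector); the paper's training recipe and multi-ratio scheme are not shown.

```python
import torch

def mean_pool_compress(hidden: torch.Tensor, ratio: int) -> torch.Tensor:
    """Compress (batch, seq, dim) hidden states by averaging each
    consecutive group of `ratio` positions into one vector."""
    b, n, d = hidden.shape
    pad = (-n) % ratio  # right-pad so the sequence divides evenly
    if pad:
        # Zero-padding slightly dilutes the last group's mean; fine for a sketch.
        hidden = torch.cat([hidden, hidden.new_zeros(b, pad, d)], dim=1)
    return hidden.reshape(b, -1, ratio, d).mean(dim=2)

h = torch.randn(1, 1000, 4096)
print(mean_pool_compress(h, ratio=8).shape)  # torch.Size([1, 125, 4096])
```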
- Compressing Many-Shots in In-Context Learning [61.231471139896506]
We study an approach to improve the memory and computational efficiency of ICL inference by compressing the many-shot prompts.
We first show that existing prompt compression methods are ineffective for many-shot compression.
We propose MemCom, a layer-wise compression method.
arXiv Detail & Related papers (2025-10-17T16:57:42Z)
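The entry does not spell out MemCom's mechanics, so the sketch below illustrates only the generic idea its name points at: compressing a prompt's cached keys and values independently at every layer. Treat every detail (pooling as the operator, the cache shapes) as an assumption.

```python
import torch

def compress_kv_cache(kv_per_layer, ratio: int):
    """Layer-wise compression sketch: pool each layer's keys and values,
    shaped (heads, seq, head_dim), in groups of `ratio` positions."""
    compressed = []
    for keys, values in kv_per_layer:
        h, n, d = keys.shape
        n_trim = (n // ratio) * ratio
        k = keys[:, :n_trim].reshape(h, -1, ratio, d).mean(dim=2)
        v = values[:, :n_trim].reshape(h, -1, ratio, d).mean(dim=2)
        compressed.append((k, v))
    return compressed

cache = [(torch.randn(8, 96, 64), torch.randn(8, 96, 64)) for _ in range(4)]
print(compress_kv_cache(cache, ratio=8)[0][0].shape)  # torch.Size([8, 12, 64])
```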
Current context compression methods rely on autoencoding tasks to train context-agnostic compression tokens to compress contextual semantics.
We propose Semantic-Anchor Compression (SAC), a novel method that shifts from autoencoding-based compression to an architecture inherently equipped with this compression capability.
SAC consistently outperforms existing context compression methods across various compression ratios.
arXiv Detail & Related papers (2025-10-10T01:42:14Z)
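The entry only names "contextual semantic anchors"; one plausible reading, sketched below with every specific assumed, is to select existing context positions as anchors and aggregate the remaining tokens' states into them, instead of appending separately learned compression tokens.

```python
import torch

def anchor_compress(hidden: torch.Tensor, ratio: int) -> torch.Tensor:
    """Pick every `ratio`-th position as a 'semantic anchor' and let each
    anchor attend over the full context, so the compressed states come from
    the context itself. Entirely an assumed reading of the entry above."""
    n, d = hidden.shape
    anchors = hidden[ratio - 1 :: ratio]                    # (n // ratio, d)
    weights = torch.softmax(anchors @ hidden.T / d ** 0.5, dim=-1)
    return weights @ hidden                                 # anchor-centered summaries

h = torch.randn(64, 256)
print(anchor_compress(h, ratio=16).shape)  # torch.Size([4, 256])
```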
- CompLLM: Compression for Long Context Q&A [47.90063873976842]
We introduce CompLLM, a soft compression technique designed for practical deployment.
Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently.
Our experiments show that with a 2x compression rate, at high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%.
arXiv Detail & Related papers (2025-09-23T16:49:43Z)
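Per-segment independence has a practical payoff worth making explicit: a segment's compressed form can be cached and reused across queries that share that segment. A toy sketch, where the segment size, the stand-in 2x compressor, and the hash-based cache keying are all assumptions:

```python
import hashlib
import torch

cache: dict[str, torch.Tensor] = {}

def compress_segment(seg: torch.Tensor) -> torch.Tensor:
    # Stand-in for a learned per-segment compressor (2x pooling here).
    n, d = seg.shape
    return seg[: n - n % 2].reshape(-1, 2, d).mean(dim=1)

def compress_context(embs: torch.Tensor, seg_len: int) -> torch.Tensor:
    outs = []
    for i in range(0, embs.shape[0], seg_len):
        seg = embs[i : i + seg_len]
        key = hashlib.sha1(seg.numpy().tobytes()).hexdigest()
        if key not in cache:  # independence makes each segment reusable
            cache[key] = compress_segment(seg)
        outs.append(cache[key])
    return torch.cat(outs, dim=0)

embs = torch.randn(600, 128)
print(compress_context(embs, seg_len=200).shape)  # torch.Size([300, 128])
compress_context(embs, seg_len=200)               # second call hits the cache
```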
- UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression [86.33995240043936]
UniGist is a sequence-level long-context compression framework for large language models.
It efficiently preserves context information by replacing raw tokens with special compression tokens (gists) in a fine-grained manner.
Our scheme also supports flexible inference by allowing the actual removal of compressed tokens, resulting in real-time memory savings.
arXiv Detail & Related papers (2025-09-19T08:47:37Z)
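The fine-grained layout can be pictured as interleaving a gist slot after every few raw tokens and then physically dropping the raw tokens once their content is absorbed. The sketch below shows only this layout arithmetic; the gist spacing and the drop step are assumptions.

```python
def gist_layout(num_tokens: int, gist_every: int) -> list[str]:
    """Interleave a gist slot after every `gist_every` raw tokens."""
    seq = []
    for i in range(num_tokens):
        seq.append(f"tok{i}")
        if (i + 1) % gist_every == 0:
            seq.append(f"gist{i // gist_every}")
    return seq

seq = gist_layout(num_tokens=8, gist_every=4)
print(seq)  # ['tok0', ..., 'tok3', 'gist0', 'tok4', ..., 'tok7', 'gist1']

# After compression the raw tokens can actually be removed from the cache,
# which is where the real-time memory saving would come from:
print([t for t in seq if t.startswith("gist")])  # ['gist0', 'gist1']
```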
- MOOSComp: Improving Lightweight Long-Context Compressor via Mitigating Over-Smoothing and Incorporating Outlier Scores [5.893964327109089]
MOOSComp is a token-classification-based long-context compression method.
We introduce outlier scores to preserve rare but critical tokens that are prone to being discarded in task-agnostic compression.
Our method obtains a speedup of 3.3x at a 4x compression ratio on a resource-constrained mobile device.
arXiv Detail & Related papers (2025-04-23T15:02:53Z)
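Token-classification compression ranks tokens by a keep-probability and drops the rest; the entry's twist is blending in an outlier score so rare tokens survive. A hedged sketch, where the classifier output and the frequency-based rarity proxy are stand-ins (MOOSComp's actual scores are not given in the entry):

```python
import torch

def select_tokens(keep_prob: torch.Tensor, rarity: torch.Tensor,
                  ratio: int, alpha: float = 0.3) -> torch.Tensor:
    """Keep the top n/ratio tokens by a blended score: the classifier's
    keep-probability plus an outlier term that boosts rare tokens."""
    score = (1 - alpha) * keep_prob + alpha * rarity
    k = keep_prob.numel() // ratio
    return score.topk(k).indices.sort().values  # preserve token order

n = 32
keep_prob = torch.rand(n)              # stand-in classifier output
counts = torch.randint(1, 1000, (n,))  # stand-in corpus frequencies
rarity = 1.0 / counts.float()
rarity = rarity / rarity.max()         # normalize outlier score to [0, 1]
print(select_tokens(keep_prob, rarity, ratio=4))  # 8 retained indices at 4x
```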
- In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs).
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z)
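The cross-attention-with-digest-tokens pattern described above is straightforward to sketch. Below, the number of digest tokens, dimensions, and single-layer design are assumptions; the point is that compression cost is linear in context length because there is no context-to-context self-attention.

```python
import torch
import torch.nn as nn

class DigestCompressor(nn.Module):
    """Learnable digest tokens cross-attend over the context's word
    embeddings and come out as a short soft prompt."""
    def __init__(self, dim: int = 512, num_digest: int = 16, heads: int = 8):
        super().__init__()
        self.digest = nn.Parameter(torch.randn(num_digest, dim) * 0.02)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, ctx_embs: torch.Tensor) -> torch.Tensor:
        # ctx_embs: (batch, seq, dim) -> (batch, num_digest, dim)
        q = self.digest.unsqueeze(0).expand(ctx_embs.size(0), -1, -1)
        out, _ = self.cross(q, ctx_embs, ctx_embs)
        return out + self.ff(out)

ctx = torch.randn(2, 1024, 512)       # 1024 context word embeddings
print(DigestCompressor()(ctx).shape)  # torch.Size([2, 16, 512]), 64x fewer
```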