Simple Context Compression: Mean-Pooling and Multi-Ratio Training
- URL: http://arxiv.org/abs/2510.20797v1
- Date: Thu, 23 Oct 2025 17:57:23 GMT
- Title: Simple Context Compression: Mean-Pooling and Multi-Ratio Training
- Authors: Yair Feldman, Yoav Artzi,
- Abstract summary: We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture.<n>We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios.<n>Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios.
- Score: 12.049015994907629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A common strategy to reduce the computational costs of using long contexts in retrieval-augmented generation (RAG) with large language models (LLMs) is soft context compression, where the input sequence is transformed into a shorter continuous representation. We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture, and study training the same compressor to output multiple compression ratios. We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios. Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios. More broadly though, across architectures and training regimes the trade-offs are more nuanced, illustrating the complex landscape of compression methods.
Related papers
- Arbitrary Ratio Feature Compression via Next Token Prediction [52.10426317889982]
Arbitrary Ratio Feature Compression (ARFC) framework supports any compression ratio with a single model.<n>ARC is an auto-regressive model that performs compression via next-gressive prediction.<n>MoS module refines the compressed tokens by utilizing multiple compression results.<n>ERGC is integrated into the training process to preserve semantic and structural relationships during compression.
arXiv Detail & Related papers (2026-02-12T02:38:57Z) - Test-Time Steering for Lossless Text Compression via Weighted Product of Experts [27.679089540901007]
We propose a novel framework that performs Test-Time Steering via a Weighted Product of Experts (wPoE)<n>At inference, our method adaptively combines a universal compression model with a pretrained neural language model, ensuring the compression rate is at least as good as that of the best individual model.<n>It seamlessly integrates with any autoregressive language model, providing a practical solution for enhancing text compression across diverse data distributions.
arXiv Detail & Related papers (2025-11-04T16:37:56Z) - Compressing Many-Shots in In-Context Learning [61.231471139896506]
We study an approach to improve the memory and computational efficiency of ICL inference by compressing the many-shot prompts.<n>We first show that existing prompt compression methods are ineffective for many-shot compression.<n>We propose MemCom, a layer-wise compression method.
arXiv Detail & Related papers (2025-10-17T16:57:42Z) - OpenZL: A Graph-Based Model for Compression [2.043259170549106]
We propose a new theoretical framework for representing compression as a directed acyclic graph of modular codecs.<n>This motivates OpenZL, an implementation of this model that compresses data into a self-describing wire format.<n>OpenZL achieves superior compression ratios and speeds compared to state-of-the-art general-purpose compressors.
arXiv Detail & Related papers (2025-10-03T17:40:29Z) - CompLLM: Compression for Long Context Q&A [47.90063873976842]
We introduce CompLLM, a soft compression technique designed for practical deployment.<n>Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently.<n>Our experiments show that with a 2x compression rate, at high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%.
arXiv Detail & Related papers (2025-09-23T16:49:43Z) - UniPCGC: Towards Practical Point Cloud Geometry Compression via an Efficient Unified Approach [4.754973569457509]
We propose an efficient unified point cloud geometry compression framework, dubbed as UniPCGC.<n>It supports lossy compression, lossless compression, variable rate and variable complexity.<n>Our method achieves a compression ratio (CR) gain of 8.1% on lossless compression, and a Bjontegaard Delta Rate (BD-Rate) gain of 14.02% on lossy compression.
arXiv Detail & Related papers (2025-03-24T10:51:28Z) - In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs)
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z) - A Survey on Transformer Compression [84.18094368700379]
Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV)
Model compression methods reduce the memory and computational cost of Transformer.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z) - Neural Network Compression Via Sparse Optimization [23.184290795230897]
We propose a model compression framework based on the recent progress on sparse optimization.
We achieve up to 7.2 and 2.9 times FLOPs reduction with the same level of evaluation of accuracy on VGG16 for CIFAR10 and ResNet50 for ImageNet.
arXiv Detail & Related papers (2020-11-10T03:03:55Z) - PowerGossip: Practical Low-Rank Communication Compression in
Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD for centralized deep learning, this algorithm uses power steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.