Related papers: Data Distribution Matters: A Data-Centric Perspective on Context Compression for Large Language Model

URL: http://arxiv.org/abs/2602.01778v1
Date: Mon, 02 Feb 2026 08:01:57 GMT
Title: Data Distribution Matters: A Data-Centric Perspective on Context Compression for Large Language Model
Authors: Kangtao Lv, Jiwei Tang, Langming Liu, Haibin Chen, Weidong Zhang, Shilei Liu, Yongwei Wang, Yujin Yuan, Wenbo Su, Bo Zheng
Abstract summary: We investigate how data distribution impacts compression quality, including two dimensions: input data and intrinsic data. We show that encoder-measured input entropy negatively correlates with compression quality, while decoder-measured entropy shows no significant relationship under a frozen-decoder setting.
Score: 20.1054266241262
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The deployment of Large Language Models (LLMs) in long-context scenarios is hindered by computational inefficiency and significant information redundancy. Although recent advancements have widely adopted context compression to address these challenges, existing research focuses only on model-side improvements, and the impact of the data distribution itself on context compression remains largely unexplored. To bridge this gap, we are the first to adopt a data-centric perspective to systematically investigate how data distribution impacts compression quality along two dimensions: input data and intrinsic data (i.e., the model's internal pretrained knowledge). We evaluate the semantic integrity of compressed representations using an autoencoder-based framework. Our experimental results reveal that: (1) encoder-measured input entropy negatively correlates with compression quality, while decoder-measured entropy shows no significant relationship under a frozen-decoder setting; and (2) the gap between the intrinsic data of the encoder and decoder significantly diminishes compression gains and is hard to mitigate. Based on these findings, we further present practical guidelines to optimize compression gains.

Related papers

Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective [21.41673002861847]
Retrieval-Augmented Generation (RAG) effectively grounds Large Language Models (LLMs) with external knowledge. Recent research on soft context compression aims to address this by encoding long documents into compact embeddings. We introduce SeleCom, a selector-based soft compression framework for RAG that redefines the encoder's role as query-conditioned information selector.
arXiv Detail & Related papers (2026-01-25T09:06:24Z)
DeepCQ: General-Purpose Deep-Surrogate Framework for Lossy Compression Quality Prediction [4.634179787231294]
We present a general-purpose deep-surrogate framework for lossy compression quality prediction (DeepCQ). Our results highlight the framework's exceptional predictive accuracy, with prediction errors generally under 10% across most settings.
arXiv Detail & Related papers (2025-12-24T21:46:17Z)
Test-Time Steering for Lossless Text Compression via Weighted Product of Experts [27.679089540901007]
We propose a novel framework that performs Test-Time Steering via a Weighted Product of Experts (wPoE). At inference, our method adaptively combines a universal compression model with a pretrained neural language model, ensuring the compression rate is at least as good as that of the best individual model. It seamlessly integrates with any autoregressive language model, providing a practical solution for enhancing text compression across diverse data distributions.
arXiv Detail & Related papers (2025-11-04T16:37:56Z)
Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition [51.03674130115878]
We introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel "compression-aggregation-compression" architecture. KINN establishes a state-of-the-art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios.
arXiv Detail & Related papers (2025-10-23T07:12:26Z)
Compressed Feature Quality Assessment: Dataset and Baselines [89.62929964441962]
We propose the first benchmark dataset for evaluating semantic fidelity of compressed features. We systematically assess three widely used metrics -- MSE, cosine similarity, and Centered Kernel Alignment (CKA) -- in terms of their ability to capture semantic degradation. This work advances the field by establishing a foundational benchmark and providing a critical resource for the community to explore CFQA.
arXiv Detail & Related papers (2025-06-09T04:16:39Z)
Accelerated Methods with Compressed Communications for Distributed Optimization Problems under Data Similarity [55.03958223190181]
We propose the first theoretically grounded accelerated algorithms utilizing unbiased and biased compression under data similarity. Our results are of record and confirmed by experiments on different average losses and datasets.
arXiv Detail & Related papers (2024-12-21T00:40:58Z)
ODDN: Addressing Unpaired Data Challenges in Open-World Deepfake Detection on Online Social Networks [51.03118447290247]
We propose the open-world deepfake detection network (ODDN), which comprises open-world data aggregation (ODA) and compression-discard gradient correction (CGC). ODA effectively aggregates correlations between compressed and raw samples through both fine-grained and coarse-grained analyses. CGC incorporates a compression-discard gradient correction to further enhance performance across diverse compression methods in online social networks (OSNs).
arXiv Detail & Related papers (2024-10-24T12:32:22Z)
Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models [0.0]
Large language models (LLMs) offer powerful capabilities but incur substantial computational costs. This study evaluates the impact of popular compression methods on the LLaMA-2-7B model. We show that while SparseGPT and Wanda preserve perplexity even at 50% sparsity, they suffer significant degradation on downstream tasks.
arXiv Detail & Related papers (2024-09-17T14:34:11Z)
Sparse $L^1$-Autoencoders for Scientific Data Compression [0.0]
We introduce effective data compression methods by developing autoencoders using high dimensional latent spaces that are $L^1$-regularized. We show how these information-rich latent spaces can be used to mitigate blurring and other artifacts to obtain highly effective data compression methods for scientific data.
arXiv Detail & Related papers (2024-05-23T07:48:00Z)
Compression of Structured Data with Autoencoders: Provable Benefit of Nonlinearities and Depth [83.15263499262824]
We prove that gradient descent converges to a solution that completely disregards the sparse structure of the input. We show how to improve upon Gaussian performance for the compression of sparse data by adding a denoising function to a shallow architecture. We validate our findings on image datasets, such as CIFAR-10 and MNIST.
arXiv Detail & Related papers (2024-02-07T16:32:29Z)
What do Compressed Large Language Models Forget? Robustness Challenges in Model Compression [68.82486784654817]
We study two popular model compression techniques including knowledge distillation and pruning. We show that compressed models are significantly less robust than their PLM counterparts on adversarial test sets. We develop a regularization strategy for model compression based on sample uncertainty.
arXiv Detail & Related papers (2021-10-16T00:20:04Z)
Neural Distributed Source Coding [59.630059301226474]
We present a framework for lossy DSC that is agnostic to the correlation structure and can scale to high dimensions. We evaluate our method on multiple datasets and show that it can handle complex correlations and achieves state-of-the-art PSNR.
arXiv Detail & Related papers (2021-06-05T04:50:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.