Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention
- URL: http://arxiv.org/abs/2505.15774v1
- Date: Wed, 21 May 2025 17:26:11 GMT
- Title: Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention
- Authors: Huanxuan Liao, Wen Hu, Yao Xu, Shizhu He, Jun Zhao, Kang Liu
- Abstract summary: Large Language Models (LLMs) encounter significant challenges in long-sequence inference due to computational inefficiency and redundant processing. Existing methods often rely on token importance to perform hard local compression or encode context into latent representations for soft global compression. We propose HyCo$_2$, which integrates both global and local perspectives to guide context compression.
- Score: 30.580674811560613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) encounter significant challenges in long-sequence inference due to computational inefficiency and redundant processing, driving interest in context compression techniques. Existing methods often rely on token importance to perform hard local compression or encode context into latent representations for soft global compression. However, the uneven distribution of textual content relevance and the diversity of demands for user instructions mean these approaches frequently lead to the loss of potentially valuable information. To address this, we propose $\textbf{Hy}$brid $\textbf{Co}$ntext $\textbf{Co}$mpression (HyCo$_2$) for LLMs, which integrates both global and local perspectives to guide context compression while retaining both the essential semantics and critical details for task completion. Specifically, we employ a hybrid adapter to refine global semantics with the global view, based on the observation that different adapters excel at different tasks. Then we incorporate a classification layer that assigns a retention probability to each context token based on the local view, determining whether it should be retained or discarded. To foster a balanced integration of global and local compression, we introduce auxiliary paraphrasing and completion pretraining before instruction tuning. This promotes a synergistic integration that emphasizes instruction-relevant information while preserving essential local details, ultimately balancing local and global information retention in context compression. Experiments show that our HyCo$_2$ method significantly enhances long-text reasoning while reducing token usage. It improves the performance of various LLM series by an average of 13.1\% across seven knowledge-intensive QA benchmarks. Moreover, HyCo$_2$ matches the performance of uncompressed methods while reducing token consumption by 88.8\%.
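The "local view" described above, a classification layer that assigns each context token a retention probability, can be illustrated with a minimal sketch. This is an assumption-laden toy, not the authors' implementation: the class name, hidden size, and the 0.5 threshold are all invented for illustration.

```python
import torch
import torch.nn as nn

class TokenRetentionHead(nn.Module):
    """Toy sketch of a per-token retention classifier (hypothetical, not HyCo_2's code)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # A binary scorer over each token's hidden state.
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor, threshold: float = 0.5):
        # hidden_states: (batch, seq_len, hidden_size), e.g. from a frozen LLM layer.
        probs = torch.sigmoid(self.scorer(hidden_states)).squeeze(-1)  # (batch, seq_len)
        keep_mask = probs >= threshold  # True for tokens kept after hard local compression
        return probs, keep_mask

# Toy usage: compress a 10-token context for a single sequence.
head = TokenRetentionHead(hidden_size=16)
states = torch.randn(1, 10, 16)
probs, mask = head(states)
compressed = states[mask].unsqueeze(0)  # keep only tokens above the threshold
print(probs.shape, compressed.shape)
```

In the paper's framing this local pruning runs alongside a global soft compression (the hybrid adapter); the sketch above only shows the token-level half.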
Related papers
- DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression [63.83422894663496]
We propose a dynamic attention-aware approach for task-agnostic prompt compression (DAC). This approach effectively integrates entropy and attention information, dynamically sensing entropy shifts during compression to achieve fine-grained prompt compression. Extensive experiments across various domains, including LongBench, GSM8K, and BBH, show that DAC consistently yields robust and substantial improvements.
arXiv Detail & Related papers (2025-07-16T06:16:06Z) - PIS: Linking Importance Sampling and Attention Mechanisms for Efficient Prompt Compression [3.6268731121741067]
Large language models (LLMs) have achieved remarkable progress, demonstrating unprecedented capabilities across various natural language processing tasks. Existing prompt compression methods rely on truncation or abstractive summarization techniques. We introduce Prompt Importance Sampling (PIS), a novel compression framework that dynamically compresses prompts by sampling important tokens.
arXiv Detail & Related papers (2025-04-23T09:53:01Z) - Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models [36.16630765077807]
We propose a Hybrid-level Instruction Injection Strategy for Conditional Token Compression in MLLMs (HICom). We use the instruction as a condition to guide compression at both the local and global levels. Experiments show that HICom achieves strong video understanding ability with fewer tokens.
arXiv Detail & Related papers (2025-03-20T11:09:18Z) - DAST: Context-Aware Compression in LLMs via Dynamic Allocation of Soft Tokens [20.044306399439265]
Large Language Models (LLMs) face computational inefficiencies and redundant processing when handling long context inputs. We propose Dynamic Allocation of Soft Tokens (DAST), a simple yet effective method that leverages the LLM's intrinsic understanding of contextual relevance to guide compression. Experimental results across multiple benchmarks demonstrate that DAST surpasses state-of-the-art methods.
arXiv Detail & Related papers (2025-02-17T06:55:13Z) - Federated Class-Incremental Learning: A Hybrid Approach Using Latent Exemplars and Data-Free Techniques to Address Local and Global Forgetting [10.061328213032088]
Federated Class-Incremental Learning (FCIL) refers to a scenario where a dynamically changing number of clients collaboratively learn an ever-increasing number of incoming tasks. We develop a mathematical framework for FCIL that formulates local and global forgetting. We propose an approach called Hybrid Rehearsal, which utilizes latent exemplars and data-free techniques to address local and global forgetting.
arXiv Detail & Related papers (2025-01-26T01:08:01Z) - Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models [28.311125014789905]
"Global Compression Commander" (GlobalCom$^2$) is a novel plug-and-play token compression framework for HR-LVLMs. Our experiments show that GlobalCom$^2$ maintains over 90% performance while compressing 90% of visual tokens.
arXiv Detail & Related papers (2025-01-09T11:57:58Z) - Position IDs Matter: An Enhanced Position Layout for Efficient Context Compression in Large Language Models [50.637714223178456]
We propose Enhanced Position Layout (EPL) to improve the context compression capability of large language models (LLMs). EPL minimizes the distance between context tokens and their corresponding special tokens while maintaining the sequence order in position IDs. When extended to multimodal scenarios, EPL brings an average accuracy gain of 2.6 to vision compression LLMs.
arXiv Detail & Related papers (2024-09-22T08:51:18Z) - QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory [66.01597794579568]
We introduce information bottleneck theory (IB) to model the problem. We propose a cross-attention-based approach to approximate mutual information in IB. Our method achieves a 25% increase in compression rate compared to the state-of-the-art.
arXiv Detail & Related papers (2024-08-20T02:44:45Z) - Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z) - Quantize Once, Train Fast: Allreduce-Compatible Compression with Provable Guarantees [53.950234267704]
We introduce Global-QSGD, an Allreduce-compatible gradient quantization method. We show that it accelerates distributed training by up to 3.51% over baseline quantization methods.
arXiv Detail & Related papers (2023-05-29T21:32:15Z) - Coupling Global Context and Local Contents for Weakly-Supervised Semantic Segmentation [54.419401869108846]
We propose a single-stage Weakly-Supervised Semantic Segmentation (WSSS) model with only image-level class label supervision.
A flexible context aggregation module is proposed to capture the global object context in different granular spaces.
A semantically consistent feature fusion module is proposed in a bottom-up parameter-learnable fashion to aggregate the fine-grained local contents.
arXiv Detail & Related papers (2023-04-18T15:29:23Z) - Faster Non-Convex Federated Learning via Global and Local Momentum [57.52663209739171]
FedGLOMO is the first (first-order) FL algorithm to apply momentum at both the global and local levels.
Our algorithm is provably optimal even with compressed communication between the clients and the server.
arXiv Detail & Related papers (2020-12-07T21:05:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.