Long Context In-Context Compression by Getting to the Gist of Gisting
- URL: http://arxiv.org/abs/2504.08934v1
- Date: Fri, 11 Apr 2025 19:23:31 GMT
- Title: Long Context In-Context Compression by Getting to the Gist of Gisting
- Authors: Aleksandar Petrov, Mark Sandler, Andrey Zhmoginov, Nolan Miller, Max Vladymyrov
- Abstract summary: GistPool is an in-context compression method with no architectural modification to the decoder transformer. We demonstrate that gisting struggles with longer contexts, with significant performance drops even at minimal compression rates. GistPool preserves the simplicity of gisting, while significantly boosting its performance on long context compression tasks.
- Score: 50.24627831994713
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long context processing is critical for the adoption of LLMs, but existing methods often introduce architectural complexity that hinders their practical adoption. Gisting, an in-context compression method with no architectural modification to the decoder transformer, is a promising approach due to its simplicity and compatibility with existing frameworks. While effective for short instructions, we demonstrate that gisting struggles with longer contexts, with significant performance drops even at minimal compression rates. Surprisingly, a simple average pooling baseline consistently outperforms gisting. We analyze the limitations of gisting, including information flow interruptions, capacity limitations and the inability to restrict its attention to subsets of the context. Motivated by theoretical insights into the performance gap between gisting and average pooling, and supported by extensive experimentation, we propose GistPool, a new in-context compression method. GistPool preserves the simplicity of gisting, while significantly boosting its performance on long context compression tasks.
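To make the average pooling baseline mentioned in the abstract concrete, here is a minimal sketch (not the paper's implementation): consecutive segments of context embeddings are mean-pooled, so n context vectors shrink to roughly n/rate compressed vectors for the decoder to attend to. The function name `average_pool_compress` and the fixed segment size are illustrative assumptions.

```python
# Minimal sketch of segment-wise average pooling as an in-context compression
# baseline (illustrative assumptions; not the paper's implementation).
import numpy as np

def average_pool_compress(context_embeddings: np.ndarray, rate: int) -> np.ndarray:
    """Compress (n, d) context embeddings to (ceil(n / rate), d) by mean-pooling
    consecutive segments of `rate` tokens."""
    n, _ = context_embeddings.shape
    pooled = [context_embeddings[start:start + rate].mean(axis=0)
              for start in range(0, n, rate)]
    return np.stack(pooled)

# Example: 1024 context embeddings of width 768 compressed 8x to 128 vectors.
context = np.random.randn(1024, 768).astype(np.float32)
print(average_pool_compress(context, rate=8).shape)  # (128, 768)
```

In the paper's setting such pooled vectors play a role analogous to gist tokens; the training and attention-masking details are beyond this sketch.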
Related papers
- MOOSComp: Improving Lightweight Long-Context Compressor via Mitigating Over-Smoothing and Incorporating Outlier Scores [5.893964327109089]
MOOSComp is a token-classification-based long-context compression method.
We introduce outlier scores to preserve rare but critical tokens that are prone to be discarded in task-agnostic compression.
Our method obtains a speedup of 3.3x at a 4x compression ratio on a resource-constrained mobile device.
arXiv Detail & Related papers (2025-04-23T15:02:53Z)
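As a rough illustration of the MOOSComp entry above, the sketch below combines a per-token importance score with an outlier score so that rare tokens are less likely to be dropped. The specific scoring (negative log frequency), the weighting, and `keep_ratio` are assumptions for illustration, not MOOSComp's actual design.

```python
# Hedged sketch: keep the tokens with the highest combined importance + outlier
# score, so rare but critical tokens are less likely to be discarded
# (assumed scoring; not MOOSComp's actual method).
import numpy as np

def select_tokens(importance: np.ndarray, token_freq: np.ndarray,
                  keep_ratio: float = 0.25, outlier_weight: float = 0.5) -> np.ndarray:
    """importance: (n,) classifier scores in [0, 1]; token_freq: (n,) frequencies.
    Returns indices of the kept tokens, in original order."""
    outlier_score = -np.log(token_freq + 1e-9)   # rarer tokens score higher
    outlier_score /= outlier_score.max()         # normalize to [0, 1]
    combined = importance + outlier_weight * outlier_score
    k = max(1, int(len(combined) * keep_ratio))  # 4x compression at keep_ratio=0.25
    return np.sort(np.argsort(-combined)[:k])    # preserve original token order

importance = np.random.rand(400)
token_freq = np.random.rand(400)
print(select_tokens(importance, token_freq).shape)  # (100,)
```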
- Mask-Encoded Sparsification: Mitigating Biased Gradients in Communication-Efficient Split Learning [15.78336840511033]
This paper introduces a novel framework designed to achieve a high compression ratio in Split Learning (SL) scenarios.
Our investigations demonstrate that compressing feature maps within SL leads to biased gradients that can negatively impact the convergence rates.
We employ a narrow bit-width encoded mask to compensate for the sparsification error without increasing the order of time complexity.
arXiv Detail & Related papers (2024-08-25T09:30:34Z)
- QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory [66.01597794579568]
We introduce information bottleneck theory (IB) to model the problem.
We propose a cross-attention-based approach to approximate mutual information in IB.
Our method achieves a 25% increase in compression rate compared to the state-of-the-art.
arXiv Detail & Related papers (2024-08-20T02:44:45Z)
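The QUITO-X entry above uses cross-attention to approximate mutual information; without reproducing that estimator, the general pattern of treating cross-attention mass as a token-importance signal for context compression can be sketched as follows (shapes and `keep_ratio` are illustrative assumptions).

```python
# Hedged sketch: rank context tokens by the cross-attention mass they receive
# and keep the top fraction (illustrative pattern; not QUITO-X's exact method).
import numpy as np

def compress_by_attention(attn: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """attn: (heads, query_len, ctx_len) cross-attention weights.
    Returns indices of the context tokens to keep, in original order."""
    token_mass = attn.mean(axis=0).sum(axis=0)   # (ctx_len,) attention received
    k = max(1, int(attn.shape[-1] * keep_ratio))
    return np.sort(np.argsort(-token_mass)[:k])

attn = np.random.rand(12, 32, 512)
attn /= attn.sum(axis=-1, keepdims=True)         # normalize over context tokens
print(compress_by_attention(attn).shape)         # (128,)
```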
- Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models [21.025001473355996]
We formalize the problem of prompt compression for large language models (LLMs).
We present a framework to unify token-level prompt compression methods which create hard prompts for black-box models.
We show that there is a large gap between the performance of current prompt compression methods and the optimal strategy.
arXiv Detail & Related papers (2024-07-22T09:40:13Z)
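For background on the rate-distortion framing in the entry above, the classical rate-distortion function is reproduced below. How the paper instantiates the distortion measure d for black-box prompt compression is not specified in this summary, so the formula is context, not the paper's exact formulation.

```latex
% Classical rate-distortion function (background only; the paper's
% prompt-compression formulation may instantiate d(.,.) differently).
R(D) = \min_{p(\hat{x} \mid x)\,:\, \mathbb{E}[d(X, \hat{X})] \le D} \; I(X; \hat{X})
```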
- In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs).
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z)
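The digest-token mechanism described in the In-Context Former entry follows a familiar pattern: a small set of learnable vectors cross-attends to the context embeddings, and the attention outputs become the compressed representation. The single-head numpy sketch below shows that pattern only; dimensions, projections, and the name `digest_compress` are assumptions, not the paper's architecture.

```python
# Hedged sketch of digest-token cross-attention compression: m learnable digest
# vectors attend over n context embeddings, yielding m compressed vectors
# (single-head, numpy; illustrative of the pattern, not the paper's model).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def digest_compress(context: np.ndarray, digests: np.ndarray,
                    wq: np.ndarray, wk: np.ndarray, wv: np.ndarray) -> np.ndarray:
    """context: (n, d); digests: (m, d) learnable tokens; wq/wk/wv: (d, d) projections.
    Returns (m, d) compressed representations."""
    q = digests @ wq                      # (m, d)
    k = context @ wk                      # (n, d)
    v = context @ wv                      # (n, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v   # (m, d)

d, n, m = 256, 2048, 64                   # 32x compression of the context length
rng = np.random.default_rng(0)
out = digest_compress(rng.standard_normal((n, d)), rng.standard_normal((m, d)),
                      *(rng.standard_normal((d, d)) * 0.05 for _ in range(3)))
print(out.shape)  # (64, 256)
```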
- Learning Accurate Performance Predictors for Ultrafast Automated Model Compression [86.22294249097203]
We propose an ultrafast automated model compression framework called SeerNet for flexible network deployment.
Our method achieves competitive accuracy-complexity trade-offs with significant reduction of the search cost.
arXiv Detail & Related papers (2023-04-13T10:52:49Z)
- Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially on resource-limited devices.
Previous unstructured or structured weight pruning methods can hardly deliver genuine inference acceleration.
We propose a generalized weight unification framework at a hardware-compatible micro-structured level to achieve a high degree of compression and acceleration.
arXiv Detail & Related papers (2021-06-15T17:22:59Z)
- Structured Sparsification with Joint Optimization of Group Convolution and Channel Shuffle [117.95823660228537]
We propose a novel structured sparsification method for efficient network compression.
The proposed method automatically induces structured sparsity on the convolutional weights.
We also address the problem of inter-group communication with a learnable channel shuffle mechanism.
arXiv Detail & Related papers (2020-02-19T12:03:10Z)
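The learnable channel shuffle in the last entry is paper-specific, but the standard channel shuffle operation it builds on is easy to illustrate: reshape the channels into (groups, channels_per_group), transpose, and flatten, so each group's output mixes channels from every input group. The sketch below shows the plain, non-learnable version; names are illustrative.

```python
# Sketch of the standard channel shuffle used with group convolutions
# (plain, non-learnable version; illustrative only).
import numpy as np

def channel_shuffle(x: np.ndarray, groups: int) -> np.ndarray:
    """x: (batch, channels, height, width) with channels divisible by `groups`."""
    b, c, h, w = x.shape
    assert c % groups == 0, "channels must be divisible by groups"
    x = x.reshape(b, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)        # swap group and per-group channel axes
    return x.reshape(b, c, h, w)

x = np.arange(2 * 8 * 1 * 1, dtype=np.float32).reshape(2, 8, 1, 1)
print(channel_shuffle(x, groups=4)[0, :, 0, 0])  # channels interleaved across groups
```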