Related papers: MOOSComp: Improving Lightweight Long-Context Compressor via Mitigating Over-Smoothing and Incorporating Outlier Scores

MOOSComp: Improving Lightweight Long-Context Compressor via Mitigating Over-Smoothing and Incorporating Outlier Scores

URL: http://arxiv.org/abs/2504.16786v1
Date: Wed, 23 Apr 2025 15:02:53 GMT
Title: MOOSComp: Improving Lightweight Long-Context Compressor via Mitigating Over-Smoothing and Incorporating Outlier Scores
Authors: Fengwei Zhou, Jiafei Song, Wenjin Jason Li, Gengjian Xue, Zhikang Zhao, Yichao Lu, Bailin Na,
Abstract summary: MOOSComp is a token-classification-based long-context compression method.<n>We introduce outlier scores to preserve rare but critical tokens that are prone to be discarded in task-agnostic compression.<n>Our method obtains a speedup of 3.3x at a 4x compression ratio on a resource-constrained mobile device.
Score: 5.893964327109089
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in large language models have significantly improved their ability to process long-context input, but practical applications are challenged by increased inference time and resource consumption, particularly in resource-constrained environments. To address these challenges, we propose MOOSComp, a token-classification-based long-context compression method that enhances the performance of a BERT-based compressor by mitigating the over-smoothing problem and incorporating outlier scores. In the training phase, we add an inter-class cosine similarity loss term to penalize excessively similar token representations, thereby improving the token classification accuracy. During the compression phase, we introduce outlier scores to preserve rare but critical tokens that are prone to be discarded in task-agnostic compression. These scores are integrated with the classifier's output, making the compressor more generalizable to various tasks. Superior performance is achieved at various compression ratios on long-context understanding and reasoning benchmarks. Moreover, our method obtains a speedup of 3.3x at a 4x compression ratio on a resource-constrained mobile device.

Related papers

Long Context In-Context Compression by Getting to the Gist of Gisting [50.24627831994713]
GistPool is an in-context compression method with no architectural modification to the decoder transformer.<n>We demonstrate that gisting struggles with longer contexts, with significant performance drops even at minimal compression rates.<n>GistPool preserves the simplicity of gisting, while significantly boosting its performance on long context compression tasks.
arXiv Detail & Related papers (2025-04-11T19:23:31Z)
Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models.<n>We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z)
Optimizing Singular Spectrum for Large Language Model Compression [95.7621116637755]
We introduce SoCo, a novel compression framework that learns to rescale the decomposed components of SVD in a data-driven manner.<n>Thanks to the learnable singular spectrum, SoCo adaptively prunes components according to the sparsified importance scores.<n> Experimental evaluations across multiple LLMs and benchmarks demonstrate that SoCo surpasses the state-of-the-art methods in model compression.
arXiv Detail & Related papers (2025-02-20T23:18:39Z)
Fast Feedforward 3D Gaussian Splatting Compression [55.149325473447384]
3D Gaussian Splatting (FCGS) is an optimization-free model that can compress 3DGS representations rapidly in a single feed-forward pass. FCGS achieves a compression ratio of over 20X while maintaining fidelity, surpassing most per-scene SOTA optimization-based methods.
arXiv Detail & Related papers (2024-10-10T15:13:08Z)
Perception Compressor: A Training-Free Prompt Compression Framework in Long Context Scenarios [17.720102137585503]
Perception is a training-free prompt compression framework for large language models.<n>It includes a perception retriever that leverages guiding questions and instruction to retrieve the most relevant demonstrations.<n>We conduct extensive experiments on long context, benchmarks, iSie, LongBench, and MuSiQue.
arXiv Detail & Related papers (2024-09-28T07:13:33Z)
Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning [63.43972993473501]
Token compression expedites the training and inference of Vision Transformers (ViTs) However, when applied to downstream tasks, compression degrees are mismatched between training and inference stages. We propose a model arithmetic framework to decouple the compression degrees between the two stages.
arXiv Detail & Related papers (2024-08-13T10:36:43Z)
In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs) We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings. Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z)
Retaining Key Information under High Compression Ratios: Query-Guided Compressor for LLMs [35.91962517513945]
Performance of previous methods degrades dramatically as compression ratios increase, sometimes even falling to the closed-book level. We introduce Query-Guided (QGC) which leverages queries to guide the context compression process. We validate the effectiveness of our proposed QGC on the Question Answering task, including NaturalQuestions, TriviaQA, and HotpotQA datasets.
arXiv Detail & Related papers (2024-06-04T14:53:24Z)
Context Compression for Auto-regressive Transformers with Sentinel Tokens [37.07722536907739]
We propose a plug-and-play approach that is able to incrementally compress the intermediate activation of a specified span of tokens into compact ones. Experiments on both in-domain language modeling and zero-shot open-ended document generation demonstrate the advantage of our approach.
arXiv Detail & Related papers (2023-10-12T09:18:19Z)
Sign Bit is Enough: A Learning Synchronization Framework for Multi-hop All-reduce with Ultimate Compression [17.692238652162203]
We implement a sign-bit compression-based learning synchronization framework, Marsit. It reduces up to 35% training time while preserving the same accuracy as training without compression.
arXiv Detail & Related papers (2022-04-14T06:54:32Z)
Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification [12.517161466778655]
Distributed model training suffers from communication bottlenecks due to frequent model updates transmitted across compute nodes. To alleviate these bottlenecks, practitioners use gradient compression techniques like sparsification, quantization, or low-rank updates. In this work, we show that such performance degradation due to choosing a high compression ratio is not fundamental. An adaptive compression strategy can reduce communication while maintaining final test accuracy.
arXiv Detail & Related papers (2020-10-29T16:41:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.