Related papers: Characterizing Prompt Compression Methods for Long Context Inference

Characterizing Prompt Compression Methods for Long Context Inference

URL: http://arxiv.org/abs/2407.08892v1
Date: Thu, 11 Jul 2024 23:34:32 GMT
Title: Characterizing Prompt Compression Methods for Long Context Inference
Authors: Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami,
Abstract summary: Long context inference presents challenges at the system level with increased compute and memory requirements. Several methods have been proposed to compress the prompt to reduce the context length. We perform a comprehensive characterization and evaluation of different prompt compression methods.
Score: 36.9745587176401
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long context inference presents challenges at the system level with increased compute and memory requirements, as well as from an accuracy perspective in being able to reason over long contexts. Recently, several methods have been proposed to compress the prompt to reduce the context length. However, there has been little work on comparing the different proposed methods across different tasks through a standardized analysis. This has led to conflicting results. To address this, here we perform a comprehensive characterization and evaluation of different prompt compression methods. In particular, we analyze extractive compression, summarization-based abstractive compression, and token pruning methods. Surprisingly, we find that extractive compression often outperforms all the other approaches, and enables up to 10x compression with minimal accuracy degradation. Interestingly, we also find that despite several recent claims, token pruning methods often lag behind extractive compression. We only found marginal improvements on summarization tasks.

Related papers

Perception Compressor:A training-free prompt compression method in long context scenarios [17.720102137585503]
Perception is a training-free prompt compression method for large language models. It outperforms existing methods by a large margin, achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-09-28T07:13:33Z)
Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference [16.830389144259584]
We propose context-aware prompt compression (CPC), a sentence-level prompt compression technique. Key innovation is a novel context-aware sentence encoder that provides a relevance score for each sentence for a given question. Our method considerably outperforms prior works on prompt compression on benchmark datasets.
arXiv Detail & Related papers (2024-09-02T13:02:51Z)
LanguaShrink: Reducing Token Overhead with Psycholinguistics [8.123272461141815]
LanguaShrink is a prompt compression framework for large language models. It reduces prompt length while preserving essential information. Compared to existing prompt compression methods, LanguaShrink improves end-to-end latency by 1.43 times.
arXiv Detail & Related papers (2024-09-01T22:09:20Z)
Concise and Precise Context Compression for Tool-Using Language Models [60.606281074373136]
We propose two strategies for compressing tool documentation into concise and precise summary sequences for tool-using language models. Results on API-Bank and APIBench show that our approach reaches a performance comparable to the upper-bound baseline under up to 16x compression ratio.
arXiv Detail & Related papers (2024-07-02T08:17:00Z)
In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs) We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings. Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z)
Compressing Lengthy Context With UltraGist [22.054232261437186]
We propose a new method called UltraGist, which is distinguished for its high-quality compression of lengthy context. UltraGist contributes to the flexibility of compression, as it can be effectively learned to support a broad range of context lengths and compression ratios. It makes the training process sample-efficient and thus maximizes the use of training data.
arXiv Detail & Related papers (2024-05-26T17:23:56Z)
Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text. We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length. We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z)
Once-for-All Sequence Compression for Self-Supervised Speech Models [62.60723685118747]
We introduce a once-for-all sequence compression framework for self-supervised speech models. The framework is evaluated on various tasks, showing marginal degradation compared to the fixed compressing rate variants. We also explore adaptive compressing rate learning, demonstrating the ability to select task-specific preferred frame periods without needing a grid search.
arXiv Detail & Related papers (2022-11-04T09:19:13Z)
Unrolled Compressed Blind-Deconvolution [77.88847247301682]
sparse multichannel blind deconvolution (S-MBD) arises frequently in many engineering applications such as radar/sonar/ultrasound imaging. We propose a compression method that enables blind recovery from much fewer measurements with respect to the full received signal in time.
arXiv Detail & Related papers (2022-09-28T15:16:58Z)
Optimal Gradient Compression for Distributed and Federated Learning [9.711326718689492]
Communication between computing nodes in distributed learning is typically an unavoidable burden. Recent advances in communication-efficient training algorithms have reduced this bottleneck by using compression techniques. In this paper, we investigate the fundamental trade-off between the number of bits needed to encode compressed vectors and the compression error.
arXiv Detail & Related papers (2020-10-07T07:58:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.