Related papers: ICPC: In-context Prompt Compression with Faster Inference

ICPC: In-context Prompt Compression with Faster Inference

URL: http://arxiv.org/abs/2501.01625v1
Date: Fri, 03 Jan 2025 03:46:51 GMT
Title: ICPC: In-context Prompt Compression with Faster Inference
Authors: Ziyang Yu, Yuyu Liu,
Abstract summary: We propose I CPC (In-context Prompt Compression), a novel and scalable prompt compression method that adaptively reduces the prompt length.<n>The key idea of I CPC is to calculate the probability of each word appearing in the prompt using encoders and calculate information carried by each word through the information function.<n> Empirically, we demonstrate that I CPC can effectively compress long texts of different categories and thus achieve better performance and speed on different types of NLP tasks.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite the recent success of Large Language Models (LLMs), it remains challenging to feed LLMs with long prompts due to the fixed size of LLM inputs. As a remedy, prompt compression becomes a promising solution by removing redundant tokens in the prompt. However, using LLM in the existing works requires additional computation resources and leads to memory overheads. To address it, we propose ICPC (In-context Prompt Compression), a novel and scalable prompt compression method that adaptively reduces the prompt length. The key idea of ICPC is to calculate the probability of each word appearing in the prompt using encoders and calculate information carried by each word through the information function, which effectively reduces the information loss during prompt compression and increases the speed of compression. Empirically, we demonstrate that ICPC can effectively compress long texts of different categories and thus achieve better performance and speed on different types of NLP tasks.

Related papers

Dynamic Compressing Prompts for Efficient Inference of Large Language Models [38.604760935983364]
Large Language Models (LLMs) have shown outstanding performance across a variety of tasks, partly due to advanced prompting techniques. While prompt compression is a straightforward solution, existing methods confront the challenges of retaining essential information, adapting to context changes, and remaining effective across different tasks. Our method reduces the number of prompt tokens while aiming to preserve the performance as much as possible.
arXiv Detail & Related papers (2025-04-15T09:20:45Z)
LightThinker: Thinking Step-by-Step Compression [53.8069487638972]
We propose LightThinker, a method that enables large language models to dynamically compress intermediate thoughts during reasoning. Inspired by human cognitive processes, LightThinker compresses thought steps into compact representations and discards the original reasoning chains. Experiments show that LightThinker reduces peak memory usage and inference time, while maintaining competitive accuracy.
arXiv Detail & Related papers (2025-02-21T16:57:22Z)
Efficient Long Context Language Model Retrieval with Compression [57.09163579304332]
Long Context Language Models (LCLMs) have emerged as a new paradigm to perform Information Retrieval (IR)<n>We propose a new compression approach tailored for LCLM retrieval, which is trained to maximize the retrieval performance while minimizing the length of the compressed passages.<n>We show that CoLoR improves the retrieval performance by 6% while compressing the in-context size by a factor of 1.91.
arXiv Detail & Related papers (2024-12-24T07:30:55Z)
Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference [16.830389144259584]
We propose context-aware prompt compression (CPC), a sentence-level prompt compression technique.<n>Key innovation is a novel context-aware sentence encoder that provides a relevance score for each sentence for a given question.<n>Our method considerably outperforms prior works on prompt compression on benchmark datasets.
arXiv Detail & Related papers (2024-09-02T13:02:51Z)
LanguaShrink: Reducing Token Overhead with Psycholinguistics [8.123272461141815]
LanguaShrink is a prompt compression framework for large language models. It reduces prompt length while preserving essential information. Compared to existing prompt compression methods, LanguaShrink improves end-to-end latency by 1.43 times.
arXiv Detail & Related papers (2024-09-01T22:09:20Z)
In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs) We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings. Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z)
Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.<n>We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.<n>We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z)
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression [43.048684907893104]
This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one. Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT.
arXiv Detail & Related papers (2024-03-19T17:59:56Z)
Learning to Compress Prompt in Natural Language Formats [54.06967020905763]
Large language models (LLMs) are great at processing multiple natural language processing tasks. LLMs are constrained by inferior performance with long context, slow inference speed, and the high cost of computing the results. This work aims to compress lengthy prompts in the form of natural language with LLM transferability.
arXiv Detail & Related papers (2024-02-28T20:41:21Z)
Compressing LLMs: The Truth is Rarely Pure and Never Simple [90.05366363633568]
Knowledge-Intensive Compressed LLM BenchmarK aims to redefine the evaluation protocol for compressed Large Language Models. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods. LLM-KICK is designed to holistically access compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc.
arXiv Detail & Related papers (2023-10-02T17:42:37Z)
Discrete Prompt Compression with Reinforcement Learning [2.664293070994717]
Compressed prompts aid instruction-tuned language models (LMs) in overcoming context window limitations and reducing computational costs. Existing methods, which primarily based on training embeddings, face various challenges associated with interpretability, the fixed number of embedding tokens, reusability across different LMs, and inapplicability when interacting with black-box APIs. This study proposes prompt compression with reinforcement learning (PCRL), which is a discrete prompt compression method that addresses these issues.
arXiv Detail & Related papers (2023-08-17T03:10:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.