Learning to Compress Prompts with Gist Tokens
- URL: http://arxiv.org/abs/2304.08467v3
- Date: Mon, 12 Feb 2024 19:03:18 GMT
- Title: Learning to Compress Prompts with Gist Tokens
- Authors: Jesse Mu, Xiang Lisa Li, Noah Goodman
- Abstract summary: We present gisting, which trains an LM to compress prompts into smaller sets of "gist" tokens.
On decoder (LLaMA-7B) and encoder-decoder (FLAN-T5-XXL) LMs, gisting enables up to 26x compression of prompts.
- Score: 16.64173373856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompting is the primary way to utilize the multitask capabilities of
language models (LMs), but prompts occupy valuable space in the input context
window, and repeatedly encoding the same prompt is computationally inefficient.
Finetuning and distillation methods allow for specialization of LMs without
prompting, but require retraining the model for each task. To avoid this
trade-off entirely, we present gisting, which trains an LM to compress prompts
into smaller sets of "gist" tokens which can be cached and reused for compute
efficiency. Gist models can be trained with no additional cost over standard
instruction finetuning by simply modifying Transformer attention masks to
encourage prompt compression. On decoder (LLaMA-7B) and encoder-decoder
(FLAN-T5-XXL) LMs, gisting enables up to 26x compression of prompts, resulting
in up to 40% FLOPs reductions, 4.2% wall time speedups, and storage savings,
all with minimal loss in output quality.
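As a rough illustration of the attention-mask modification the abstract describes, the sketch below builds a causal mask in which tokens that come after the gist span cannot attend to the original prompt tokens before it, so the prompt's content must be routed through the gist tokens. The layout `[prompt | gist | input/output]` and the function name are illustrative assumptions, not the authors' implementation.

```python
import torch

def gist_attention_mask(seq_len: int, gist_start: int, gist_end: int) -> torch.Tensor:
    """Boolean mask (True = may attend) for a decoder-only LM.

    Assumed token layout: [prompt | gist | input/output]. Tokens after the
    gist span are blocked from attending to the pre-gist prompt tokens.
    """
    # Standard causal mask: position i attends to positions <= i.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Block attention from post-gist tokens back to pre-gist prompt tokens.
    mask[gist_end:, :gist_start] = False
    return mask

# Example: 4 prompt tokens, 2 gist tokens, 3 input/output tokens.
print(gist_attention_mask(seq_len=9, gist_start=4, gist_end=6).int())
```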
Related papers
- CODEPROMPTZIP: Code-specific Prompt Compression for Retrieval-Augmented Generation in Coding Tasks with LMs [6.936336826531964]
Retrieval-Augmented Generation (RAG) enhances coding tasks by incorporating retrieved code examples into prompts.
Existing prompt compression techniques focus on natural language, lacking tailored solutions for code.
We propose CodePromptZip, a framework that compresses code examples before integrating into RAG.
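A crude illustration of the compress-then-insert pipeline described above: here a naive comment/blank-line stripper stands in for CodePromptZip's trained, code-specific compressor, and the prompt template is a made-up placeholder.

```python
def compress_code(snippet: str) -> str:
    """Placeholder compressor: drop blank lines and comment-only lines.
    The actual framework learns which code tokens to remove."""
    kept = [ln for ln in snippet.splitlines()
            if ln.strip() and not ln.strip().startswith("#")]
    return "\n".join(kept)

def build_rag_prompt(query: str, retrieved: list[str]) -> str:
    # Compress each retrieved code example before concatenating it into the prompt.
    examples = "\n\n".join(compress_code(s) for s in retrieved)
    return f"{examples}\n\n# Task: {query}\n"

snippet = "# add two numbers\n\ndef add(a, b):\n    return a + b\n"
print(build_rag_prompt("write a subtract function", [snippet]))
```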
arXiv Detail & Related papers (2025-02-19T23:15:23Z)
- Efficient and Effective Prompt Tuning via Prompt Decomposition and Compressed Outer Product [8.014705094248589]
LAMP is a low-parameter prompt tuning method that combines prompt decomposition with a compressed outer product.
Experiments across six architectures and eight datasets demonstrate that LAMP outperforms state-of-the-art PT-based and LoRA-based methods in performance and efficiency.
arXiv Detail & Related papers (2025-02-16T05:50:12Z)
- ICPC: In-context Prompt Compression with Faster Inference [0.0]
We propose ICPC (In-context Prompt Compression), a novel and scalable prompt compression method that adaptively reduces the prompt length.
The key idea of ICPC is to calculate the probability of each word appearing in the prompt using encoders and to measure the information carried by each word through the information function.
Empirically, we demonstrate that ICPC can effectively compress long texts of different categories and thus achieve better performance and speed on different types of NLP tasks.
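A toy version of the idea just described: unigram counts over the prompt itself stand in for the paper's encoder-based probability estimates, and the information function is taken to be self-information, -log p(w). The keep-the-top-k heuristic is an illustrative assumption.

```python
import math
from collections import Counter

def compress_prompt(prompt: str, keep_ratio: float = 0.5) -> str:
    """Keep the highest-information words, estimated from toy unigram counts."""
    words = prompt.split()
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    # Self-information of each word occurrence under the toy unigram model.
    info = [-math.log(counts[w.lower()] / total) for w in words]
    k = max(1, int(len(words) * keep_ratio))
    # Indices of the k most informative words, restored to original order.
    top = sorted(sorted(range(len(words)), key=lambda i: info[i], reverse=True)[:k])
    return " ".join(words[i] for i in top)

print(compress_prompt("the cat sat on the mat and the cat slept", keep_ratio=0.5))
```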
arXiv Detail & Related papers (2025-01-03T03:46:51Z)
- EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation [84.70637613266835]
We re-formulate model compression as a customized compensation problem.
We propose Training-free Eigenspace Low-Rank Approximation (EoRA).
EoRA directly minimizes compression-induced errors without requiring gradient-based training.
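A minimal sketch of training-free low-rank compensation: the residual between the original and compressed weight is approximated by a rank-r SVD and added back at inference. EoRA additionally performs this in an input-activation eigenspace; that weighting is omitted here, and the crude rounding "quantizer" is only a stand-in.

```python
import numpy as np

def lowrank_compensation(w_orig: np.ndarray, w_comp: np.ndarray, rank: int):
    """Approximate the compression residual W - W_c with a rank-r factorization."""
    residual = w_orig - w_comp
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (out, r), singular values folded into A
    b = vt[:rank, :]             # (r, in)
    return a, b                  # W_c + a @ b approximates W, no gradients needed

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
w_q = np.round(w * 4) / 4        # crude stand-in for a quantized weight
a, b = lowrank_compensation(w, w_q, rank=8)
print(np.linalg.norm(w - w_q), np.linalg.norm(w - (w_q + a @ b)))
```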
arXiv Detail & Related papers (2024-10-28T17:59:03Z)
- MrT5: Dynamic Token Merging for Efficient Byte-level Language Models [50.46453950887946]
This work introduces MrT5 (MergeT5), a more efficient variant of ByT5.
MrT5 integrates a token deletion mechanism in its encoder to dynamically shorten the input sequence length.
When trained on English text, MrT5 demonstrates the capability to transfer its deletion feature zero-shot across several languages.
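A rough illustration of a per-token deletion gate of the kind the summary mentions; in MrT5 the gate is learned inside the encoder, whereas here the gate scores are simply passed in, and the hard threshold is an assumption for illustration.

```python
import torch

def delete_tokens(hidden: torch.Tensor, gate_logits: torch.Tensor,
                  threshold: float = 0.5) -> torch.Tensor:
    """Drop tokens whose keep-probability falls below `threshold`.

    `hidden` is (seq, dim); `gate_logits` is (seq,). The returned, shorter
    sequence is what later encoder layers would process.
    """
    keep = torch.sigmoid(gate_logits) >= threshold
    return hidden[keep]

hidden = torch.randn(10, 16)
gate_logits = torch.randn(10)
print(delete_tokens(hidden, gate_logits).shape)
```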
arXiv Detail & Related papers (2024-10-28T06:14:12Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate its memory footprint include efficient attention variants integrated in upcycling stages and KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
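The sketch below shows one way a low-rank factorization of a KV projection weight shrinks what must be cached: a truncated SVD splits W into A and B, the r-dimensional intermediate x @ A is stored per token, and B is applied when the cache is read. The progressive, layer-sensitive compression strategy from the paper is not shown, and the matrix sizes are illustrative.

```python
import numpy as np

def lowrank_factor(w: np.ndarray, rank: int):
    """Truncated-SVD factorization W ~= A @ B."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]

rng = np.random.default_rng(0)
w_k = rng.standard_normal((512, 512))    # stand-in key projection weight
a, b = lowrank_factor(w_k, rank=64)
x = rng.standard_normal((1, 512))        # one token's hidden state
cached = x @ a                           # 64 values cached instead of 512
print(cached.shape, np.linalg.norm(x @ w_k - cached @ b))  # approximation error
```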
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning [11.167198972934736]
Large language models (LLMs) such as GPT-4 have led to a surge in the size of prompts required for optimal performance.
We propose a novel and efficient reinforcement learning (RL) based task-aware prompt compression method.
We demonstrate that our RL-guided compression method improves the task performance by 8% - 260% over state-of-the-art compression techniques.
arXiv Detail & Related papers (2024-09-19T18:11:59Z)
- Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling [53.58854856174773]
Speculative decoding is an approach to accelerate inference through a guess-and-verify paradigm.
Token Recycling stores candidate tokens in an adjacency matrix and employs a breadth-first search algorithm.
It significantly outperforms existing training-free methods by 30% and even a training-based method by 25%.
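A sketch of the breadth-first search step described above: a dict stands in for the paper's adjacency matrix of recycled candidate tokens, and BFS from the last accepted token assembles a draft that the target model would then verify. How candidates are retrieved and updated is omitted.

```python
from collections import deque

def build_draft_tree(adj: dict[int, list[int]], root: int, max_nodes: int = 8) -> list[int]:
    """Collect draft tokens by BFS over recycled candidate successors."""
    draft, queue, seen = [], deque([root]), {root}
    while queue and len(draft) < max_nodes:
        tok = queue.popleft()
        for nxt in adj.get(tok, []):
            if nxt not in seen:
                seen.add(nxt)
                draft.append(nxt)   # token proposed for verification
                queue.append(nxt)
    return draft

adj = {1: [5, 9], 5: [7], 9: [2, 5], 7: [3]}   # token id -> candidate next tokens
print(build_draft_tree(adj, root=1))
```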
arXiv Detail & Related papers (2024-08-16T12:20:56Z)
- Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks [53.550782959908524]
We introduce a new configuration for encoder-decoder models that improves efficiency on structured output and decomposable tasks.
Our method, prompt-in-decoder (PiD), encodes the input once and decodes the output in parallel, boosting both training and inference efficiency.
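A minimal structural sketch of the encode-once / decode-per-subtask idea; `encode` and `decode` are hypothetical stand-ins, not the paper's API, and in practice the subtask decodes would be batched rather than looped.

```python
import torch

def decode_decomposed(encode, decode, document: str, subtasks: list[str]):
    """Encode the shared document once, then decode each subtask prompt
    against the same encoder states (shown sequentially for clarity)."""
    enc_states = encode(document)   # computed once, reused for every subtask
    return [decode(enc_states, prompt) for prompt in subtasks]

# Toy stand-ins so the sketch runs end to end.
encode = lambda doc: torch.randn(len(doc.split()), 8)
decode = lambda states, prompt: f"{prompt}: uses {states.shape[0]} encoded tokens"
print(decode_decomposed(encode, decode, "a long shared document", ["task A", "task B"]))
```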
arXiv Detail & Related papers (2024-03-19T19:27:23Z)
- LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression [43.048684907893104]
This paper focuses on task-agnostic prompt compression for better generalizability and efficiency.
We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one.
Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT.
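A toy rendering of the token-classification formulation: every token gets a keep/drop label and the compressed prompt is the subsequence of kept tokens. The paper uses trained XLM-RoBERTa-large or mBERT encoders; the random embedding + linear head here is untrained, so the kept set is arbitrary.

```python
import torch
from torch import nn

vocab = {"please": 0, "kindly": 1, "summarize": 2, "the": 3, "following": 4, "report": 5}
# Stand-in classifier: embedding + linear head producing per-token drop/keep scores.
classifier = nn.Sequential(nn.Embedding(len(vocab), 16), nn.Linear(16, 2))

tokens = ["please", "kindly", "summarize", "the", "following", "report"]
ids = torch.tensor([vocab[t] for t in tokens])
logits = classifier(ids)                  # (seq, 2)
keep = logits.argmax(dim=-1).bool()       # predicted keep/drop per token
print([t for t, k in zip(tokens, keep) if k])
```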
arXiv Detail & Related papers (2024-03-19T17:59:56Z)
- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference [1.9639467358416092]
Transformers have emerged as the backbone of large language models (LLMs).
We propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time.
We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU.
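A sketch of the append-or-merge choice behind online KV cache compression: at each step the new key/value pair either extends the cache or is folded into the most recent slot. DMC learns this decision per head and uses weighted accumulation; the cosine-similarity threshold and plain averaging below are assumptions for illustration.

```python
import torch

def update_cache(keys: torch.Tensor, values: torch.Tensor,
                 k_new: torch.Tensor, v_new: torch.Tensor,
                 merge_threshold: float = 0.9):
    """Either append the new KV entry or merge it into the last cache slot."""
    sim = torch.cosine_similarity(keys[-1], k_new, dim=0)
    if sim > merge_threshold:                      # merge: cache does not grow
        keys[-1] = (keys[-1] + k_new) / 2
        values[-1] = (values[-1] + v_new) / 2
        return keys, values
    return torch.cat([keys, k_new[None]]), torch.cat([values, v_new[None]])

keys, values = torch.randn(4, 64), torch.randn(4, 64)
keys, values = update_cache(keys, values, torch.randn(64), torch.randn(64))
print(keys.shape)
```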
arXiv Detail & Related papers (2024-03-14T17:59:26Z)
- Discrete Prompt Compression with Reinforcement Learning [2.664293070994717]
Compressed prompts aid instruction-tuned language models (LMs) in overcoming context window limitations and reducing computational costs.
Existing methods, which are primarily based on training embeddings, face various challenges: limited interpretability, a fixed number of embedding tokens, poor reusability across different LMs, and inapplicability when interacting with black-box APIs.
This study proposes prompt compression with reinforcement learning (PCRL), which is a discrete prompt compression method that addresses these issues.
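A toy REINFORCE loop over per-token keep/drop decisions, which is the general shape of RL-based discrete prompt compression such as PCRL (and TACO-RL above). The token features and the reward are placeholders; in the paper the reward is computed from the LM's behavior on the compressed prompt.

```python
import torch

torch.manual_seed(0)
num_tokens, dim = 12, 16
token_feats = torch.randn(num_tokens, dim)     # stand-in per-token features
policy = torch.nn.Linear(dim, 1)               # keep-probability per token
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reward(keep: torch.Tensor) -> torch.Tensor:
    # Placeholder reward: favor dropping roughly half the tokens.
    return -(keep.float().mean() - 0.5).abs()

for step in range(200):
    probs = torch.sigmoid(policy(token_feats)).squeeze(-1)
    dist = torch.distributions.Bernoulli(probs)
    keep = dist.sample()                       # discrete keep/drop decisions
    loss = -(dist.log_prob(keep).sum() * reward(keep))   # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final keep ratio:", torch.sigmoid(policy(token_feats)).mean().item())
```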
arXiv Detail & Related papers (2023-08-17T03:10:17Z)
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method where we expose the compressed model to the prompt learning process.
Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
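A minimal soft-prompt setup matching the summary above: the compressed model stays frozen and only prompt embeddings prepended to every input are trained. The single linear layer standing in for the compressed LLM and the placeholder loss are illustrative assumptions.

```python
import torch
from torch import nn

dim, prompt_len = 32, 4
compressed_model = nn.Linear(dim, dim)          # stand-in for a compressed LLM layer
for p in compressed_model.parameters():
    p.requires_grad_(False)                     # the compressed weights stay frozen

soft_prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=1e-3)  # only the prompt is trained

x = torch.randn(6, dim)                         # token embeddings of one input
inputs = torch.cat([soft_prompt, x], dim=0)     # prepend the learnable prompt
loss = compressed_model(inputs).pow(2).mean()   # placeholder objective
loss.backward()
opt.step()
print(soft_prompt.grad is not None, inputs.shape)
```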
arXiv Detail & Related papers (2023-05-17T20:45:13Z)
- PERFECT: Prompt-free and Efficient Few-shot Learning with Language Models [67.3725459417758]
PERFECT is a simple and efficient method for few-shot fine-tuning of PLMs without relying on any such handcrafting.
We show that manually engineered task prompts can be replaced with task-specific adapters that enable sample-efficient fine-tuning.
Experiments on a wide range of few-shot NLP tasks demonstrate that PERFECT, while being simple and efficient, also outperforms existing state-of-the-art few-shot learning methods.
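A standard bottleneck adapter of the kind the summary refers to: a small down-project / nonlinearity / up-project block with a residual connection, trained per task while the PLM stays frozen. The dimensions and placement are illustrative, not PERFECT's exact configuration.

```python
import torch
from torch import nn

class Adapter(nn.Module):
    """Bottleneck adapter applied to a frozen layer's output."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen layer's output intact.
        return hidden + self.up(torch.relu(self.down(hidden)))

hidden = torch.randn(2, 10, 768)   # (batch, seq, dim) from a frozen PLM layer
print(Adapter(768)(hidden).shape)
```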
arXiv Detail & Related papers (2022-04-03T22:31:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.