SelfCP: Compressing Over-Limit Prompt via the Frozen Large Language Model Itself
- URL: http://arxiv.org/abs/2405.17052v2
- Date: Tue, 18 Jun 2024 06:50:30 GMT
- Title: SelfCP: Compressing Over-Limit Prompt via the Frozen Large Language Model Itself
- Authors: Jun Gao, Ziqiang Cao, Wenjie Li
- Abstract summary: Long prompts lead to huge hardware costs when using Large Language Models.
This paper proposes a Self-Compressor (SelfCP) to compress over-limit prompts into dense vectors while keeping the allowed prompts unmodified.
We show that SelfCP effectively substitutes 12$\times$ over-limit prompts with dense tokens to reduce memory costs and boost inference throughput.
- Score: 14.545490629324295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long prompts lead to huge hardware costs when using transformer-based Large Language Models (LLMs). Unfortunately, many tasks, such as summarization, inevitably involve long documents, and the wide application of in-context learning easily makes prompt lengths explode. This paper proposes a Self-Compressor (SelfCP), which employs the target LLM itself to compress over-limit prompts into dense vectors while keeping the allowed prompts unmodified. The dense vectors are then projected into dense tokens via a learnable connector so that the same LLM can understand them without additional burden. The connector is supervised-tuned under the LLM's language modeling objective on relatively long texts selected from publicly accessible datasets, including an instruction dataset so that SelfCP responds to various prompts, while the target LLM stays frozen during training. We build the lightweight SelfCP upon 2 different backbones with merely 17M learnable parameters originating from the connector and a learnable embedding. Evaluation on both English and Chinese benchmarks demonstrates that SelfCP effectively substitutes 12$\times$ over-limit prompts with dense tokens, reducing memory costs and boosting inference throughput while improving response quality. This performance makes SelfCP an efficient solution for LLMs to tackle long prompts without training from scratch.
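The abstract describes the pipeline but ships no code; below is a minimal PyTorch sketch of that pipeline under stated assumptions: a tiny Transformer stands in for the frozen backbone, the over-limit span is compressed by appending learnable memory embeddings and keeping their final hidden states, and a linear connector projects those vectors into dense tokens prepended to the allowed prompt. The memory-token scheme and all sizes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the SelfCP idea (not the authors' code): a frozen backbone
# compresses the over-limit half of a prompt into a few dense tokens that are
# fed back to the same frozen backbone alongside the unmodified allowed prompt.
import torch
import torch.nn as nn

VOCAB, D_MODEL, N_MEM = 1000, 256, 8   # toy sizes (assumptions)

class TinyFrozenLM(nn.Module):
    """Stand-in for the frozen backbone LLM."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, inputs_embeds):
        h = self.blocks(inputs_embeds)          # hidden states
        return h, self.lm_head(h)               # (hidden, logits)

class SelfCPSketch(nn.Module):
    def __init__(self, llm: TinyFrozenLM):
        super().__init__()
        self.llm = llm
        for p in self.llm.parameters():         # the backbone stays frozen
            p.requires_grad_(False)
        # the only trainable pieces: memory embeddings + a linear connector
        self.mem_embed = nn.Parameter(torch.randn(N_MEM, D_MODEL) * 0.02)
        self.connector = nn.Linear(D_MODEL, D_MODEL)

    def compress(self, over_limit_ids):
        """Run the frozen LM over the over-limit span plus memory tokens and
        keep the hidden states at the memory positions as dense vectors."""
        x = self.llm.embed(over_limit_ids)                       # (B, L, D)
        mem = self.mem_embed.unsqueeze(0).expand(x.size(0), -1, -1)
        h, _ = self.llm(torch.cat([x, mem], dim=1))
        dense = h[:, -N_MEM:, :]                                 # (B, N_MEM, D)
        return self.connector(dense)                             # dense tokens

    def forward(self, over_limit_ids, allowed_ids):
        dense_tokens = self.compress(over_limit_ids)
        allowed = self.llm.embed(allowed_ids)
        _, logits = self.llm(torch.cat([dense_tokens, allowed], dim=1))
        return logits                                            # LM logits

llm = TinyFrozenLM()
model = SelfCPSketch(llm)
logits = model(torch.randint(0, VOCAB, (2, 96)), torch.randint(0, VOCAB, (2, 32)))
print(logits.shape)  # (2, N_MEM + 32, VOCAB)
```

Only `mem_embed` and `connector` receive gradients here, which mirrors in spirit the abstract's claim that the 17M trainable parameters come from the connector and a learnable embedding while the backbone stays frozen.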
Related papers
- InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU [48.105361428245736]
We introduce InfiniteHiP, an inference framework for large language models (LLMs).
We dynamically eliminate irrelevant context tokens through a modular hierarchical token pruning algorithm.
Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training.
arXiv Detail & Related papers (2025-02-13T02:52:01Z)
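The summary names a modular hierarchical token pruning algorithm without detailing it; as a hedged illustration of what hierarchical pruning over a key/value cache can look like in general, the toy below scores coarse blocks of cached keys against the current query, keeps the top blocks, then keeps the top tokens inside them before attending. The two-stage scheme, block size, and budgets are assumptions, not the paper's algorithm.

```python
# Toy sketch of hierarchical token pruning (not InfiniteHiP's implementation):
# coarse block scoring, then fine token selection, then attention over survivors.
import torch

def pruned_attention(q, K, V, block=64, top_blocks=4, top_tokens=128):
    # q: (d,), K/V: (n, d); attention output over surviving tokens only
    n, d = K.shape
    n_blocks = (n + block - 1) // block
    pad = n_blocks * block - n
    K_pad = torch.cat([K, torch.zeros(pad, d)], dim=0).view(n_blocks, block, d)
    # stage 1: coarse scores via block-mean keys
    block_scores = K_pad.mean(dim=1) @ q                      # (n_blocks,)
    keep_blocks = block_scores.topk(min(top_blocks, n_blocks)).indices
    idx = (keep_blocks[:, None] * block + torch.arange(block)).flatten()
    idx = idx[idx < n]                                        # drop padding slots
    # stage 2: fine scores on surviving tokens
    tok_scores = K[idx] @ q
    keep = idx[tok_scores.topk(min(top_tokens, idx.numel())).indices]
    attn = torch.softmax(K[keep] @ q / d**0.5, dim=0)
    return attn @ V[keep]

q = torch.randn(64)
K, V = torch.randn(100_000, 64), torch.randn(100_000, 64)
print(pruned_attention(q, K, V).shape)  # torch.Size([64])
```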
- LLM-AutoDiff: Auto-Differentiate Any LLM Workflow [58.56731133392544]
We introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering (APE).
LLM-AutoDiff treats each textual input as a trainable parameter and uses a frozen backward engine to generate feedback akin to textual gradients.
It consistently outperforms existing textual gradient baselines in both accuracy and training cost.
arXiv Detail & Related papers (2025-01-28T03:18:48Z)
- PDL: A Declarative Prompt Programming Language [1.715270928578365]
This paper introduces the Prompt Declaration Language (PDL).
PDL is a simple, declarative, data-oriented language based on YAML that puts prompts at the forefront.
It supports writing interactive applications that call large language models (LLMs) and tools, and makes it easy to implement common use-cases such as chatbots, RAG, or agents.
arXiv Detail & Related papers (2024-10-24T20:07:08Z)
- Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens [21.61634020256455]
Transformer-based large language models (LLMs) suffer a performance degradation when modeling long-term contexts.
We propose a simple yet effective method to enable LLMs to take a deep breath, encouraging them to summarize information contained within discrete text chunks.
arXiv Detail & Related papers (2024-06-16T15:50:10Z)
- SirLLM: Streaming Infinite Retentive LLM [74.40196814292426]
As Large Language Models (LLMs) become increasingly prevalent, their ability to process inputs of any length and maintain a degree of memory becomes essential.
Recent efforts have employed streaming inputs to alleviate the pressure of excessively long text inputs.
We introduce Streaming Infinite Retentive LLM (SirLLM), which allows LLMs to maintain longer memory during infinite-length dialogues.
arXiv Detail & Related papers (2024-05-21T06:37:03Z)
- Learning to Compress Prompt in Natural Language Formats [54.06967020905763]
Large language models (LLMs) excel at a wide range of natural language processing tasks.
However, they are constrained by degraded performance on long contexts, slow inference speed, and the high cost of computation.
This work aims to compress lengthy prompts into natural language formats that transfer across LLMs.
arXiv Detail & Related papers (2024-02-28T20:41:21Z)
- MemoryPrompt: A Light Wrapper to Improve Context Tracking in Pre-trained Language Models [10.783764497590473]
Transformer-based language models (LMs) track contextual information through large, hard-coded input windows.
We introduce MemoryPrompt, a leaner approach in which the LM is complemented by a small auxiliary recurrent network that passes information to the LM by prefixing its regular input with a sequence of vectors.
Tested on a task designed to probe an LM's ability to keep track of multiple fact updates, a MemoryPrompt-augmented LM outperforms much larger LMs that have access to the full input history.
arXiv Detail & Related papers (2024-02-23T11:30:39Z)
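One plausible reading of the MemoryPrompt summary, sketched below with assumed dimensions: a small recurrent network carries state across input segments and emits a handful of vectors that are prefixed to the frozen LM's regular input embeddings. The mean-pooled segment summary and the GRU cell are assumptions rather than the authors' wrapper.

```python
# Rough sketch of the MemoryPrompt idea (an interpretation, not the released code):
# an auxiliary recurrent net carries state across segments and emits prefix vectors.
import torch
import torch.nn as nn

D_LM, D_MEM, N_PREFIX = 256, 128, 4

class MemoryPrompter(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(D_LM, D_MEM)            # small auxiliary recurrent net
        self.to_prefix = nn.Linear(D_MEM, N_PREFIX * D_LM)

    def forward(self, segment_embeds, h):
        # segment_embeds: (B, L, D_LM); h: (B, D_MEM) carried across segments
        pooled = segment_embeds.mean(dim=1)           # crude segment summary
        h = self.rnn(pooled, h)
        prefix = self.to_prefix(h).view(-1, N_PREFIX, D_LM)
        return prefix, h

prompter = MemoryPrompter()
h = torch.zeros(2, D_MEM)
for step in range(3):                                  # stream of segments
    seg = torch.randn(2, 32, D_LM)                     # stand-in LM input embeddings
    prefix, h = prompter(seg, h)
    lm_input = torch.cat([prefix, seg], dim=1)         # what the frozen LM would see
    print(step, lm_input.shape)                        # (2, 4 + 32, 256)
```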
- InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [93.20588235940453]
In this paper, we introduce a training-free memory-based method, InfLLM.
InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to look up token-relevant units for attention.
Even when the sequence length is scaled to $1,024$K, InfLLM still effectively captures long-distance dependencies.
arXiv Detail & Related papers (2024-02-07T06:50:42Z)
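As a toy illustration of the memory-unit lookup described in the InfLLM summary, the snippet below groups distant key/value pairs into fixed-size memory units, retrieves the units most relevant to the current query via representative keys, and attends over them together with a local window. Unit size, the mean-key representative, and the retrieval budget are assumptions, not the paper's design.

```python
# Toy illustration of memory-unit retrieval (not InfLLM's implementation).
import torch

UNIT, LOCAL, TOP_UNITS, D = 128, 256, 2, 64

def memory_lookup_step(q, K, V):
    local_K, local_V = K[-LOCAL:], V[-LOCAL:]            # recent tokens stay resident
    distant_K, distant_V = K[:-LOCAL], V[:-LOCAL]        # the rest becomes "memory"
    m = (distant_K.shape[0] // UNIT) * UNIT
    units_K = distant_K[:m].view(-1, UNIT, D)            # fixed-size memory units
    units_V = distant_V[:m].view(-1, UNIT, D)
    reps = units_K.mean(dim=1)                           # one representative key per unit
    picked = (reps @ q).topk(min(TOP_UNITS, reps.shape[0])).indices
    ret_K = units_K[picked].reshape(-1, D)               # retrieved units
    ret_V = units_V[picked].reshape(-1, D)
    Ks = torch.cat([ret_K, local_K]); Vs = torch.cat([ret_V, local_V])
    attn = torch.softmax(Ks @ q / D**0.5, dim=0)
    return attn @ Vs

q = torch.randn(D)
K, V = torch.randn(20_000, D), torch.randn(20_000, D)
print(memory_lookup_step(q, K, V).shape)  # torch.Size([64])
```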
- Prompt Highlighter: Interactive Control for Multi-Modal LLMs [50.830448437285355]
This study targets a critical aspect of multi-modal LLMs' (LLMs & VLMs) inference: explicit controllable text generation.
We introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation.
We find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs.
arXiv Detail & Related papers (2023-12-07T13:53:29Z)
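One way to read "guiding the models with highlighted tokens through the attention weights" is as an additive bias on attention scores at user-highlighted positions; the sketch below implements that reading for a single attention step. The bias form and strength are assumptions, not the released method.

```python
# Sketch of attention-level highlighting (an interpretation, not Prompt Highlighter itself).
import torch

def highlighted_attention(q, K, V, highlight_mask, bias=2.0):
    # q: (d,), K/V: (n, d), highlight_mask: (n,) bool over prompt positions
    scores = K @ q / K.shape[1] ** 0.5
    scores = scores + bias * highlight_mask.float()   # up-weight highlighted spans
    attn = torch.softmax(scores, dim=0)
    return attn @ V, attn

n, d = 16, 32
q, K, V = torch.randn(d), torch.randn(n, d), torch.randn(n, d)
mask = torch.zeros(n, dtype=torch.bool)
mask[4:8] = True                                      # user-highlighted span
out, attn = highlighted_attention(q, K, V, mask)
print(attn[4:8].sum().item(), attn.sum().item())      # most mass on the span; total is 1.0
```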
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
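To make "coupled structures" concrete, the toy below jointly removes hidden channels from a two-layer MLP, since each channel couples a row of the first projection with a column of the second. The magnitude-based importance score is only a stand-in for the method's actual importance criterion.

```python
# Toy coupled structural pruning (a simplification, not LLM-Pruner's algorithm).
import torch
import torch.nn as nn

def prune_mlp(fc1: nn.Linear, fc2: nn.Linear, keep_ratio=0.5):
    # importance of each hidden channel = norm of its coupled weights
    imp = fc1.weight.norm(dim=1) + fc2.weight.norm(dim=0)   # (hidden,)
    keep = imp.topk(int(keep_ratio * imp.numel())).indices.sort().values
    new_fc1 = nn.Linear(fc1.in_features, keep.numel())
    new_fc2 = nn.Linear(keep.numel(), fc2.out_features)
    with torch.no_grad():                                    # copy surviving weights
        new_fc1.weight.copy_(fc1.weight[keep]); new_fc1.bias.copy_(fc1.bias[keep])
        new_fc2.weight.copy_(fc2.weight[:, keep]); new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

fc1, fc2 = nn.Linear(64, 256), nn.Linear(256, 64)
p1, p2 = prune_mlp(fc1, fc2)
x = torch.randn(4, 64)
print(p2(torch.relu(p1(x))).shape, p1.out_features)  # torch.Size([4, 64]) 128
```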
- P-Adapters: Robustly Extracting Factual Information from Language Models with Diverse Prompts [7.657992756210283]
We introduce P-Adapters: lightweight models that sit between the embedding layer and first attention layer of Large Language Models.
They take LLM embeddings as input and output continuous prompts that are used to query the LLM.
They show a 12-26% absolute improvement in consistency and a 36-50% absolute improvement in precision over a baseline that uses only natural language queries.
arXiv Detail & Related papers (2021-10-14T11:32:22Z)
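Reading the P-Adapters summary literally, a lightweight module sits after the frozen embedding layer and rewrites the input embeddings into a continuous prompt that is handed to the first attention layer; the residual MLP below is one minimal sketch of that idea, with invented sizes.

```python
# Sketch of a P-Adapter-style module (an interpretation, not the original code).
import torch
import torch.nn as nn

D = 256

class PAdapterSketch(nn.Module):
    def __init__(self, d=D, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d))

    def forward(self, embeds):
        # embeds: (B, L, d) output of the frozen embedding layer
        return embeds + self.mlp(embeds)   # continuous prompt passed to the first attention layer

adapter = PAdapterSketch()
query_embeds = torch.randn(2, 12, D)       # embeddings of a natural language query
continuous_prompt = adapter(query_embeds)
print(continuous_prompt.shape)             # same shape, rewritten representation
```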